The Global Alliance for Genomics and Health (GA4GH) creates a common framework that enables responsible, voluntary and secure sharing of clinical and genomic data. A perspective from its Clinical Working Group Cancer Task Team, published in Nature Medicine, highlights the challenges that a global clinical and genomic data-sharing approach presents, suggests potential solutions and describes key initiatives to foster these activities in the molecular-profiling landscape.
The sharing of aggregated data has become a substantial rate-limiting step in the development of cancer prevention and treatment strategies. Although the importance of open access to genomic information is clearly recognised, multiple technical and logistical barriers to effective data sharing persist, including data non-comparability, coding heterogeneity, difficulties in the storage and transfer of large data sets and non-standardised bioinformatics analyses. Additionally, regulatory, legal and ethical processes are not designed for global data sharing and thus require urgent attention.
Genomic data in oncology: challenges
In an effort to resolve the lack of standardised phenotypic-variation descriptors in malignancy, particularly in the genomic era, a task team of the GA4GH Clinical Working Group is developing approaches to support the alignment of and mapping across ontologies in cancer by leveraging specialist resources, such as those provided by the National Cancer Institute Thesaurus and the Human Phenotype Ontology.
Genomic-data sharing in cancer research has been successful within large consortia such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). Databases such as the Cancer Genomics Hub (CGHub), the European Genome–Phenome Archive, and the ICGC data portal provide cancer genomics data to researchers at a rate of multiple petabytes per month—representing the largest exchange of genomic information in any area of research. New databases, such as the Genomic Data Commons (GDC) of the NCI (USA) and the 100,000 Genomes Project (UK), are being constructed. However, these systems are not designed to handle data generated on a scale of millions of samples, as is anticipated with widespread clinical application of next-generation sequencing (NGS). This is an entirely new data-engineering challenge.
Most cancer-genomics data generated by clinical applications are held separately in silos by different medical institutions or by their contractors. This makes aggregated data analysis more difficult. Because the data sets are large, now consisting of multiple petabytes, a simple transmission of genome-sequencing data between geographically remote repositories is increasingly infeasible.
The aggregation problem is made more difficult by substantial heterogeneity in the procedures for data collection, storage and representation. Problems of data size can be overcome by sharing only the mutation and gene-expression information from the clinical samples, and not the raw data produced by sequencing machines. However, a lack of consensus in the mutation-calling process, in the methods for gene-expression quantification and even in the data formats used to express this information is hampering current aggregation efforts. Non-standardised ad hoc functional annotation and a lack of consensus between institutions on the clinical importance of genomic variants further limit the universal applicability of NGS data for guiding improvements in patient care. In particular, there are no widely accepted definitions of driver mutations in cancer or of “clinically actionable” results. This represents a serious barrier to the integration of clinical genomics into healthcare delivery.
Genomic data in oncology: potential solutions
In response to these diverse challenges, collaborative efforts, including GA4GH-enabled initiatives, have proposed and/or are implementing multiple solutions. Recognising the need to link diverse genomic-data repositories, the GA4GH Data Working Group, in collaboration with initiatives such as the US National Institutes of Health (NIH) Big Data to Knowledge (BD2K) Center in Translational Genomics, is pioneering new standards and methods for sharing genomic information. They are developing a universal application-programming interface (API) that will facilitate the creation of a global, cohesive genome-informatics ecosystem, which would maximise data sharing at scale. Specific task teams within the GA4GH Data Working Group are implementing particular functionalities in the GA4GH API to enable expressive and universal representation of genetic variation; gene and transcript expression; annotation of genomic features; and relationships between genotype and phenotype.
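The paged search semantics that such an API exposes can be sketched as follows. This is a minimal illustrative model, not the GA4GH implementation: the field names echo concepts from the GA4GH variant-search schema (reference name, start, end, page token), but the in-memory store, the helper function and the variant positions are purely hypothetical.

```python
# Sketch of paged, GA4GH-style variant search over an in-memory store.
# The store and positions below are illustrative placeholders only.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Variant:
    reference_name: str    # chromosome name, e.g. "7"
    start: int             # 0-based start position
    reference_bases: str
    alternate_bases: str

STORE = [
    Variant("7", 55_259_514, "T", "G"),    # illustrative chr7 variant
    Variant("7", 140_453_135, "A", "T"),   # illustrative chr7 variant
    Variant("12", 25_398_284, "C", "A"),   # illustrative chr12 variant
]

def search_variants(reference_name: str, start: int, end: int,
                    page_size: int = 1,
                    page_token: Optional[str] = None
                    ) -> Tuple[List[Variant], Optional[str]]:
    """Return one page of matching variants and a token for the next page."""
    matches = [v for v in STORE
               if v.reference_name == reference_name and start <= v.start < end]
    offset = int(page_token) if page_token else 0
    page = matches[offset:offset + page_size]
    more = offset + page_size < len(matches)
    return page, (str(offset + page_size) if more else None)

# A client walks every page, as a federated consumer would.
results, token = search_variants("7", 0, 200_000_000, page_size=1)
while token:
    page, token = search_variants("7", 0, 200_000_000,
                                  page_size=1, page_token=token)
    results.extend(page)
```

Paging of this kind is what lets a client stream results from a repository of arbitrary size without ever requesting the whole data set in one response.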
Furthermore, to facilitate the harmonisation of mutation-calling procedures among institutions, the ICGC and TCGA invited the cancer-genomics and bioinformatics communities to work together to identify the best pipelines for the detection of mutations in DNA-sequencing reads for cancer genomes. This has led to the establishment of the ICGC–TCGA Dialogue for Reverse Engineering Assessments and Methods (DREAM) somatic mutation calling challenge ('the SMC–DNA meta-pipeline challenge'), a crowd-sourced benchmark of somatic-mutation detection algorithms. The Benchmarking Task Team of the GA4GH Data Working Group is working closely with the DREAM teams to identify the most effective algorithms for widespread use by the scientific and clinical community. The need to identify and share the best data-analysis pipelines has also stimulated considerable work on so-called containerized computation in genomics. In this approach, the code that executes the programmes for data processing, analysis and interpretation is packaged together with its dependencies, making exchange between different institutions and different computing environments far easier than was previously possible. Because a single type of packaging wraps both data and programmes, the same analysis can be run on any computing system. The Containers and Workflows Task Team of the GA4GH Data Working Group is devoted to this area. Furthermore, containerized code that has been battle-tested in the DREAM challenges and by large consortium efforts is now being applied to clinical NGS analysis in cancer—a strategy proposed by groups such as the next-generation sequencing standardization of clinical testing (Nex-StoCT) II informatics work group for the analysis of germline variations in disease.
With regard to the lack of consensus on what constitutes an “actionable” mutation, GA4GH is driving the Actionable Cancer Genome Initiative (ACGI). The main goals of the ACGI are to identify a list of “actionable” genes in different cancers with canonical targetable mutations, as well as rare variants of uncertain importance, and to aggregate data related to these aberrations—their evidence-based curated actionability calls and phenotypic or clinical information (including longitudinal data)—in a searchable format to enhance patient care.
One question that members of the GA4GH have considered in detail is whether the world's genomic and clinical information will reside in a single physical database, or whether it will be made available through a federated network that spans a series of interlinked data repositories in many countries. Although both approaches have their supporters, GA4GH is investigating how a federated model (which may involve a relatively small number of large databases) could be organised to fulfil data-warehousing requirements, while supporting improved data access for data consumers. In a federated system, some data are likely to be on commercial clouds (for example, Amazon, Google, Microsoft or one of the 30 cloud providers in the Helix Nebula Marketplace associated with Europe's Helix Nebula Project) and the rest on government clouds, private clouds or other dedicated systems. The recent decision by the NIH to allow private and commercial cloud-computing solutions to be applied to the storage and analysis of the vast genomic data housed in its repository—the database of Genotypes and Phenotypes (dbGaP)—is timely. The creation of a competitive market for such cloud solutions will enable secure and organised data storage at low cost, with sufficient elasticity to provide a dynamic platform that ensures rapid and efficient analysis of large data sets.
Optimised, interoperable technical standards are needed for the analysis of data that are distributed across multiple sites, as suggested above. Beyond the data-harmonisation challenges, there are also substantial technical challenges in ensuring coordinated version control; data uniqueness and integrity; location transparency; harmonised and efficient access procedures; and privacy and security, while maintaining compliance with institutional and legal regulations at the regional, national and international levels.
Furthermore, in conjunction with the GA4GH's API-based standardisation efforts, the Containers and Workflows Task Team of the GA4GH's Data Working Group is developing mechanisms that will allow a computational procedure to be ported to different institutions, where it can be run locally with reliably consistent results and minimal customisation required. This allows a single institution to perform complex analysis of large data sets at remote sites. On the other hand, when only smaller data items are needed from a remote site, these can be obtained with a simple Internet query, again by using the API. A mechanism for queries of this type is in development by the GA4GH Beacon Project.
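A query of the kind the Beacon Project envisages can be sketched as a simple presence/absence lookup. The sketch below is an assumption-laden toy, not the Beacon API itself: the local variant set and the function name are hypothetical, and it illustrates only the core idea that a node answers "is this allele present here?" without exposing genotypes or sample-level data.

```python
# Toy sketch of a Beacon-style allele-presence query against one node's
# local variant set. The set contents below are hypothetical.
LOCAL_VARIANTS = {
    # (reference_name, 0-based position, ref base(s), alt base(s))
    ("17", 7_577_120, "C", "T"),    # illustrative chr17 allele
    ("13", 32_913_558, "A", "G"),   # illustrative chr13 allele
}

def beacon_query(reference_name: str, start: int,
                 reference_bases: str, alternate_bases: str) -> bool:
    """Answer only yes/no: is this exact allele present at this node?

    No genotypes, counts or sample identifiers leave the node, which is
    what keeps the query lightweight and privacy-preserving."""
    return (reference_name, start,
            reference_bases, alternate_bases) in LOCAL_VARIANTS

# A federated client would pose the same question to many beacons and
# combine the boolean answers.
hit = beacon_query("17", 7_577_120, "C", "T")
miss = beacon_query("1", 100, "A", "T")
```

Because each node returns only a boolean, the same question can be broadcast across a federation of repositories without moving any primary data.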
Even if technical challenges are addressed, global data sharing will require a marked shift in the conventional ethics framework, especially given diversity in countries' legal and regulatory requirements.