Making Data More Accessible for Bioinformatics Training

David Wesley Craig, Ph.D. Apr 1, 2018

Published in The Scientist

Precision medicine is founded on the premise of individualized medical decisions, practices, and treatments tailored to the unique genetic, epigenetic, proteomic, and clinical profiles of patients. Powered by next-generation sequencing technologies, the past five years have seen a burgeoning of patient data; just one of Illumina’s NovaSeq machines, running two to three times a week, could conceivably generate a half trillion bases of sequencing data per year.

Yet for all the data that science can produce, the field sorely lacks the brainpower to analyze that information so it can be put to use. In particular, what is missing are master’s-level scientists who could fill the massive skills gap that limits the field’s ability to make new biomedical discoveries and translate them from the laboratory to the bedside.

Take, for example, those half-trillion bases at our disposal. Excel cannot open more than about 1 million rows (1,048,576 per worksheet), and that tried-and-true spreadsheet software is the technological limit of many newly minted PhDs and postdocs when it comes to analyzing data. Or researchers may wish to merge public data with their own, a rather rudimentary task that can nonetheless be challenging for experimentalists, most of whom are not trained in command-line environments. What happens, I have observed, is that trainees emerge from their studies able to differentiate complex calculus, but unable to complete the most basic biomedical data analyses.
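To give a sense of how small that skills gap can be in practice, here is a minimal sketch of the two tasks just described: checking whether a table would exceed a spreadsheet's row limit, and merging a lab's own results with a public annotation table. It uses only the Python standard library. The file contents, the column names (`variant_id`, `sample`, `gene`), and the function names are hypothetical placeholders chosen for illustration, not data or tools from any real study.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet in current versions of Excel


def count_rows(handle):
    """Count data rows by streaming, so the file never has to fit in memory."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    return sum(1 for _ in reader)


def merge_with_public(own_handle, public_handle, key="variant_id"):
    """Left-join the lab's rows to a public table on a shared key column.

    The public table is assumed small enough to hold in memory; the lab's
    own table is streamed row by row. Unmatched rows get empty strings.
    """
    public_reader = csv.DictReader(public_handle, delimiter="\t")
    public = {row[key]: row for row in public_reader}
    extra_columns = [c for c in (public_reader.fieldnames or []) if c != key]
    for row in csv.DictReader(own_handle, delimiter="\t"):
        match = public.get(row[key], {})
        for column in extra_columns:
            row[column] = match.get(column, "")
        yield row


# Hypothetical toy tables standing in for real tab-separated files.
OWN = "variant_id\tsample\nv1\tS01\nv2\tS01\nv3\tS02\n"
PUBLIC = "variant_id\tgene\nv1\tGENE_A\nv3\tGENE_B\n"

n_rows = count_rows(io.StringIO(OWN))
too_big_for_excel = n_rows + 1 > EXCEL_MAX_ROWS  # +1 for the header row
merged = list(merge_with_public(io.StringIO(OWN), io.StringIO(PUBLIC)))
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Because the lab's table is read one row at a time, the same approach should work on files far larger than a spreadsheet can display.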

While there are programs out there aiming to build up a workforce of bioinformaticians, the lack of educational resources is limiting the breadth of their training.

Science needs more bioinformaticians

Discoveries don’t tend to emerge from large datasets without complex analysis. Thus, the bioinformatician has become one of the most valued members of laboratories across academia, healthcare, and industry. And nowhere is the need more acute than within biomedical research.

I have spent the past decade leading undergraduate and graduate research in a “damp laboratory” (a little bit dry lab and a little bit wet lab) at the Keck School of Medicine of the University of Southern California (USC). My group melds molecular biology and bioinformatics to develop platforms for personalized medicine and for next-generation sequencing data management, analysis, and clinical genomic interpretation across several fields, including cancer and rare diseases. Ideally, each member of the group would be able to form a hypothesis, conduct an experiment, and do basic analysis, so that insights can occur quickly, without their significance getting lost in translation. But it’s been difficult; I tried hiring PhDs in computer and data science, for instance, only to realize they lacked the tremendous value that comes from several years of experience at the bench.

What I learned is that it is much easier to teach a biologist command-line environments such as Bash and statistical scripting languages such as R than to give a computer scientist years of bench experience, and this can be accomplished in just a year or two with a handful of classes. These individuals understand the biological problems and how to apply the informatics solutions using or integrating existing tools. Such well-trained life scientists would be invaluable to any number of biomedical research labs. So why are these people so hard to come by?

With an estimated 183,000 life-science graduates competing for just 12,000 jobs in 2016, it’s a Darwinian struggle for survival as a bachelor’s-level biologist. Even those who are lucky enough to land lab jobs quickly reach a glass ceiling and might be better off as baristas, with the average salary for research assistants with bachelor’s degrees at about $30,647, according to Glassdoor. As has been reported often, there is also a glut of life-science PhDs and postdocs.

The challenge of training the bioinformatics workforce

At USC, we are attempting to address the problem through a new master’s degree program in translational biomedical informatics. One of its main objectives is to train those who are transitioning from the bench to the dry lab in academic, clinical, and pharmaceutical research settings. We want to provide students with practical and foundational skills in molecular biology, systems biology, structural biology, proteomics, genomic sequencing, and genomic tools and datasets. We hope they will leave the program able to implement, develop, and design analytical solutions for different health care applications, from prototyping to production. This also involves elements of project management, communication, and collaboration with computational and engineering colleagues.

However, we’ve come across an unexpected hurdle: a dearth of the data we need to train these students.

Aspiring healthcare bioinformaticians need to become familiar with the types of datasets they will be presented with in the labs in which they will work. But data from diseased patients are extremely hard to access for training purposes.

Studying samples from healthy controls without a disease phenotype is no substitute for data on real conditions from actual patients; it would be like learning anatomy without a cadaver. A cancer cell looks nothing like its healthy counterpart, and neither do its genomic data. There may be chromosome deletions and duplications, swapped regions, modifications to methylation and expression, or integration of viruses such as HPV.

One of the primary obstacles limiting access to such substantive data is consent. Most trial protocols are not broad enough to include educational use. They specify research, and many resources, such as dbGaP and NIH Commons, even limit data use to lab staff under direct supervision of a PI. There are exceptions, such as the Personal Genome Project (PGP), but they are few. The Texas Cancer Research Biobank (TCRB) Open Access Database is another promising example, where specific efforts are being made to obtain consent from individuals with the goal of PGP-type open access, but within the context of relevant disease tissues. We need more.

At USC’s Department of Translational Genomics, we make it a major priority, in studies where we seek consent from participants, to ensure that we will be able to teach students using real data from studies of disease. It starts with basic scientists thinking about this in advance, something that, fortuitously, I had been considering even before leading the master’s program at USC. For instance, we have been able to leverage our work publishing a melanoma line, COLO-829, as a potential standard reference line for cancer. The detailed data from an analysis of single nucleotide polymorphisms, indels, structural variants, copy number variations, and transcriptomics are now incredible resources for our students. Another example is a series of synthetic fusions developed to validate our clinical RNA-seq pipeline, which we published as an open-access resource for clinical validation. Now that I am at an educational institution, I’m thankful we put that out as a resource. Still, synthetic samples and a single cell line are only a starting point.

Much of the debate about data sharing has focused on the identifiability of genomic data and balancing privacy risks within the research community, leaving education as an afterthought. Let’s reframe the conversation.

Because of the acute need for bioinformaticians now, we have not been focusing on the future. But we cannot neglect the need to build better training programs, incorporating real-world case studies using real data. We need to share primary data for educational use, and create broader consent protocols.

In 2011, Eric Green, the director of the National Human Genome Research Institute, wrote: “It is time to get serious about genomics education for all health care professionals.”

It is time to get serious about providing the materials and the ability to train as well.

David W. Craig is Professor of Translational Genomics and Co-Director of the Institute of Translational Genomics at the Keck School of Medicine of the University of Southern California.