Please see the GitHub project for detailed running instructions. The project walks through getting set up with Docker and Google's DeepVariant, and includes a brief comparison of the results against the journal data. It also provides the data from the hypertension journal as a CSV, so it is easier to manipulate with traditional data analysis techniques.
What it does
- Parses GWAS meta-studies to collect their data into a more readable format.
- Instantiates a Docker image that produces DeepVariant output for a given reference and BAM file.
- Provides instructions for running DeepVariant using a fairly simple Docker image.
- Provides scripts for pulling example data (both chr20 data and HG002 data) that can be used in this analysis.
- Provides a brief data analysis and a notebook that can be used as a springboard for deeper analysis. The notebook consolidates both the DeepVariant output and the journal information into DataFrames that can be manipulated with more traditional data analysis techniques.
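For reference, the invocation follows the one-shot `run_deepvariant` entry point from the official DeepVariant Docker image. The version tag, mounted paths, and file names below are placeholders, not the project's exact values:

```shell
# Sketch of running DeepVariant via its official Docker image.
# BIN_VERSION, the mounted paths, and the file names are assumptions;
# substitute the reference FASTA and BAM file you actually pulled.
BIN_VERSION="1.6.0"
docker run \
  -v "${PWD}/data":"/input" \
  -v "${PWD}/output":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/GRCh38.fasta \
  --reads=/input/sample.bam \
  --regions=chr20 \
  --output_vcf=/output/sample.vcf.gz
```

The `--regions=chr20` flag restricts calling to a single chromosome, which is what makes the reduced-dataset workflow tractable on modest instances.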
The primary challenge is that the datasets are massive. A single genomic sequence is on the order of 110GB, and manipulating that with Docker overwhelmed most instances I tried. I ran on two instances from Digital Ocean and one from GCP, and none of them could provision enough storage for the files involved. More specifically, the Docker daemon copies the entire build context, resulting in builds that were hundreds of gigabytes in size, and it got too expensive to provision instances that could support these operations.
This means that most of the work was done on a reduced dataset, which was great for developing the process, but scaling this to full genomic datasets is still TODO.
Accomplishments that I'm proud of
- Actually running DeepVariant against a provisioned Docker image.
- Running the full pipeline end to end, as it could one day be used to do analysis for a real patient.
- The documentation for this project is, I'd like to think, fairly complete.
Brief look at the Results
On Google's reduced datasets, there were unfortunately no matches between the journal SNPs and what was in the resulting VCF. The analysis therefore wasn't particularly effective at predicting hypertension (beyond concluding that this sample chr20 data showed no indication that the individual would develop hypertension). Running on a full dataset would be awesome, but was a bit too resource-intensive for this hackathon.
How matching was handled
Since DeepVariant unfortunately does not emit SNP IDs, matches were judged by chromosome ID and position within the chromosome, to determine which SNPs lined up.
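As a minimal sketch of that matching logic (the field names and the journal record layout here are assumptions for illustration, not the project's exact schema):

```python
# Hypothetical sketch of chromosome + position matching between
# DeepVariant VCF output and journal SNPs. Keys like "rsid", "chrom",
# and "pos" are assumed names, not the project's actual columns.

def parse_vcf_positions(vcf_lines):
    """Collect (chrom, pos) pairs from VCF body lines, skipping headers."""
    positions = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        positions.add((fields[0], int(fields[1])))  # CHROM, POS
    return positions

def match_journal_snps(journal_snps, vcf_positions):
    """Return journal SNPs whose (chrom, pos) appears in the call set."""
    return [snp for snp in journal_snps
            if (snp["chrom"], snp["pos"]) in vcf_positions]

# Toy inputs to exercise the functions:
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr20\t14370\t.\tG\tA\t29\tPASS\t.",
]
journal = [
    {"rsid": "rs1234", "chrom": "chr20", "pos": 14370},
    {"rsid": "rs5678", "chrom": "chr11", "pos": 230500},
]
hits = match_journal_snps(journal, parse_vcf_positions(vcf))
```

With the toy inputs above, only the chr20 SNP survives the intersection; a full-genome VCF would simply yield a larger position set fed through the same two functions.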
What I learned
I learned a lot about bioinformatics, which I think is really cool. This was a great excuse to get my foot in the door of this universe, since it's definitely not something I would've otherwise done. Hopefully I'll have it in me to continue experimenting, since the quantity of data out there is incredible.
I also learned a lot about Docker's build process when building with massive datasets, and unfortunately the fact of the matter is that it is not efficient with large build contexts, yielding absolutely huge builds :(
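One common mitigation, sketched here under the assumption that the large files live in a `data/` directory (a hypothetical layout, as is the image tag): exclude them from the build context via `.dockerignore` and bind-mount them at run time, so the daemon never copies them into the build.

```shell
# Keep the ~110GB of genomic data out of the image build entirely.
# "data/" and the image tag are assumptions about the project layout.
echo "data/" >> .dockerignore            # daemon skips this dir when copying the build context
docker build -t deepvariant-challenge .  # build context (and image) stay small
docker run -v "${PWD}/data":/data deepvariant-challenge  # data mounted, not copied
```

The trade-off is that the image is no longer self-contained: anyone running it has to pull the data separately, which is what the example-data scripts above are for.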
What's next for deepvariant-challenge
There are a couple of directions this could take. The first is to run it on full genomic sequences. The 1000 Genomes Project provides a huge testbed for iterating on this technology: https://www.internationalgenome.org/data/. This would be a better litmus test of how effective the journal data is for assessing hypertension risk, as the current results come from a reduced dataset.
The other is to pipeline this more effectively. The current instructions involve a fairly large number of technical steps. The hope is that the Docker container built in this project could be deployed as a server, so that analysis could be kicked off by uploading a sample BAM file; at the end, the server would return the VCF files of interest along with some cursory analysis.
As an aside, I personally find this project super interesting so if this is something that Office Ally continues to work on I'd love to stay updated on how it progresses.
Try it out
docker, jupyter, python, shell