The use of big data in healthcare skyrocketed after the completion of the Human Genome Project in 2003. At the time, sequencing a genome cost around $100M; with new DNA sequencing technologies from companies like Illumina, Roche and Pacific Biosciences, it now costs less than $1,000. This has also led to a boom in direct-to-consumer genomics through companies like Ancestry and 23andMe.
With the massive amount of biological data being sequenced and accessed, this field is highly computationally intensive. As a result, access to the data and the ability to use it are restricted to corporations and organizations with the proper infrastructure, leaving out independent researchers and researchers stationed in remote or developing areas.
Our idea was to create a rapid and reliable method for disease surveillance that could be used on laptop computers by researchers and technicians stationed anywhere in the world. To that end, we plan to couple our clinical workflow with low-cost, handheld, rapid DNA sequencers like the MinION from Oxford Nanopore.
What it does
So that it can run locally on a user’s laptop, the coverage calculator algorithm has been packaged and tested by individual users on the laptops they regularly use. The Python code and its libraries are bundled into a 4.66MB desktop application that can be downloaded and run like any other desktop application. The application uses the operating system’s PowerShell and has been tested on laptops running Windows.
With our method, we demonstrated that we could quickly identify the presence of pathogenic bacteria in metagenomics samples. By constructing composite reference genomes and applying our coverage algorithm, we could confidently and rapidly detect pathogenic bacteria even at very low sample loads.
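To make the idea of a coverage value concrete, here is a minimal sketch of one common definition, breadth of coverage: the fraction of reference positions covered by at least one aligned read. This is an illustration only, not necessarily how Prediktr computes its coverage; the function name `breadth_of_coverage` and the `(start, end)` interval input are assumptions made for the example.

```python
def breadth_of_coverage(ref_length, alignments):
    """Return the fraction of reference positions covered by >= 1 read.

    ref_length: length of the (hypothetical) marker reference in bases.
    alignments: iterable of (start, end) pairs, 0-based, end-exclusive,
                giving where each read aligned on the reference.
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")

    covered = 0
    current_end = 0  # rightmost position counted so far
    for start, end in sorted(alignments):
        # Clip to the reference and skip the part already counted.
        start = max(start, current_end, 0)
        end = min(end, ref_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / ref_length


# Three reads on a 100-base reference; the first two overlap and the
# third runs past the end of the reference.
print(breadth_of_coverage(100, [(0, 30), (20, 50), (80, 120)]))
```

Sorting the intervals first means overlapping reads are counted once, so the result never exceeds 1.0.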
How I built it
I primarily used Python and Bash to clean and simulate data and to test the software, after enabling it to understand and work with genetic data (which is, at its simplest, a sequence of A, T, G and C bases).
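As an illustration of what simulating genetic data for testing can look like, the sketch below draws fixed-length reads from random positions in a reference sequence. It is a simplified stand-in rather than the simulation actually used in the project: the function name `simulate_reads` and its parameters are hypothetical, and it ignores sequencing errors and quality scores.

```python
import random


def simulate_reads(genome, read_length, n_reads, seed=0):
    """Sample n_reads substrings of read_length from genome.

    Returns a list of (start, read) pairs so a test can check each
    read against its true origin. A fixed seed keeps runs reproducible.
    """
    if read_length > len(genome):
        raise ValueError("read_length cannot exceed genome length")

    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((start, genome[start:start + read_length]))
    return reads


# Toy reference sequence, for demonstration only.
toy_genome = "ATGCGTACGTTAGCCGATAGCTAGGCTA"
for start, read in simulate_reads(toy_genome, read_length=8, n_reads=3):
    print(start, read)
```

Because the true start of every read is known, the output can be used to check that a downstream step reports the expected coverage.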
Challenges I ran into
The biggest challenge is the size of the data the program works with. We use next-generation sequencing data (currently oral and gut metagenomes from the Sequence Read Archive (SRA)). The main difficulty is serving that data to the application fast enough for it to be used in real time.
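One general way to keep memory use low with large sequencing files is to stream records one at a time instead of loading the whole file. The sketch below shows this for FASTQ input, assuming the common layout of exactly four lines per record; it is not taken from the project's code, and `iter_fastq` is a hypothetical helper name.

```python
from itertools import islice


def iter_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines.

    lines: any iterable of text lines, such as an open file handle.
    Assumes four lines per record (header, sequence, '+', quality).
    """
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break  # end of input, or a truncated final record
        header, sequence, _plus, quality = (s.rstrip("\n") for s in record)
        yield header[1:], sequence, quality  # drop the leading '@'


# Small in-memory example; with a real file, pass the open handle instead.
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
           "@read2\n", "GGCA\n", "+\n", "IIII\n"]
total_bases = sum(len(seq) for _, seq, _ in iter_fastq(example))
print(total_bases)
```

Because the function is a generator, only one record is held in memory at a time, which matters on a laptop with limited RAM.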
Accomplishments that I'm proud of
The coverage calculator algorithm has been shown to output a coverage value in anywhere from under a minute to 2-3 minutes, depending on the size of the marker file of the composite reference genome being tested against the metagenomics sample. Testing larger composite reference genomes, with more species or longer sequences, might take a few more minutes, but the method is still very fast compared to testing against thousands of individual pathogenic bacterial reference genomes from various species, genera and families. The coverage calculator program and its ancillary methods and helper algorithms for testing were all developed to run locally on laptop computers, and were themselves built on a Windows 10 laptop with 4GB of RAM.
What I learned
The project came with a steep learning curve: using, simulating and testing huge amounts of genetic and biological data, and making sure the results were reproducible. Biological systems behave differently from physical systems, and there are new challenges at every step.
What's next for Prediktr : A Big Data Based Disease Surveillance Tool
To test actual samples at remote locations, researchers or technicians would not need a complete laboratory setup, because the program can be coupled with low-cost sequencers like the handheld MinION sequencer from Oxford Nanopore Technologies. Handheld sequencers are just catching up to their more conventional counterparts, such as Illumina HiSeq sequencers. Because these handheld sequencers can generate long reads, many potential applications have emerged, including de novo assembly of new genomes and diagnosis of pathogenic genomes.
amazon-web-services, django, docker, flask, gcp, kubernetes, python, rest-api