Inspiration

Gen-Predict aims to reduce the time and resources required to determine the properties of a bacterial strain. When a new strain is discovered, biologists plate the sample on agar and run a battery of tests on it. Each test (e.g. antibiotic resistance, optimal temperature, relation to oxygen) requires its own plate and consumes significant resources and time. By sequencing an unknown bacterial sample and feeding the sequence into a trained model, scientists will be able to obtain time-critical information in a matter of minutes. Our method can not only predict many bacterial properties in minutes; the prediction itself can also be run without any lab equipment present, which reduces the cost and manpower required.

How I built it

To train the model, we used Google Cloud's ML Engine platform, which gave us significantly more memory and computing power than we otherwise had available. We used the database maintained by the National Center for Biotechnology Information (NCBI) to gather genomic data and the corresponding metadata. We approached the problem in two ways. The first was to feed the raw bacterial genome into a model and train on it. We quickly found that the wide range of genome lengths (between 1,000 and 40 million base pairs) made the data challenging to use, and that the largest genomes required computational power we did not have access to. This effort resulted in a model with 40% accuracy, which was no better than guessing. We then pivoted to protein data. By using the amino acid chains that make up a bacterium's proteins as input instead, we theorized that we could predict the same properties with significantly less computing power.
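To illustrate why protein data can sidestep the variable-length problem, here is a minimal sketch that turns any number of amino acid chains into a fixed-length feature vector (the relative frequency of each of the 20 standard amino acids). This is an assumption about one possible encoding, not the encoding Gen-Predict actually uses; the name `composition_vector` is hypothetical.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(protein_sequences):
    """Return the relative frequency of each standard amino acid.

    Hypothetical helper: the output always has 20 entries, no matter how
    many proteins the organism has or how long they are.
    """
    counts = Counter()
    for seq in protein_sequences:
        # Ignore ambiguous or non-standard residue codes such as 'X'.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Two toy "proteins"; real inputs would come from NCBI protein records.
    print(composition_vector(["ACCA", "aa"]))
```

A vector like this discards the order of residues, so it is only a baseline; it does, however, keep the input size constant across organisms.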

What's next for Gen-Predict

The next step, which I am in the process of implementing, is to use an LSTM so that the data can be fed to the model and trained on piecewise. The model is currently being trained on Google Cloud's ML Engine platform.
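As a rough sketch of what feeding the data piecewise could look like, the code below encodes an amino acid chain as integers and splits it into fixed-length windows that a recurrent model such as an LSTM could consume one at a time. The window size, the padding scheme, and the names `encode` and `chunk` are all assumptions made for illustration; the LSTM itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding and for unknown residues (assumption).
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map each residue to an integer index; unknown residues map to 0."""
    return [AA_TO_INDEX.get(ch, 0) for ch in sequence.upper()]


def chunk(encoded, window, pad_value=0):
    """Split an encoded sequence into windows of equal length.

    The final window is right-padded with pad_value so every piece has
    exactly `window` entries.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    chunks = []
    for start in range(0, len(encoded), window):
        piece = encoded[start:start + window]
        piece = piece + [pad_value] * (window - len(piece))
        chunks.append(piece)
    return chunks


if __name__ == "__main__":
    for piece in chunk(encode("ACDEFG"), window=4):
        print(piece)
```

Each window could then be passed to the model in order, with the recurrent state carried from one window to the next so that memory use depends on the window size rather than on the full sequence length.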
