In my 10 years of teaching undergraduate and graduate students at Harvard and CU Boulder, this was by far the most incredible experience yet!
This year Michael Smallegan and I developed a computational lab course -- I never guessed it would result in a primary study that we just published.
This class was motivated by the clear revolution in data science and literate programming for ultimate data-analysis reproducibility. In fact, the term "Bioinformatics" may be outdated; the field is more accurately described as the data science of biology. We were inspired by the idea that data-science principles, in their purity and elegance, could be distilled into a course anyone could take -- no terminal experience needed.
Our overall goal was to teach the best practices of data science and apply them in practice to explore the human genome.
First, the class used bash (Unix) to download and uniformly process thousands of publicly available data sets -- thanks to the Encyclopedia of DNA Elements (ENCODE) and their experimental documentation and standards. We found over 1,000 standardized Chromatin Immunoprecipitation (ChIP) sequencing experiments in the erythroid (blood) cell line K562. ChIP is basically a technique to find where on DNA a specific protein binds. In this case we had access to data representing 195 unique DNA binding proteins -- for free!
Note: with over 5,000 simultaneous downloads of ~1GB files, we had several MD5 checksum errors -- check your data downloads!
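For anyone wondering what "check your data downloads" can look like in practice, here is a minimal sketch in Python (in class we worked in bash, so this is an illustration, not the code we ran). The function names and the `expected` mapping are hypothetical; the idea is simply to compare each file's MD5 against the checksum published alongside it and collect the files that need to be downloaded again.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a ~1GB file never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_downloads(expected):
    """Return paths that are missing or whose MD5 does not match.

    `expected` is a hypothetical mapping of local file path -> published
    MD5 hex string; build it from whatever metadata came with your files.
    """
    bad = []
    for path, expected_md5 in expected.items():
        if not Path(path).is_file() or md5_of_file(path) != expected_md5.lower():
            bad.append(path)
    return bad
```

Anything returned by `find_corrupt_downloads` is a candidate for re-downloading before it goes anywhere near a pipeline.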
Next we used the brand-new Nextflow resources from nf-core, which containerize commonly used algorithms for genomic data (e.g. BWA, peak calling, etc.). This ensures that anyone who inputs these 1,000+ data files into the nf-core ChIP-seq pipeline will get the same results every time. It took 4 days to analyze all these data sets, but it allowed the students to go from raw sequencing reads to "peaks" showing where each DNA binding protein is localized in the human genome.
In the second half of the course we analyzed the 195 DNA binding proteins (DBPs) and how they were choreographed to localize across the human genome. Here we used R Markdown and literate programming principles (e.g. the Tidyverse pipe, %>%) so the code was as simple as possible. Our goal was to walk through very basic aspects and build towards more complicated statistics, permutations and other analyses.
Enter Covid-19
The ski resorts were reporting Covid cases and we knew that could not be good. On March 4th we made a plan as a class to go remote. By March 13 we were remote, on a Zoom and Slack combo.
It was AMAZING!! Instead of walking around from keyboard to keyboard we had breakout rooms. Slack was an amazing companion resource where groups could discuss and debug code. It was really fun seeing the students all interacting on Slack at all hours of the day :)
We were off and running faster than in person -- the group dynamics and productivity increased many fold. We were very privileged to have everyone rally and dive into the human genome as a distraction from the world -- even if only for a few hours a week. At least it was a pleasant space of reprieve.
This was going so well that we finished the course material (including making figures) 2 weeks early! Michael and I proposed to the class that we could go back over all our analysis and write a manuscript as an additional exercise -- unanimous thumbs up on Zoom.
-- The discovery --
Earlier we had seen a very bizarre result: 1,362 promoters (the "on switch" for genes) had up to 111 DBPs localized and ready for activity, but the gene remained off -- bizarre considering any other promoter with that many DBPs was very highly expressed. We brought in more data sets to be sure this wasn't an error or a known phenomenon (e.g. super-enhancers). Several analyses later it was clear these were a new phenomenon in the human genome. The students termed these promoters "reservoirs" as they are a storage place for DBPs.
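To make the idea of a reservoir concrete, here is a toy sketch in Python of the kind of filter that flags such promoters: lots of DBPs bound, but essentially no expression. Everything in it is an assumption for illustration: the `Promoter` fields, the function name and both thresholds are hypothetical and are not the cutoffs or code from our analysis (which was done in R).

```python
from dataclasses import dataclass


@dataclass
class Promoter:
    gene: str
    n_dbps: int        # number of distinct DBPs with a peak at this promoter
    expression: float  # expression of the gene (units are up to you, e.g. TPM)


def find_reservoir_candidates(promoters, min_dbps, max_expression):
    """Keep promoters with many DBPs bound whose gene is nonetheless off.

    `min_dbps` and `max_expression` are hypothetical, user-chosen cutoffs.
    """
    return [
        p for p in promoters
        if p.n_dbps >= min_dbps and p.expression <= max_expression
    ]


# Made-up toy data, purely to show how the filter behaves.
toy = [
    Promoter("geneA", n_dbps=90, expression=150.0),  # many DBPs, highly expressed
    Promoter("geneB", n_dbps=85, expression=0.0),    # many DBPs, gene off
    Promoter("geneC", n_dbps=2, expression=0.0),     # few DBPs, gene off
]
candidates = find_reservoir_candidates(toy, min_dbps=50, max_expression=0.5)
print([p.gene for p in candidates])
```

With this toy data only `geneB` passes both conditions; a real analysis still needs the follow-up checks described above to rule out errors and known phenomena.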
15 students had a new insight into the human genome -- wow!
That occupied the last two weeks, and we had a starting draft of a manuscript by the end of class. I honestly did not want this class to ever end .... so we proposed that we would be available at the same class time for the next month to finish the paper. I assumed nobody would take this on -- wrong again -- 5 students came to class (6 hrs a week) and continued to code and make figures for a month. This dedication is what was so rewarding about this class.
During this time we had another discovery!! We found that a very special transposon family (jumping genes) had a unique set of DBPs bound -- and despite the many transposon families in the genome, this one stuck out. Specifically, the SVA repeat family is a brand new invader of the human genome that has an affinity for three specific DBPs. If they bind, these invaders come to life and start producing RNA. Further investigation found that SVA repeats are enriched for a very specific type of DNA packing -- or enhancer regions. Thus, these viruses that have recently entered the human genome are quite active -- unlike most of their other counterparts.
The rest is history and now published -- I am still letting that sink in .... The MVP of the class is clearly Michael Smallegan, who went above and beyond in his teaching. We also had the support of the CU Biochemistry, MCDB and CS departments, BioFrontiers and IQ Biology. The class was generously supported and facilitated by the BioFrontiers IT crew, who also gave guest lectures. A big thank you to Michael Snyder, who set aside time for an entire 2 hr lecture and review of the students' data -- incredibly generous, and it inspired the students!
Check out more on reservoirs, zombies, ghosts and SVA regulation in the human genome.