The Wellcome Trust Sanger Institute has sequenced the equivalent of 300 human genomes in just over six months. The Institute has just reached the staggering total of 1,000,000,000,000 letters of genetic code that will be read by researchers worldwide, helping them to understand the role of genes in health and disease. Scientists will be able to answer questions unthinkable even a few years ago and human medical genetics will be transformed.
Some of this is part of the 1000 Genomes Project, an effort to sequence that many human genomes. This will allow us to gain a tremendous amount of insight into just what it is that makes each of us different or the same.
All this PR really states is that they are now capable of sequencing about 45 billion base pairs of DNA a day. They are not directly applying all of that capability to the human genome. While they, or someone, possibly could, the groups involved with 1000 genomes will take a more statistical approach to speed things up and lower costs.
It starts with in-depth sequencing of a couple of nuclear families (about 6 people). This will be high-resolution sequencing equivalent to 20 passes over each person's entire genome. This level of redundancy will help edit out any sequencing errors introduced by the techniques themselves. It will also help the researchers get a better handle on the best processes to use.
The second step will look at 180 genomes but with only 2 sequencing passes. The high level sequence from the first step will serve as a template for the next 180. The goal here is to be able to rapidly identify sequence variation, not necessarily to make sure every nucleotide is sequenced. It is hoped that the detail learned from step 1 will allow them to be able to infer similar detail here without having to essentially re-sequence the same DNA another 18 times.
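To get a feel for why the lighter second step saves so much effort, here is a rough back-of-the-envelope sketch. It counts work in "genome equivalents" (one full pass over one person's genome), a unit chosen here for illustration. The people and pass counts come from the description above; the names in the code are made up for this example.

```python
# Back-of-the-envelope sequencing budget for the first two steps.
# A "genome equivalent" here means one full pass over one person's genome.
# People and pass counts are taken from the text; names are illustrative.
STEPS = {
    "step 1 (deep, families)": {"people": 6, "passes": 20},
    "step 2 (low coverage)": {"people": 180, "passes": 2},
}


def genome_equivalents(people, passes):
    """Total full-genome passes needed for a group of people."""
    return people * passes


budget = {name: genome_equivalents(**step) for name, step in STEPS.items()}
for name, total in budget.items():
    print(f"{name}: {total} genome equivalents")

# What step 2 would cost if all 180 people were sequenced as deeply as step 1.
step2_at_full_depth = genome_equivalents(people=180, passes=20)
savings_factor = step2_at_full_depth / budget["step 2 (low coverage)"]
print(f"step 2 at 20 passes: {step2_at_full_depth} genome equivalents")
print(f"low-coverage design is {savings_factor:.0f}x cheaper in raw sequencing")
```

Under these numbers the 180 low-coverage genomes cost only three times as much raw sequencing as the 6 deeply sequenced ones, which is the point of leaning on the step 1 template.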
Once they have these approaches worked out, and have an idea of the level of genetic variation expected to be seen, they will examine just the gene coding regions of about 1000 people. This will inform them of how best to proceed to get a more detailed map of an individual’s genome.
This is because the actual differences between any two humans’ DNA sequences are expected to be quite small. So they want to identify processes that will highlight these differences as rapidly and effectively as possible.
They were hoping to sequence the equivalent of 2 human genomes a day, and they are not too far off that mark. At the end of this study, they will have sequenced and deposited into databases 6 trillion bases (a 6 followed by 12 zeroes). In December 2007, GenBank, the largest American database, held a total of 84 billion bases (84 followed by 9 zeroes), which took 25 years to accumulate.
So this effort will add over 60 times as much DNA sequence to the databases as has already been deposited! It plans to do this in only 2 years. The databases, and the tools to examine them, will have to adapt to this huge influx of data.
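The comparison is easy to check with a few lines of arithmetic. This sketch uses only the figures quoted above (6 trillion bases in 2 years versus 84 billion bases in 25 years). The per-year rates are simple averages, which is an assumption, since neither effort produced data at a constant pace.

```python
# Compare the planned 1000 Genomes output with GenBank's holdings as of
# December 2007, using only the figures quoted in the text.
PROJECT_BASES = 6 * 10**12   # 6 trillion bases
PROJECT_YEARS = 2
GENBANK_BASES = 84 * 10**9   # 84 billion bases
GENBANK_YEARS = 25

# How many times more sequence the project will deposit.
size_ratio = PROJECT_BASES / GENBANK_BASES

# Average deposit rates in bases per year (a simplification: real growth
# was far from uniform over time).
project_rate = PROJECT_BASES / PROJECT_YEARS
genbank_rate = GENBANK_BASES / GENBANK_YEARS
rate_ratio = project_rate / genbank_rate

print(f"Total sequence: about {size_ratio:.0f}x GenBank's 2007 holdings")
print(f"Average yearly rate: about {rate_ratio:.0f}x GenBank's 25-year average")
```

So "over 60 times" is, if anything, conservative: the ratio works out to about 71, and the average yearly rate is closer to 900 times GenBank's historical average.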
And, more importantly, the scientists doing the examining will have to appreciate the sheer size of this. It took 13 years to complete the Human Genome Project. Now, 5 years after that project was completed, we can potentially sequence a single human genome in half a day.
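For a rough sense of that speed-up, the sketch below turns the press release's "300 human genomes in just over six months" into a per-genome time. The figure of 183 days for "just over six months" is my own assumption, as is treating the 13 years of the Human Genome Project as the time for one genome.

```python
# Rough throughput implied by the press release figures.
GENOMES_SEQUENCED = 300   # "equivalent of 300 human genomes"
DAYS_ELAPSED = 183        # assumption: "just over six months"
HGP_YEARS = 13            # Human Genome Project duration, from the text

genomes_per_day = GENOMES_SEQUENCED / DAYS_ELAPSED
hours_per_genome = 24 / genomes_per_day

# Compare with the Human Genome Project, treated here as 13 years for a
# single genome (a simplification that ignores differences in quality).
hgp_days = HGP_YEARS * 365
speedup = hgp_days / (hours_per_genome / 24)

print(f"{genomes_per_day:.2f} genome equivalents per day")
print(f"{hours_per_genome:.1f} hours per genome equivalent")
print(f"roughly {speedup:,.0f}x faster than the Human Genome Project")
```

Averaged over the whole six months, this comes out a little under the 2 genomes a day target and a bit over half a day per genome, so the present-day rate is presumably higher than the average.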
The NIH had projected that technology would support sequencing a single human genome in 1 day for under $1000 in 4 years or so. The members of 1000 Genomes are hoping to accomplish their work for $30,000-50,000 per genome. So the NIH projection may not be too far off.
But what will the databases look like that store and manipulate this huge amount of data? The Sanger Institute is generating 50 Terabytes of data a week, according to the PR.
Maybe I should invest in data storage companies.