Wednesday, September 26, 2007

Diploid Genome Shows That People Might Differ By 63% of Their Genes

Craig Venter has just sequenced his diploid genome. The results were published in PLoS Biology (or you can take a look at the actual map of his diploid genome sequence), and he explains their significance in his new book A Life Decoded.

And although they were referred to as complete, they were in fact half-genomes -- or "haploid" -- containing a mom-and-pop mosaic of the 3 billion DNA letters found on just one set of the 23 chromosomes paired in every cell.

Not emphasized in 2001 was the fact that people have in their cells two versions of each of those 23 chromosomes, one from each parent -- a "diploid" genome.

Dr. Venter has spent the last five years and an extra $10 million of his institute’s money in improving the draft genome he prepared at Celera. That genome was based mostly on his own DNA, and the new diploid version is entirely so. It was decoded with an old method, known as Sanger sequencing, that is expensive but analyzes stretches of DNA up to 800 units in length. The cheaper new technologies at present analyze pieces of DNA only 200 units or so long, and the shorter lengths are much harder to assemble into a complete genome.

And unlike the Human Genome Project, whose focus on individual letters made it blind to many larger mutations or variations involving hundreds or thousands of letters, the newer methods that Venter used capture all sizes.
And what did they find? From the report:
Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels)(1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants.
Or in slightly easier to read prose:
All told, 44 percent of the genes Venter received from one parent were at least a little different from those he inherited from his other parent, and a third of those variations had never been seen in studies of those genes in other people.

One type is called indels, where a single DNA unit has either been inserted or deleted from the genome. Another is copy number variation, in which the same gene can exist in multiple copies. There are also inversions, in which a stretch of DNA has been knocked out of its chromosome and reinserted the wrong way around. Dr. Venter’s genome has four million variations compared with the consortium’s, including three million snips, nearly a million indels and 90 inversions.

Specifically, older analyses suggested that humans' genetic codes are, on average, 99.9 percent identical (or 0.1 percent different), while the new estimate comes in at 99.5 percent (or 0.5 percent different). The true number may be as low as 99 percent, Venter said.
It is amazing to me that 44% of the genes he inherited from one parent differed from the versions he inherited from the other.

If I understand that correctly, it means that 22% (44%/2) of the genes in each haploid genome have a variation in them. So, for two people to have an identically coded gene, each would have had to get 2 copies (one from each parent) of the gene with no variations. The odds of a single copy having no variations are .78 (1-.22), and the odds of all four copies having none are just .78^4 = 37%. So 63% (1-.37) of the time, the coding for any given gene will differ between two people.

If you were to compare your diploid genome with that of another random individual, of your approximately 25,000 genes, you would share only (25,000*.37=) 9,250 identically coded genes. This assumes that variations are randomly distributed throughout the genome, which might be a very bad assumption, so take this number with a grain of salt until real professionals run the numbers.
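For anyone who wants to replay that back-of-the-envelope arithmetic, here is a minimal Python sketch. The function and constant names are mine, and it inherits the same shaky assumptions as the estimate above: halving the 44% figure to get a per-copy rate, and treating variants as independent and evenly spread across genes.

```python
# Rough estimate only. Assumes variants are spread uniformly and independently
# across genes, which is flagged above as a possibly very bad assumption.
HETEROZYGOUS_FRACTION = 0.44  # share of genes heterozygous, per the report
TOTAL_GENES = 25_000          # approximate human gene count used in this post


def p_gene_identical(heterozygous_fraction: float) -> float:
    """Chance that two people carry an identically coded gene."""
    # Simplification from the post: half the heterozygous fraction is the
    # chance that any single haploid copy of a gene carries a variation.
    p_clean_copy = 1 - heterozygous_fraction / 2
    # Two people hold four copies of each gene between them.
    return p_clean_copy ** 4


p_same = p_gene_identical(HETEROZYGOUS_FRACTION)
shared_genes = TOTAL_GENES * p_same

print(f"P(gene identical) = {p_same:.2f}")
print(f"P(gene differs)   = {1 - p_same:.2f}")
print(f"Shared genes      = {shared_genes:,.0f}")
```

Without rounding .78^4 down to .37 first, the shared-gene count comes out a few genes above 9,250, which is well within the noise of an estimate like this.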

While at the level of base pairs we are 99.5% similar (14 million base pairs might differ in a haploid genome of 2.8 billion), at the level of genes the differences are much greater, as only 37% of them are likely to be the same between two individuals. Whether you choose to say we are 99.5% or 37% the same genetically depends on how you look at it. Personally, I think the latter makes more sense.

And what do they still not know?
Although Venter's method produces a 6 billion-letter diploid genome, it does not produce complete paternal and maternal genomes of 3 billion letters each.

There are 4,500 gaps where the sequence of DNA units is uncertain, and no technology yet exists for decoding the large amounts of DNA at the center and tips of the chromosomes.
While they have made great strides, this diploid genome is not really "complete" yet, just much closer. There are likely to be more variations found.

And where do we go from here? Venter sees the following:
I think next year we'll probably see 30 to 50 individual genomes done, and hopefully a major escalation from there. Our goal is to maybe, over the next five years, get as many as 10,000 different complete human genomes.
And how much will it cost?
Cost trends are encouraging. The first 3 billion-letter genome sequences took more than a decade to complete and cost billions of dollars. During Venter's latest project, costs dropped precipitously, and today, several scientists said, an entire diploid genome could probably be done for about $100,000. Some predict that a $1,000 genome will be available within five years.
$1,000 in 5 years! I hope that will happen, but I think it will take more like 10 or 15 years to hit that mark.

Another interesting question is how much space it will take to store your genome. If you were to record every base pair of your diploid genome, there would be around 5.6 billion base pairs, which could be stored in 2.8 GB at half a byte per base pair. Put another way, you could store it on your 4 GB iPod Nano and still have 1.2 GB left for music.
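The 2.8 GB figure works out if each base pair takes half a byte (4 bits). Here is a small sketch of that calculation; the helper name is hypothetical and it uses decimal gigabytes (1 GB = 1e9 bytes). Since there are only four letters, a tighter 2-bit encoding would in principle halve the size, so the second line shows that case for comparison.

```python
def genome_storage_gb(base_pairs: float, bits_per_base: int) -> float:
    """Raw storage for a sequence, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return base_pairs * bits_per_base / 8 / 1e9


DIPLOID_BASE_PAIRS = 5.6e9  # figure used in this post

# 4 bits per base is the encoding implied by the 2.8 GB estimate.
print(f"4 bits/base: {genome_storage_gb(DIPLOID_BASE_PAIRS, 4):.1f} GB")
# 2 bits per base is the minimum needed to tell A, C, G and T apart.
print(f"2 bits/base: {genome_storage_gb(DIPLOID_BASE_PAIRS, 2):.1f} GB")
```

Neither figure accounts for compression or for the gaps and ambiguous bases a real sequence file would have to represent.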

But instead of storing all the base pairs, you really only need to store the deltas from the standard genome. If you just record what makes you different, you can reconstruct your own genome from the standard one. Each genome differs by maybe 3.2 million SNPs and another 640,000 non-SNP variations. Multiply that by two for your diploid genome, add in the space to record where these variations occur in your genome (let's say that increases the size 5 times) and that gets you to about 20 MB, or roughly the space a 20-minute MP3 file takes up. Amazing that it takes up more space to save a Dave Matthews Band extended jam song than it does to record what makes you unique genetically.
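Here is a rough Python sketch of both halves of that idea: the size estimate, and rebuilding a sequence from a reference plus a list of differences. Everything named here is hypothetical. The half byte per variant is my reading of the arithmetic behind the 20 MB figure, the 5x position overhead is the guess made above, and the toy `apply_snps` only handles single-letter substitutions, not indels or inversions.

```python
def delta_size_mb(snps: float, non_snps: float,
                  bytes_per_variant: float = 0.5,
                  position_overhead: float = 5) -> float:
    """Estimated size in MB of a diploid 'differences only' file."""
    variants = 2 * (snps + non_snps)  # doubled for the diploid genome
    return variants * bytes_per_variant * position_overhead / 1e6


def apply_snps(reference: str, snps: dict) -> str:
    """Rebuild a sequence from a reference plus {position: new_base} deltas."""
    bases = list(reference)
    for position, base in snps.items():
        bases[position] = base  # 0-based position in the reference
    return "".join(bases)


print(f"Delta file: about {delta_size_mb(3.2e6, 640_000):.1f} MB")

# Toy example: an 8-letter reference and two substitutions.
print(apply_snps("ACGTACGT", {2: "T", 5: "A"}))
```

A real format would need to record variant length and type as well, so treat the 5x overhead as a placeholder rather than a measured number.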

I can't wait for the day when I can get my genome sequenced.

via NY Times and Washington Post and News Hour


Audacious Epigone said...

Hmm, #41 or my genome sequenced? I guess I would have to go with the latter, perhaps with a tinge of regret!

Great post. Thanks.

al fin said...

There are different levels of looking at genes and gene expression. You're right to say that the extent of genetic difference depends on the level you're analysing. CNV, point mutation, indels, etc. If you look further at the epigenetic mechanisms of gene expression you may find even greater differences. The "junk dna" holds a lot of surprises.

We are unique. We are the same. It depends on what you look at.
