The first result of the National Geographic Genographic Project was published in July 2007. The full paper and its attachments can be accessed at the PLOS Genetics web site.
Unfortunately for the W and X haplogroups, a key data table was truncated due to Microsoft Excel limitations, so the sequence data for the W individuals was not available. This has now been corrected, and a complete advance copy of the corrected table is available for download.
The Genographic Project had collected 76,638 samples with HVS-I (16024-16569) tested. Of these, 16,609 samples were assigned to a haplogroup by testing 22 locations in the coding region (the rest were tested at only 10 coding region locations). These 16,609 samples made up what the paper called the reference database. A limitation of the data up to mid-2007 was that the great majority of participants were of European descent, although many more samples were expected to be collected from other areas of the world in the future. For the W's, this meant that the sample was still based largely on European individuals, while W's in the south Asian homeland and along the intervening transit routes to Europe were undersampled.
The study does not address the geographic origin of the individuals' ancestors, which was a key point for the Genographic Project. Hopefully this will be rectified in future publications, especially those that add to the public sequences.
Assigning Individuals to Haplogroups
The original 10 coding region locations tested were: 3594, 4580, 5178, 7028, 10400, 10873, 11467, 11719, 12705, and 14766. Later, ten more were added: 4248, 6371, 8994, 10034, 10238, 10550, 12612, 13263, 13368, and 13928. Finally, 2758 and 8280 were added. At a later point, the tests for 8994 and 13928 were replaced by tests for 1243 and 3970.
In terms of the W haplogroup, the earlier tests identified W by the change at 8994; and this was then replaced by 1243. The final descent tree derived from the entire reference database shows the defining mutations leading from mitochondrial Eve to Wilma were: 2758 => 3594 => 10873 => 1243.
Analysis of the data shows that using HVS-I data alone for assigning an individual to a haplogroup was correct 85.3% of the time; using the 22 coding region locations, 96.72%.
For W, the haplogroup-predicting motif in HVS-I was 16223T 16292T. However, of the 282 W's defined by coding-region changes, 1.4% and 15.2% had lost 16223T and 16292T, respectively. Furthermore, while 236 W's had the 16223T 16292T motif, it was also shared by 6 L0/L1; 5 L2; 31 L3*; 2 M*; 2 C; 1 D; 3 I; 2 X; and 1 U*. Therefore, based on the 16223T 16292T motif alone, a true W would fail to be identified as a W about 17% of the time, and an individual from another haplogroup would be erroneously identified as a W about 18% of the time. This shows the necessity of coding region testing to be fully confident in assigning W's to this haplogroup.
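The two error rates can be checked directly from the counts quoted above. This is only a back-of-the-envelope sketch; the variable names are mine, and the counts are the ones given in the paragraph.

```python
# Counts quoted in the text above.
w_total = 282          # W's defined by coding-region changes
w_with_motif = 236     # W's carrying both 16223T and 16292T
others_with_motif = {  # non-W individuals carrying the same motif
    "L0/L1": 6, "L2": 5, "L3*": 31, "M*": 2, "C": 2,
    "D": 1, "I": 3, "X": 2, "U*": 1,
}

non_w_with_motif = sum(others_with_motif.values())
missed_w = w_total - w_with_motif

# Share of true W's that the motif alone would miss.
missed_rate = missed_w / w_total
# Share of motif carriers that are not actually W.
wrong_w_rate = non_w_with_motif / (w_with_motif + non_w_with_motif)

print(f"true W's missed by the motif: {missed_rate:.1%}")
print(f"motif carriers that are not W: {wrong_w_rate:.1%}")
```

The direct count gives a miss rate of roughly 16%, close to the 17% quoted above (which is also about what you get by adding the 1.4% and 15.2% loss rates), and about 18% for the share of motif carriers that are not W.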
It was also noted that some W's (the W3 clade) had the same coding region change at 13263 that was used to identify Haplogroup C. However these individuals were correctly identified as W's based on the defining mutation at 1243 and confirmed by the defining 2758 3594 10873 1243 coding region motif mentioned above.
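To illustrate why the order of the checks matters, here is a minimal sketch of such an assignment rule. It is not the Genographic Project's actual procedure: the function name, the labels it returns, and the idea of representing a sample as the set of tested positions at which it shows the change are all assumptions made for illustration.

```python
# Coding-region path to W as given in the text: 2758 => 3594 => 10873 => 1243.
W_PATH = (2758, 3594, 10873, 1243)
W_DEFINING = 1243
C_MARKER = 13263  # position used to identify Haplogroup C

def assign(changes):
    """Assign a sample from the set of positions where it shows the change.

    Hypothetical rule: test the W-defining position first, so that a
    W3 individual carrying 13263 is not mistaken for a C.
    """
    if W_DEFINING in changes:
        if all(pos in changes for pos in W_PATH):
            return "W"
        return "W (path not fully confirmed)"
    if C_MARKER in changes:
        return "C candidate"
    return "unassigned"

# A made-up W3-like sample: full W path plus the 13263 change.
w3_like = {2758, 3594, 10873, 1243, 13263}
print(assign(w3_like))    # W
print(assign({13263}))    # C candidate
```

A rule that looked at 13263 first would have labelled the W3-like sample as C, which is the pitfall described above.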
No Neanderthal mtDNA
The entire database was searched for signs of the Neanderthal mtDNA motif (16037G, 16139t, 16244A, 16262T, 16263.1A). No sign of this was found, even allowing for subsequent back mutations and additional mutations.
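A search that tolerates back mutations can be sketched as a partial-match count: rather than requiring all five motif variants, flag any sample carrying most of them. The threshold of three, the function names, and the sample data below are all hypothetical; the paper's actual search criteria are not described here.

```python
# Motif as listed in the text above.
NEANDERTHAL_MOTIF = {"16037G", "16139t", "16244A", "16262T", "16263.1A"}

def motif_hits(haplotype, motif=NEANDERTHAL_MOTIF):
    """Count how many motif variants a sample's HVS-I haplotype carries."""
    return len(set(haplotype) & motif)

def candidates(samples, min_hits=3):
    """Return IDs of samples with at least min_hits motif variants.

    min_hits below the full motif size allows for back mutations;
    the value 3 is an arbitrary choice for this sketch.
    """
    return [sid for sid, hap in samples.items() if motif_hits(hap) >= min_hits]

# Made-up samples for illustration only.
samples = {
    "s1": {"16223T", "16292T"},             # ordinary W-like motif
    "s2": {"16262T", "16311C"},             # one chance match
    "s3": {"16037G", "16244A", "16262T"},   # would be flagged
}
print(candidates(samples))  # ['s3']
```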
The entire database was also searched for any sign of recombination, the process whereby the mitochondrial genetic material from two individuals of different haplogroups may have merged. This has been postulated in some studies to explain anomalous DNA results, or possible inheritance from the male line (via mitochondria in sperm cells that might merge with those in the egg). No sign of this was found.
How Many Haplotypes?
What are the limits on the number of haplogroups and haplotypes? Graphing the number of new HVS-I haplotypes discovered as the sample size increased indicated that even with 76,638 samples, the curve was only just beginning to level off at 11,346 haplotypes. Eyeballing the graph, the total number of HVS-I haplotypes is probably in the mid tens-of-thousands (which means an average of around 100,000 people share the same HVS-I haplotype). When restricting the study to the largest haplogroup in the sample (European H), the curve of the graph indicated there may be a total of under 5,000 haplotypes within the 150 million or so H's in the world. Around twice as many haplotypes would be expected within the haplogroups of greater antiquity (e.g. L's, M's, C's, etc.).
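The kind of graph described above is a discovery (accumulation) curve: the number of distinct haplotypes seen so far, plotted against the number of samples examined. A minimal sketch follows, using a made-up toy list rather than the Genographic data; the function name and the step parameter are mine.

```python
def accumulation_curve(haplotypes, step=1):
    """Return (samples_seen, distinct_haplotypes) points every `step` samples."""
    seen = set()
    curve = []
    for n, hap in enumerate(haplotypes, start=1):
        seen.add(hap)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve

# Toy data: each letter stands for one HVS-I haplotype.
toy = ["A", "B", "A", "C", "A", "B", "D", "A"]
print(accumulation_curve(toy, step=2))
# [(2, 2), (4, 3), (6, 3), (8, 4)]

# From the figures in the text: average samples per haplotype found so far.
samples_per_haplotype = 76_638 / 11_346
print(round(samples_per_haplotype, 1))  # 6.8
```

When the curve flattens, new samples mostly repeat haplotypes already seen; a curve that is still rising, as described above, means many haplotypes remain undiscovered.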
New W Sequences
The Genographic Project listed 54 motifs among the 144 individuals in the reference sample. By contrast, pre-existing data from Mitosearch, scientific publications, FTDNA, Blood of the Isles, and other studies had identified 129 motifs among 422 individuals. However, it is impossible to determine which method was used to assign each of the 422 individuals to the W haplogroup. As noted above, use of just the HVS-I data could lead to mis-assignment about 15% of the time. It was known that FTDNA did not use coding region tests to assign individuals prior to 2004, and some testing services still might not. Therefore some of the W motifs in the existing data may in fact belong to other haplogroups. This shows the value of having a common reliable data source.
In addition, there are 57 W haplogroup full mtDNA sequences available at GenBank. These display 11 motifs, of which 9 are not in the Genographic sample. This can be attributed to the wider geographic sampling area and shows the importance of the Genographic Project expanding its sampling to a large part of Eurasia.
These are the 24 W motifs in the Genographic data that were not in the previous database:
And these are the 9 W motifs from GenBank that are not in the Genographic data: