I was wondering what is the range of SNP density across species and model organisms.
I.e., what would be a reasonable estimate for the SNP density (i.e., x SNPs for each 1 Kb) in:
- humans with the highest variation (i.e., Africans)
- lab mice with highest variation (i.e., isogenic lab mouse 1 with isogenic lab mouse 2 with the highest variation)
- model organism with the highest SNP density (could be lab or naturally occurring)
Human = ~1.18 SNPs per kb
Based on CgsSNP, the average numbers of SNPs per 10 kb was 8.33, 8.44, and 8.09 in the human genome, in intergenic regions, and in genic regions, respectively.
Source: Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. by Zhao Z, Fu YX, Hewett-Emmett D, Boerwinkle E.
Drosophila melanogaster = ~0.02 SNPs per kb
FLYSNPdb provides high-resolution single nucleotide polymorphism (SNP) data of Drosophila melanogaster. The database currently contains 27 367 polymorphisms, including >3700 indels (insertions/deletions), covering all major chromosomes. These SNPs are clustered into 2238 markers, which are evenly distributed with an average density of one marker every 50.3 kb or 6.6 genes.
source: FLYSNPdb: a high-density SNP database of Drosophila melanogaster
Mouse (Mus musculus) = ~0.2 SNPs per kb
Source: Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse
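The figures quoted above use different units, so a small conversion helps when comparing them. The following is a minimal sketch that only re-expresses numbers already quoted in this answer; the function names are illustrative. Note that the FLYSNPdb figure is a marker density (several SNPs are clustered into each marker), so the value it yields is markers per kb rather than SNPs per kb.

```python
# Figures quoted in the answer above; nothing here is new data.
human_snps_per_10kb = 8.33   # Zhao et al., CgsSNP, whole human genome
fly_kb_per_marker = 50.3     # FLYSNPdb: one marker every 50.3 kb


def per_kb_from_per_10kb(count_per_10kb: float) -> float:
    """Convert a count per 10 kb into a count per 1 kb."""
    return count_per_10kb / 10.0


def per_kb_from_spacing(kb_per_item: float) -> float:
    """Convert an average spacing (kb per item) into items per 1 kb."""
    return 1.0 / kb_per_item


human_density = per_kb_from_per_10kb(human_snps_per_10kb)
fly_marker_density = per_kb_from_spacing(fly_kb_per_marker)

print(f"human (CgsSNP): {human_density:.3f} SNPs per kb")
print(f"fly (FLYSNPdb): {fly_marker_density:.3f} markers per kb")
```

Because densities depend heavily on how many individuals were sampled and how variants were filtered, values from different databases are only roughly comparable.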
Genetic diversity in humans and non-human primates and its evolutionary consequences
Genetic diversity is a key parameter in population genetics and is important for understanding the process of evolution and for the development of appropriate conservation strategies. Recent advances in sequencing technology have enabled the measurement of genetic diversity of various organisms at the nucleotide level and on a genome-wide scale, yielding more precise estimates than were previously achievable. In this review, I have compiled and summarized the estimates of genetic diversity in humans and non-human primates based on recent genome-wide studies. Although studies on population genetics demonstrated fluctuations in population sizes over time, general patterns have emerged. As shown previously, genetic diversity in humans is one of the lowest among primates; however, certain other primate species exhibit genetic diversity that is comparable to or even lower than that in humans. There exists greater than 10-fold variation in genetic diversity among primate species, and I found weak correlation with species fecundity but not with body or propagule size. I further discuss the potential evolutionary consequences of population size decline on the evolution of primate species. The level of genetic diversity negatively correlates with the ratio of non-synonymous to synonymous polymorphisms in a population, suggesting that proportionally greater numbers of slightly deleterious mutations segregate in small rather than large populations. Although population size decline is likely to promote the fixation of slightly deleterious mutations, there are molecular mechanisms, such as compensatory mutations at various molecular levels, which may prevent fitness decline at the population level. The effects of slightly deleterious mutations from theoretical and empirical studies and their relevance to conservation biology are also discussed in this review.
Causes of differences between individuals include independent assortment, the exchange of genes (crossing over and recombination) during reproduction (through meiosis) and various mutational events.
There are at least three reasons why genetic variation exists between populations. Natural selection may confer an adaptive advantage to individuals in a specific environment if an allele provides a competitive advantage. Alleles under selection are likely to occur only in those geographic regions where they confer an advantage. A second important process is genetic drift, which is the effect of random changes in the gene pool, under conditions where most mutations are neutral (that is, they do not appear to have any positive or negative selective effect on the organism). Finally, small migrant populations differ statistically from the overall populations where they originated, which is called the founder effect. When these migrants settle new areas, their descendant population typically differs from their population of origin: different genes predominate and it is less genetically diverse.
In humans, the main cause is genetic drift. Serial founder effects and past small population size (increasing the likelihood of genetic drift) may have had an important influence in neutral differences between populations. The second main cause of genetic variation is due to the high degree of neutrality of most mutations. A small but significant number of genes appear to have undergone recent natural selection, and these selective pressures are sometimes specific to one region.
Genetic variation among humans occurs on many scales, from gross alterations in the human karyotype to single nucleotide changes. Chromosome abnormalities are detected in 1 of 160 live human births. Apart from sex chromosome disorders, most cases of aneuploidy result in death of the developing fetus (miscarriage); the most common extra autosomal chromosomes among live births are 21, 18 and 13.
Nucleotide diversity is the average proportion of nucleotides that differ between two individuals. As of 2004, the human nucleotide diversity was estimated to be 0.1% to 0.4% of base pairs. In 2015, the 1000 Genomes Project, which sequenced one thousand individuals from 26 human populations, found that "a typical [individual] genome differs from the reference human genome at 4.1 million to 5.0 million sites … affecting 20 million bases of sequence"; the latter figure corresponds to 0.6% of the total number of base pairs. Nearly all (>99.9%) of these sites are small differences, either single nucleotide polymorphisms or brief insertions or deletions (indels) in the genetic sequence, but structural variations account for a greater number of base pairs than the SNPs and indels.
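The definition of nucleotide diversity given above can be made concrete with a small calculation. This is a minimal sketch on made-up, pre-aligned toy sequences; the function names are illustrative, and real estimates work on genome-scale data with handling for gaps, missing calls and sample-size corrections that is omitted here.

```python
from itertools import combinations


def pairwise_difference(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def nucleotide_diversity(sequences: list[str]) -> float:
    """Average pairwise difference over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three aligned 10-base sequences.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
pi = nucleotide_diversity(toy)
print(f"nucleotide diversity of toy sample: {pi:.4f}")
```

On this scale, the 0.1% to 0.4% range quoted above corresponds to values of 0.001 to 0.004.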
As of 2017, the Single Nucleotide Polymorphism Database (dbSNP), which lists SNP and other variants, listed 324 million variants found in sequenced human genomes.
Single nucleotide polymorphisms
A single nucleotide polymorphism (SNP) is a difference in a single nucleotide between members of one species that occurs in at least 1% of the population. The 2,504 individuals characterized by the 1000 Genomes Project had 84.7 million SNPs among them.  SNPs are the most common type of sequence variation, estimated in 1998 to account for 90% of all sequence variants.  Other sequence variations are single base exchanges, deletions and insertions.  SNPs occur on average about every 100 to 300 bases  and so are the major source of heterogeneity.
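The 1% threshold in this definition can be expressed as a simple frequency check. The sketch below assumes a biallelic site and a flat string of observed alleles, one character per sampled chromosome; both the data layout and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

SNP_FREQUENCY_THRESHOLD = 0.01  # "at least 1% of the population", as stated above


def minor_allele_frequency(alleles: str) -> float:
    """Frequency of the second most common allele among the observations."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # only one allele observed, so the site is monomorphic
    return counts.most_common()[1][1] / len(alleles)


def is_snp(alleles: str, threshold: float = SNP_FREQUENCY_THRESHOLD) -> bool:
    """True if the minor allele reaches the frequency threshold."""
    return minor_allele_frequency(alleles) >= threshold


# Toy observations (hypothetical): 200 sampled chromosomes at one site.
common_variant = "A" * 197 + "G" * 3   # minor allele at 1.5%
rare_variant = "A" * 199 + "G" * 1     # minor allele at 0.5%

print(is_snp(common_variant), is_snp(rare_variant))
```

Under this definition a site that varies in a sample is not automatically counted as a SNP; rarer differences fall below the cutoff.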
A functional, or non-synonymous, SNP is one that affects some factor such as gene splicing or messenger RNA, and so causes a phenotypic difference between members of the species. About 3% to 5% of human SNPs are functional (see International HapMap Project). Neutral, or synonymous SNPs are still useful as genetic markers in genome-wide association studies, because of their sheer number and the stable inheritance over generations. 
A coding SNP is one that occurs inside a gene. There are 105 Human Reference SNPs that result in premature stop codons in 103 genes. This corresponds to 0.5% of coding SNPs. They occur due to segmental duplication in the genome. These SNPs result in loss of protein, yet all these SNP alleles are common and have not been removed by purifying (negative) selection.
Structural variation
Structural variation is the variation in structure of an organism's chromosome. Structural variations, such as copy-number variation and deletions, inversions, insertions and duplications, account for much more human genetic variation than single nucleotide diversity. This was concluded in 2007 from analysis of the diploid full sequences of the genomes of two humans: Craig Venter and James D. Watson. This added to the two haploid sequences which were amalgamations of sequences from many individuals, published by the Human Genome Project and Celera Genomics respectively. 
According to the 1000 Genomes Project, a typical human has 2,100 to 2,500 structural variations, which include approximately 1,000 large deletions, 160 copy-number variants, 915 Alu insertions, 128 L1 insertions, 51 SVA insertions, 4 NUMTs, and 10 inversions. 
Copy number variation
A copy-number variation (CNV) is a difference in the genome due to deleting or duplicating large regions of DNA on some chromosome. It is estimated that 0.4% of the genomes of unrelated humans differ with respect to copy number. When copy number variation is included, human-to-human genetic variation is estimated to be at least 0.5% (99.5% similarity).     Copy number variations are inherited but can also arise during development.    
A visual map of the regions of high genomic variation in the modern-human reference assembly relative to a Neanderthal of 50k has been built by Pratas et al.
Epigenetic variation is variation in the chemical tags that attach to DNA and affect how genes get read. The tags, "called epigenetic markings, act as switches that control how genes can be read."  At some alleles, the epigenetic state of the DNA, and associated phenotype, can be inherited across generations of individuals. 
Genetic variability
Genetic variability is a measure of the tendency of individual genotypes in a population to vary (become different) from one another. Variability is different from genetic diversity, which is the amount of variation seen in a particular population. The variability of a trait is how much that trait tends to vary in response to environmental and genetic influences.
In biology, a cline is a continuum of species, populations, varieties, or forms of organisms that exhibit gradual phenotypic and/or genetic differences over a geographical area, typically as a result of environmental heterogeneity.    In the scientific study of human genetic variation, a gene cline can be rigorously defined and subjected to quantitative metrics.
In the study of molecular evolution, a haplogroup is a group of similar haplotypes that share a common ancestor with a single nucleotide polymorphism (SNP) mutation. The study of haplogroups provides information about ancestral origins dating back thousands of years. 
The most commonly studied human haplogroups are Y-chromosome (Y-DNA) haplogroups and mitochondrial DNA (mtDNA) haplogroups, both of which can be used to define genetic populations. Y-DNA is passed solely along the patrilineal line, from father to son, while mtDNA is passed down the matrilineal line, from mother to both daughter or son. The Y-DNA and mtDNA may change by chance mutation at each generation.
Variable number tandem repeats
A variable number tandem repeat (VNTR) is the variation of length of a tandem repeat. A tandem repeat is the adjacent repetition of a short nucleotide sequence. Tandem repeats exist on many chromosomes, and their length varies between individuals. Each variant acts as an inherited allele, so they are used for personal or parental identification. Their analysis is useful in genetics and biology research, forensics, and DNA fingerprinting.
Short tandem repeats (about 5 base pairs) are called microsatellites, while longer ones are called minisatellites.
Recent African origin of modern humans
The recent African origin of modern humans paradigm assumes the dispersal of non-African populations of anatomically modern humans after 70,000 years ago. Dispersal within Africa occurred significantly earlier, at least 130,000 years ago. The "out of Africa" theory originates in the 19th century, as a tentative suggestion in Charles Darwin's Descent of Man,  but remained speculative until the 1980s when it was supported by the study of present-day mitochondrial DNA, combined with evidence from physical anthropology of archaic specimens.
According to a 2000 study of Y-chromosome sequence variation, human Y-chromosomes trace ancestry to Africa, and the descendants of the derived lineage left Africa and eventually replaced archaic human Y-chromosomes in Eurasia. The study also shows that a minority of contemporary populations in East Africa and the Khoisan are the descendants of the most ancestral patrilineages of anatomically modern humans that left Africa 35,000 to 89,000 years ago. Other evidence supporting the theory is that variations in skull measurements decrease with distance from Africa at the same rate as the decrease in genetic diversity. Human genetic diversity decreases in native populations with migratory distance from Africa, and this is thought to be due to bottlenecks during human migration, which are events that temporarily reduce population size.
A 2009 genetic clustering study, which genotyped 1327 polymorphic markers in various African populations, identified six ancestral clusters. The clustering corresponded closely with ethnicity, culture and language. A 2018 whole genome sequencing study of the world's populations observed similar clusters among the populations in Africa. At K=9, distinct ancestral components defined the Afroasiatic-speaking populations inhabiting North Africa and Northeast Africa; the Nilo-Saharan-speaking populations in Northeast Africa and East Africa; the Ari populations in Northeast Africa; the Niger-Congo-speaking populations in West-Central Africa, West Africa, East Africa and Southern Africa; the Pygmy populations in Central Africa; and the Khoisan populations in Southern Africa.
Population genetics
Because of the common ancestry of all humans, only a small number of variants have large differences in frequency between populations. However, some rare variants in the world's human population are much more frequent in at least one population (more than 5%). 
It is commonly assumed that early humans left Africa, and thus must have passed through a population bottleneck before their African-Eurasian divergence around 100,000 years ago (ca. 3,000 generations). The rapid expansion of a previously small population has two important effects on the distribution of genetic variation. First, the so-called founder effect occurs when founder populations bring only a subset of the genetic variation from their ancestral population. Second, as founders become more geographically separated, the probability that two individuals from different founder populations will mate becomes smaller. The effect of this assortative mating is to reduce gene flow between geographical groups and to increase the genetic distance between groups.
The expansion of humans from Africa affected the distribution of genetic variation in two other ways. First, smaller (founder) populations experience greater genetic drift because of increased fluctuations in neutral polymorphisms. Second, new polymorphisms that arose in one group were less likely to be transmitted to other groups as gene flow was restricted.
Populations in Africa tend to have lower amounts of linkage disequilibrium than do populations outside Africa, partly because of the larger size of human populations in Africa over the course of human history and partly because the number of modern humans who left Africa to colonize the rest of the world appears to have been relatively low. In contrast, populations that have undergone dramatic size reductions or rapid expansions in the past and populations formed by the mixture of previously separate ancestral groups can have unusually high levels of linkage disequilibrium.
Distribution of variation
The distribution of genetic variants within and among human populations is impossible to describe succinctly because of the difficulty of defining a "population," the clinal nature of variation, and heterogeneity across the genome (Long and Kittles 2003). In general, however, an average of 85% of genetic variation exists within local populations, 7% is between local populations within the same continent, and 8% occurs between large groups living on different continents. The recent African origin theory for humans would predict that in Africa there exists a great deal more diversity than elsewhere and that diversity should decrease the further from Africa a population is sampled.
Phenotypic variation
Sub-Saharan Africa has the most human genetic diversity and the same has been shown to hold true for phenotypic variation in skull form.   Phenotype is connected to genotype through gene expression. Genetic diversity decreases smoothly with migratory distance from that region, which many scientists believe to be the origin of modern humans, and that decrease is mirrored by a decrease in phenotypic variation. Skull measurements are an example of a physical attribute whose within-population variation decreases with distance from Africa.
The distribution of many physical traits resembles the distribution of genetic variation within and between human populations (American Association of Physical Anthropologists 1996; Keita and Kittles 1997). For example, 90% of the variation in human head shapes occurs within continental groups, and 10% separates groups, with a greater variability of head shape among individuals with recent African ancestors (Relethford 2002).
A prominent exception to the common distribution of physical characteristics within and among groups is skin color. Approximately 10% of the variance in skin color occurs within groups, and 90% occurs between groups (Relethford 2002). This distribution of skin color and its geographic patterning (people whose ancestors lived predominantly near the equator have darker skin than those with ancestors who lived predominantly in higher latitudes) indicate that this attribute has been under strong selective pressure. Darker skin appears to be strongly selected for in equatorial regions to prevent sunburn, skin cancer, the photolysis of folate, and damage to sweat glands.
Understanding how genetic diversity in the human population impacts various levels of gene expression is an active area of research. While earlier studies focused on the relationship between DNA variation and RNA expression, more recent efforts are characterizing the genetic control of various aspects of gene expression including chromatin states,  translation,  and protein levels.  A study published in 2007 found that 25% of genes showed different levels of gene expression between populations of European and Asian descent.      The primary cause of this difference in gene expression was thought to be SNPs in gene regulatory regions of DNA. Another study published in 2007 found that approximately 83% of genes were expressed at different levels among individuals and about 17% between populations of European and African descent.  
Wright's fixation index as a measure of variation
The population geneticist Sewall Wright developed the fixation index (often abbreviated to FST) as a way of measuring genetic differences between populations. This statistic is often used in taxonomy to compare differences between any two given populations by measuring the genetic differences among and between populations for individual genes, or for many genes simultaneously.  It is often stated that the fixation index for humans is about 0.15. This translates to an estimated 85% of the variation measured in the overall human population is found within individuals of the same population, and about 15% of the variation occurs between populations. These estimates imply that any two individuals from different populations are almost as likely to be more similar to each other than either is to a member of their own group.   "The shared evolutionary history of living humans has resulted in a high relatedness among all living people, as indicated for example by the very low fixation index (FST) among living human populations." Richard Lewontin, who affirmed these ratios, thus concluded neither "race" nor "subspecies" were appropriate or useful ways to describe human populations. 
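One common textbook formulation of the fixation index for a single biallelic locus is FST = (HT - HS) / HT, where HS is the average expected heterozygosity within subpopulations and HT is the expected heterozygosity of the pooled population. The sketch below uses that formulation with equally weighted subpopulations and made-up allele frequencies; it is an illustration of the idea, not the estimator used in any study cited here.

```python
def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs: list[float]) -> float:
    """FST = (HT - HS) / HT for equally weighted subpopulations.

    freqs holds the frequency of one allele in each subpopulation.
    """
    if not freqs:
        raise ValueError("need at least one subpopulation")
    h_s = sum(heterozygosity(p) for p in freqs) / len(freqs)  # within groups
    p_bar = sum(freqs) / len(freqs)                           # pooled frequency
    h_t = heterozygosity(p_bar)                               # total
    if h_t == 0.0:
        raise ValueError("locus is monomorphic across all subpopulations")
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two subpopulations.
fst = fst_biallelic([0.3, 0.7])
print(f"FST = {fst:.2f}")
```

With these toy frequencies the result is about 0.16, meaning roughly 84% of the expected heterozygosity is found within subpopulations, which is the same kind of reading applied to the human value of about 0.15.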
Wright himself believed that values >0.25 represent very great genetic variation and that an FST of 0.15–0.25 represented great variation. However, about 5% of human variation occurs between populations within continents; FST values between continental groups of humans (or races) as low as 0.1 (or possibly lower) have therefore been found in some studies, suggesting more moderate levels of genetic variation. Graves (1996) has countered that FST should not be used as a marker of subspecies status, as the statistic is used to measure the degree of differentiation between populations, although see also Wright (1978).
Jeffrey Long and Rick Kittles give a long critique of the application of FST to human populations in their 2003 paper "Human Genetic Diversity and the Nonexistence of Biological Races". They find that the figure of 85% is misleading because it implies that all human populations contain on average 85% of all genetic diversity. They argue the underlying statistical model incorrectly assumes equal and independent histories of variation for each large human population. A more realistic approach is to understand that some human groups are parental to other groups and that these groups represent paraphyletic groups to their descent groups. For example, under the recent African origin theory the human population in Africa is paraphyletic to all other human groups because it represents the ancestral group from which all non-African populations derive, but more than that, non-African groups only derive from a small non-representative sample of this African population. This means that all non-African groups are more closely related to each other and to some African groups (probably east Africans) than they are to others, and further that the migration out of Africa represented a genetic bottleneck, with much of the diversity that existed in Africa not being carried out of Africa by the emigrating groups. Under this scenario, human populations do not have equal amounts of local variability, but rather diminished amounts of diversity the further from Africa any population lives. Long and Kittles find that rather than 85% of human genetic diversity existing in all human populations, about 100% of human diversity exists in a single African population, whereas only about 70% of human genetic diversity exists in a population derived from New Guinea. Long and Kittles argued that this still produces a global human population that is genetically homogeneous compared to other mammalian populations. 
Archaic admixture
There is a hypothesis that anatomically modern humans interbred with Neanderthals during the Middle Paleolithic. In May 2010, the Neanderthal Genome Project presented genetic evidence that interbreeding did likely take place and that a small but significant portion, around 2–4%, of Neanderthal admixture is present in the DNA of modern Eurasians and Oceanians, and nearly absent in sub-Saharan African populations.
Between 4% and 6% of the genome of Melanesians (represented by the Papua New Guinean and Bougainville Islander) is thought to derive from Denisova hominins, a previously unknown species which shares a common origin with Neanderthals. It was possibly introduced during the early migration of the ancestors of Melanesians into Southeast Asia. This history of interaction suggests that Denisovans once ranged widely over eastern Asia.
Thus, Melanesians emerge as the most archaic-admixed population, having Denisovan/Neanderthal-related admixture of
In a study published in 2013, Jeffrey Wall from the University of California studied whole-genome sequence data and found higher rates of introgression in Asians compared to Europeans. Hammer et al. tested the hypothesis that contemporary African genomes have signatures of gene flow with archaic human ancestors and found evidence of archaic admixture in the genomes of some African groups, suggesting that modest amounts of gene flow were widespread throughout time and space during the evolution of anatomically modern humans.
New data on human genetic variation has reignited the debate about a possible biological basis for categorization of humans into races. Most of the controversy surrounds the question of how to interpret the genetic data and whether conclusions based on it are sound. Some researchers argue that self-identified race can be used as an indicator of geographic ancestry for certain health risks and medications.
Although the genetic differences among human groups are relatively small, differences in certain genes such as Duffy, ABCC11, and SLC24A5, called ancestry-informative markers (AIMs), nevertheless can be used to reliably situate many individuals within broad, geographically based groupings. For example, computer analyses of hundreds of polymorphic loci sampled in globally distributed populations have revealed the existence of genetic clustering that roughly is associated with groups that historically have occupied large continental and subcontinental regions (Rosenberg et al. 2002; Bamshad et al. 2003).
Some commentators have argued that these patterns of variation provide a biological justification for the use of traditional racial categories. They argue that the continental clusterings correspond roughly with the division of human beings into sub-Saharan Africans; Europeans, Western Asians, Central Asians, Southern Asians and Northern Africans; Eastern Asians, Southeast Asians, Polynesians and Native Americans; and other inhabitants of Oceania (Melanesians, Micronesians & Australian Aborigines) (Risch et al. 2002). Other observers disagree, saying that the same data undercut traditional notions of racial groups (King and Motulsky 2002; Calafell 2003; Tishkoff and Kidd 2004). They point out, for example, that major populations considered races or subgroups within races do not necessarily form their own clusters.
Furthermore, because human genetic variation is clinal, many individuals affiliate with two or more continental groups. Thus, the genetically based "biogeographical ancestry" assigned to any given person generally will be broadly distributed and will be accompanied by sizable uncertainties (Pfaff et al. 2004).
In many parts of the world, groups have mixed in such a way that many individuals have relatively recent ancestors from widely separated regions. Although genetic analyses of large numbers of loci can produce estimates of the percentage of a person's ancestors coming from various continental populations (Shriver et al. 2003; Bamshad et al. 2004), these estimates may assume a false distinctiveness of the parental populations, since human groups have exchanged mates from local to continental scales throughout history (Cavalli-Sforza et al. 1994; Hoerder 2002). Even with large numbers of markers, information for estimating admixture proportions of individuals or groups is limited, and estimates typically will have wide confidence intervals (Pfaff et al. 2004).
Genetic clustering
Genetic data can be used to infer population structure and assign individuals to groups that often correspond with their self-identified geographical ancestry. Jorde and Wooding (2004) argued that "Analysis of many loci now yields reasonably accurate estimates of genetic similarity among individuals, rather than populations. Clustering of individuals is correlated with geographic origin or ancestry."  However, identification by geographic origin may quickly break down when considering historical ancestry shared between individuals back in time. 
An analysis of autosomal SNP data from the International HapMap Project (Phase II) and CEPH Human Genome Diversity Panel samples was published in 2009. The study of 53 populations taken from the HapMap and CEPH data (1138 unrelated individuals) suggested that natural selection may shape the human genome much more slowly than previously thought, with factors such as migration within and among continents more heavily influencing the distribution of genetic variations. A similar study published in 2010 found strong genome-wide evidence for selection due to changes in ecoregion, diet, and subsistence, particularly in connection with polar ecoregions, with foraging, and with a diet rich in roots and tubers. In a 2016 study, principal component analysis of genome-wide data was capable of recovering previously known targets for positive selection (without prior definition of populations) as well as a number of new candidate genes.
Forensic anthropology
Forensic anthropologists can assess the ancestry of skeletal remains by analyzing skeletal morphology as well as using genetic and chemical markers, when possible.  While these assessments are never certain, the accuracy of skeletal morphology analyses in determining true ancestry has been estimated at about 90%. 
Gene flow and admixture
Gene flow between two populations reduces the average genetic distance between the populations. Only totally isolated human populations experience no gene flow; most populations have continuous gene flow with neighboring populations, which creates the clinal distribution observed for most genetic variation. When gene flow takes place between well-differentiated genetic populations the result is referred to as "genetic admixture".
Admixture mapping is a technique used to study how genetic variants cause differences in disease rates between populations. Recently admixed populations that trace their ancestry to multiple continents are well suited for identifying genes for traits and diseases that differ in prevalence between parental populations. African-American populations have been the focus of numerous population genetic and admixture mapping studies, including studies of complex genetic traits such as white cell count, body-mass index, prostate cancer and renal disease.
An analysis of phenotypic and genetic variation including skin color and socioeconomic status was carried out in the population of Cape Verde, which has a well-documented history of contact between Europeans and Africans. The studies showed that the pattern of admixture in this population has been sex-biased and that there are significant interactions between socioeconomic status and skin color, independent of ancestry. Another study shows an increased risk of graft-versus-host disease complications after transplantation due to genetic variants in human leukocyte antigen (HLA) and non-HLA proteins.
Differences in allele frequencies contribute to group differences in the incidence of some monogenic diseases, and they may contribute to differences in the incidence of some common diseases.  For the monogenic diseases, the frequency of causative alleles usually correlates best with ancestry, whether familial (for example, Ellis-van Creveld syndrome among the Pennsylvania Amish), ethnic (Tay–Sachs disease among Ashkenazi Jewish populations), or geographical (hemoglobinopathies among people with ancestors who lived in malarial regions). To the extent that ancestry corresponds with racial or ethnic groups or subgroups, the incidence of monogenic diseases can differ between groups categorized by race or ethnicity, and health-care professionals typically take these patterns into account in making diagnoses. 
Even with common diseases involving numerous genetic variants and environmental factors, investigators point to evidence suggesting the involvement of differentially distributed alleles with small to moderate effects. Frequently cited examples include hypertension (Douglas et al. 1996), diabetes (Gower et al. 2003), obesity (Fernandez et al. 2003), and prostate cancer (Platz et al. 2000). However, in none of these cases has allelic variation in a susceptibility gene been shown to account for a significant fraction of the difference in disease prevalence among groups, and the role of genetic factors in generating these differences remains uncertain (Mountain and Risch 2004).
Some other variations, on the other hand, are beneficial to humans, as they prevent certain diseases and increase the chance of adapting to the environment. One example is a mutation in the CCR5 gene that protects against AIDS. Because of the mutation, the CCR5 protein is absent from the surface of the cell. Without CCR5 on the surface, HIV has nothing to grab onto and bind to. The mutation therefore decreases an individual's risk of AIDS. The mutation is also quite common in certain areas: more than 14% of the population carries it in Europe, and about 6–10% in Asia and North Africa.
Apart from mutations, many genes that may have aided humans in ancient times plague humans today. For example, it is suspected that genes that allow humans to more efficiently process food are those that make people susceptible to obesity and diabetes today. 
Neil Risch of Stanford University has proposed that self-identified race/ethnic group could be a valid means of categorization in the US for public health and policy considerations. A 2002 paper by Noah Rosenberg's group makes a similar claim: "The structure of human populations is relevant in various epidemiological contexts. As a result of variation in frequencies of both genetic and nongenetic risk factors, rates of disease and of such phenotypes as adverse drug response vary across populations. Further, information about a patient’s population of origin might provide health care practitioners with information about risk when direct causes of disease are unknown." However, in 2018 Noah Rosenberg released a study, "Interpreting polygenic scores, polygenic adaptation, and human phenotypic differences", arguing against genetically essentialist ideas of health disparities between populations and stating that environmental variants are a more likely cause.
Human genome projects are scientific endeavors that determine or study the structure of the human genome. The Human Genome Project was a landmark genome project.
Materials and Methods
Discovery methods were iterative and adapted for different transcriptome datasets as they emerged from our laboratory. First, primers were selected directly from chum salmon (O. keta) 454 assemblies. Additional SNP primers were selected from SOLiD sequence assemblies from sockeye salmon. These latter sequences originated from 10 fish from five locations (Figure 1, red circles; Table 1).
See Table 1 for location names corresponding to numbers. Sockeye salmon collected for SOLiD sequencing and initial SNP ascertainment are marked with red circles. Samples collected for SNP validation are marked with blue circles. Collections used for SNP assessment and ranking at all 114 SNP loci are marked with green diamonds.
| Sample use | Region | Map number | Location | N |
|---|---|---|---|---|
| SNP discovery | Bristol Bay, Alaska | 1 | Yako Creek | 2 |
| | | 3 | Silverhorn Bay Beach | 2 |
| | Southcentral Alaska | 5 | Mendeltna Creek | 2 |
| SNP discovery | Kamchatka Peninsula | 6 | Hapiza River | 8 |
| validation | Bristol Bay, Alaska | 7 | Deer Creek | 8 |
| | | 9 | Upper Nushagak-Klutapuk Creek | 8 |
| | | 11 | Upper Talarik Creek | 8 |
| | | 12 | Ualik Lake tributary | 8 |
| | Alaska Peninsula | 15 | Hatchery Beach, Chignik | 8 |
| | Southcentral Alaska | 20 | Yentna River slough | 8 |
| | | 21 | Susitna River slough | 8 |
| | Southeast Alaska | 23 | Klukshu River, Alsek | 8 |
| | | 24 | Hugh Smith Lake | 8 |
| | British Columbia, Canada | 26 | Scud River | 8 |
| | | 27 | Taku River mainstem | 8 |
| | | 29 | Meziadin Lake Beach | 8 |
| SNP assessment | Kamchatka Peninsula | 30 | Bolshaya River | 90 |
| | Bristol Bay, Alaska | 32 | Lake Kulik | 68 |
| | Alaska Peninsula | 34 | Bear Lake | 93 |
| | Southcentral Alaska | 36 | Coghill Lake | 89 |
| | British Columbia, Canada | 38 | Upper Tatshenshini River | 88 |
Primers were designed and tested for PCR amplification of a single product on a single pooled sample of DNA. Successful primers were then used to screen individuals for SNPs using HRMA as in McGlauflin et al. HRMA was performed following the manufacturer's instructions on the LightCycler 480 platform (Roche Diagnostics) using eight test fish from each of 24 locations (192 fish total; Figure 1, blue circles; Table 1). These locations were chosen to focus on Bristol Bay populations and also to include a few representatives from the eastern and western Pacific Ocean.
Putative SNPs that were successfully detected using HRMA were selected for Sanger sequencing. Sequences where the identity of the SNP was confirmed by the presence of at least two genotypes were used for designing primers and probes for the 5′-nuclease assays. As a final validation step, each assay was then tested by genotyping the same panel of 192 fish that were used for HRMA. Assays that did not perform well or where the SNP deviated from Hardy-Weinberg expectations (HWE) were discarded (HWE was tested on a subset of populations for which we possessed additional samples of (N =). The Sanger sequences used for 5′-nuclease assay design were used to annotate validated markers using the NCBI sequence database and Blastx. Only assays where the most similar sequence hit had an e-value < 1.0E-10 were annotated.
Six pairs of population samples (hereafter referred to as assessment populations) were chosen from throughout the species' range to assess within- and among-region variability (Figure 1, green diamonds; Table 1). All fish from the 12 assessment populations were genotyped at 114 nuclear loci (Table S1) using 5′-nuclease assays. These SNPs included the 43 new SNPs described in this paper, 68 previously published SNPs for sockeye salmon, and three unpublished markers from the Department of Fisheries and Oceans Canada (Molecular Genetics Laboratory, Pacific Biological Station, Department of Fisheries and Oceans Canada).
Tissues (heart, liver, fin, or axillary process) or genomic DNA were obtained from archived samples at the University of Washington (UW), the Alaska Department of Fish and Game (ADF&G), and the Washington Department of Fish and Wildlife (WDFW). Genomic DNA was extracted as necessary using the DNeasy96 Blood and Tissue Kit (QIAGEN, USA).
The markers were first evaluated with standard population genetic indices on the 12 assessment populations (Table 1) as follows. Populations were tested for deviations from HWE at each locus using chi-square tests as implemented in GenAlEx 6.2. All critical values were corrected for multiple comparisons using a sequential Bonferroni correction. Allelic richness was calculated for each locus in each population using FSTAT v.2.9.3.2 to look for effects of ascertainment bias. Differences in average allelic richness among locations were tested for significance with an ANOVA. Linkage disequilibrium was tested in each collection for each pair of SNPs using Genepop 4. To check for genotyping error, 8% of each collection was genotyped again.
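The two tests named above can be illustrated with a minimal Python sketch. It is not the GenAlEx implementation: it assumes a biallelic SNP, a one-degree-of-freedom chi-square test, and that the "sequential Bonferroni" is Holm's step-down procedure. The function names and the default `alpha` of 0.05 are hypothetical choices made for illustration.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions at a biallelic SNP.

    Takes the observed counts of the three genotypes and returns
    (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p                          # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def sequential_bonferroni(p_values, alpha=0.05):
    """Holm's step-down correction (assumed form of the sequential Bonferroni).

    Returns one boolean per input p-value: True where the test stays
    significant after correction.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break                        # all larger p-values also fail
    return significant
```

In use, each locus in each population would contribute one p-value from `hwe_chi_square`, and the full list would then be passed to `sequential_bonferroni`.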
Population differentiation was measured as FST at each locus using Genepop 4 and between population pairs across all loci using Arlequin 3.5. A principal coordinate analysis with six coordinates was performed in GenAlEx to visualize the genetic relationship among populations. Arlequin was also used to detect outlier loci, candidates for directional selection, across the entire range using the hierarchical island model with six regions (Table 1), 20,000 simulations, 100 demes, and 50 groups. Detection of candidate loci was based on Beaumont and Nichols' original work using heterozygosity and high differentiation to identify outlier loci. The value of these outlier loci for resolving populations was investigated by removing these loci from the data set and then re-measuring genetic differentiation between populations. The significance of differences in genetic differentiation measured with and without outlier loci was tested using a Mantel test.
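As a rough illustration of what FST captures at a single biallelic locus, the sketch below uses the textbook heterozygosity form, FST = (HT − HS) / HT, with equal weight for every population. This is a simplification chosen for clarity and is not the estimator computed by Genepop or Arlequin; the function name and the example frequencies are hypothetical.

```python
def simple_fst(allele_freqs):
    """Heterozygosity-based FST for one biallelic locus.

    allele_freqs: frequency of the same allele in each population.
    Populations are weighted equally (a simplifying assumption).
    """
    n_pops = len(allele_freqs)
    p_bar = sum(allele_freqs) / n_pops
    h_total = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                               # locus is monomorphic overall
    h_within = sum(2 * p * (1 - p) for p in allele_freqs) / n_pops
    return (h_total - h_within) / h_total

# Hypothetical frequencies: identical populations give 0, fixed differences give 1.
print(simple_fst([0.5, 0.5]))
print(simple_fst([0.0, 1.0]))
```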
Each locus was ranked according to five measures: FST, informativeness as calculated by Rosenberg (I_n), average contribution of a locus to principal component analysis (LC), BELS ranking, and WHICHLOCI. We additionally considered the ranking approach GAFS of Topchy et al., but GAFS was not implemented because of its similarity to BELS and its computational costs. Each method used is summarized in Table 2. FST, LC, and I_n are all measures of genetic diversity based on differences in allele frequencies observed at a locus, while BELS and WHICHLOCI are scores based on maximizing the likelihood of assigning a genotype to the correct population. Informativeness (I_n) has been shown to be correlated with FST by Rosenberg et al. The relationship of informativeness to LC was determined using a Spearman's rank correlation. The LC was determined using a multivariate locus comparison method developed by Moazami-Goudarzi and Laloë and implemented in S-Plus (MathSoft, Inc, 2000). Here, locus contribution was determined for the first five principal components.
| Measure | Description | Reference |
|---|---|---|
| FST | Scaled among-population variance in allele frequency | Weir & Cockerham 1984 |
| Locus contribution (LC) | Average contribution of each locus to principal components | Moazami-Goudarzi & Laloë 2002 |
| Informativeness for assignment (I_n) | Estimates potential for an allele to be assigned to one population in comparison to an average population | Rosenberg et al. 2003 |
| BELS | Ranks a locus' performance for maximizing mixture estimation accuracy during individual assignment | Bromaghin 2008 |
| WHICHLOCI | Determines locus efficiency for correct population assignment and propensity to cause false assignment | Banks et al. 2003 |
BELS and WHICHLOCI provide each locus a rank based on the accuracy of individual assignment for that locus and the value lost when the locus is removed from the panel in a jackknife fashion. Loci that result in the greatest loss in individual assignment performance when removed receive the highest score. Both of these locus-ranking programs were run with resampling for a simulated population size of 200 individuals and with 250 iterations. No critical population was defined. In WHICHLOCI, minimum correct assignment was set at 95.0%. In BELS, the performance measure was designated to maximize mean individual assignment accuracy for 100% correct assignment. For BELS, the role of locus input order was explored by running the analyses with four different locus orders: alphabetical, reverse alphabetical, and two randomly generated locus orders. Differences in locus ranks for each input order were tested in a pairwise fashion using the Wilcoxon Signed Rank test.
Initially, each locus was ranked using all individuals available (full set) for the twelve SNP assessment populations (Table 1). However, to reduce the potential for upward bias introduced when loci are ranked and assessed using the same individuals, Anderson's Simple Training and Holdout method was implemented. Half of each assessment population was randomly selected for locus ranking (training set). For populations with an odd number of individuals, the extra individual was assigned to the training set. The remaining individuals (holdout set) were reserved for panel testing. The significance of differences in locus ranks between the full population set and the training population set was tested using the Wilcoxon signed rank test.
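The split described above can be sketched in a few lines of Python. This is an illustrative reading of the text and not the authors' code: the function name and the individual identifiers are hypothetical, and the random generator is passed in so that a split can be reproduced.

```python
import random

def split_training_holdout(individuals, rng):
    """Randomly split one population into training and holdout halves.

    With an odd number of individuals, the extra one goes to the
    training set, as described in the text.
    """
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    n_training = (len(shuffled) + 1) // 2    # rounds up for odd sizes
    return shuffled[:n_training], shuffled[n_training:]

# Hypothetical population of seven fish: four for training, three held out.
population = ["fish_%d" % i for i in range(7)]
training, holdout = split_training_holdout(population, random.Random(42))
print(len(training), len(holdout))
```

The split would be applied separately to each assessment population, so that every population is represented in both sets.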
SNP panels were designed to assess the value of increasing the number of markers included in a panel and to evaluate the different measures for ranking SNPs using the 12 assessment populations. Two panel sizes were selected, 48 and 96 SNPs, to test for differences in resolving power when the number of markers was increased. These panel sizes represent the capabilities of high-throughput genotyping platforms commonly in use at that time.
We assembled seven pairs of 48-SNP and 96-SNP panels. Using the training set, five pairs were created from the top ranked loci for each locus measure. A sixth pair of panels was constructed from top ranked loci (based on their average rank). Finally, a seventh pair of panels was constructed from randomly selected loci.
Each panel was tested for performance with two different methods. Using the program ONCOR, assignment tests were performed assigning holdout set individuals from each assessment population (Table 1) to a baseline of the training set individuals that had been used for SNP ranking. Since the origin of assigned individuals was known, the probability of assignment to the population of origin was reported for assignment accuracy. The second method used to assess panel performance was a simulation of individual assignment described by Rosenberg as implemented by Ackerman et al. These simulations use the allele frequencies for user-described populations to assign a simulated individual back to the correct population and report the probability that this assignment is correct. Here individuals were simulated using allele frequencies from holdout set individuals for each population. For each panel, individual assignment was simulated 500 times with 1000 individuals in R, and the frequency of correct assignment (f_ORCA) was reported.
Differences in panel performance for both assessment methods were tested using an ANOVA and the post hoc Tukey's Honestly Significant Difference test (α = 0.01).
In addition to panel testing, we examined the value of using the full set of loci and the change in assignment accuracy with decreasing panel size after the subsequent removal of loci. Beginning with the full set of 110 polymorphic loci, ONCOR was used to determine probability of correct assignment similarly to 96- and 48-SNP panel assessment. Loci were then excluded five at a time by lowest average rank (Table S1) until only the five top ranked loci remained for individual assignment. Mean values and 1st and 3rd quartiles were calculated from the resulting probabilities in Excel (Microsoft for Macs 2011).
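The stepwise reduction described above can be sketched as follows. This is only an illustration of the panel bookkeeping, assuming the loci arrive already sorted from best to worst average rank; the function name and the locus labels are hypothetical, and the assignment test run on each panel (ONCOR in the text) is not reproduced here.

```python
def shrinking_panels(loci_best_first, step=5, minimum=5):
    """Return successively smaller panels, dropping the worst-ranked loci.

    loci_best_first: loci sorted from best to worst average rank.
    Each round removes the `step` lowest-ranked loci until only
    `minimum` loci remain.
    """
    panels = []
    panel = list(loci_best_first)
    while len(panel) >= minimum:
        panels.append(list(panel))
        if len(panel) == minimum:
            break
        panel = panel[:max(minimum, len(panel) - step)]
    return panels

# Hypothetical labels for the 110 polymorphic loci, best-ranked first.
loci = ["locus_%03d" % i for i in range(1, 111)]
panels = shrinking_panels(loci)
print([len(p) for p in panels])
```

Each panel in the returned list would then be scored for assignment accuracy, giving one accuracy value per panel size.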
Behavioral interactions and syndromes
Field observations of zebrafish behavior are few and anecdotal, and so much of what zebrafish do in nature has to be inferred from their behavior in the lab. One behavior that has received considerable attention is the formation of loose social aggregations, or shoals, which have been observed in the field (see Figure 1C) and studied in the lab (Engeszer et al., 2007b; Gerlai, 2014). This behavior might provide protection from predators, improved foraging success, or access to mates. Shoaling increases steadily from early larval stages, and individuals ‘imprint’ on a particular visual phenotype, showing a preference for this phenotype by the time they are juveniles (Engeszer et al., 2004, 2007a; Spence and Smith, 2007; Mahabir et al., 2013). Interestingly, wild-caught and lab fish (both previously imprinted on the ‘wild type’) have similar preferences for prospective shoaling partners when presented with fish that have different pigment patterns and other phenotypes, although the specifics differ between sexes: female preferences appear to be complex, whereas males show strong preferences that correlate with stripe quality and species identity (Engeszer et al., 2008). Many additional factors might also influence whether or not zebrafish shoal together in the wild, including fish size, group size, sex ratio, olfactory stimuli, kin recognition, predation risk and light regime (e.g., Pritchard et al., 2001; Gerlach and Lysiak, 2006; Ruhl et al., 2009).
Lab strains of zebrafish spawn all year round, but breeding in the wild occurs primarily during the summer monsoons, when ephemeral pools appear; these presumably offer plenty to eat and some shelter from currents and predators. Still waters might also facilitate pheromonal communication relevant to oogenesis and courtship (Bloom and Perlmutter, 1977; van den Hurk and Lambert, 1983; van den Hurk et al., 1987; Gerlach, 2006). Spawning tends to occur near daybreak, and can involve male territoriality, as well as female preferences for oviposition (egg-laying) sites (Spence et al., 2007a, 2008). Lab studies indicate that courtship and mating behaviors are stereotypic, although some of the details may depend on the conditions in which observations have been made. Behaviors include the initial approach; chasing by the male and touching of the male's nose to the female's side or tail; male circling and quivering; the female leading the male to an oviposition site, or the male pinning the female against an object; and oviposition itself (Darrow and Harris, 2004; Sessa et al., 2008; Kang et al., 2013). Females can lay up to several hundred eggs at once, or smaller numbers every few days, but the actual number of offspring from any given spawning is highly variable. Indeed, males can differ in the clutch sizes they elicit from females (Spence and Smith, 2006), possibly owing to differences in body size (Skinner and Watt, 2007); dominance hierarchies can also influence reproductive success (Paull et al., 2010). Although reproductive maturity can be reached in as little as 4–6 weeks in the lab, where zebrafish are known to live for up to several years, we don't as yet know about the timing of their maturation or their longevity in the wild.
A deeper understanding of courtship and breeding preferences, as well as life history in nature, will be interesting, and may facilitate research in the lab through improvements in spawning and rearing efficiencies (Sessa et al., 2008; Adatto et al., 2011; Nasiadka and Clark, 2012).
Recently, wild zebrafish brought to the lab have provided new insights into behavioral syndromes, in which behaviors co-vary, as in a continuum of boldness and aggression, or correlated changes that occur during domestication (for example, changes in both fearfulness and activity patterns) that likely derive from intentional selection on some traits and relaxed selection on others (Moretz et al., 2007; Norton et al., 2011). Including wild zebrafish in such studies dramatically expands the range of variation. Indeed, comparisons of zebrafish isolated from different geographic regions, and different lab strains, have revealed striking differences in behavioral syndromes among populations (Robison and Rowland, 2005; Oswald and Robison, 2008; Drew et al., 2012; Martins and Bhat, 2014). That such differences can be heritable (Wright et al., 2006; Oswald et al., 2013) suggests that the genetic bases for natural variation in behavioral syndromes, and the evolution of behavioral traits more generally, can be studied using this species. Of critical importance to all of these endeavors are additional observations and experiments in the field, in order to better understand the zebrafish behavioral repertoire and its significance for individual fitness, and also to determine the extent to which habitat differences between field and lab might impact our ability to generalize results from one context to the other.
# Signature in the SNPs
Previously, we drew an extended analogy between how species form and how languages change over time. Through this analogy we came to understand that species, like a group speaking a language, are continuous populations that shift their characteristics incrementally over time. From this analogy we concluded that “species” is a label of convenience for what is in fact a population undergoing continuous, incremental change – just as a “language” is not static but in constant, if gradual, flux. With this understanding we are ready to begin examining our genome with the correct ideas in mind. The question is not, then “when did our species begin?”, since that question is like asking when “English” began. What genetics can address, however, is how many individuals were in our ancestral population as it separated from other species and became a distinct lineage. There are a number of genetic approaches that allow for such estimates, and we will examine a few of them in this series. Importantly, all of the methods return very similar estimates – that our ancestral population has not dropped below about 10,000 individuals over the last several million years. Since the earliest-known anatomically modern humans are present in the fossil record at 200,000 years ago, this minimum population size spans the time during which we biologically “became human” as a population.
Genetic variation and population size
Each of the population-size methods we will examine bases its calculations on the amount of genetic variation in present-day human populations. For any given section of DNA in our genome, any one person can have at most only two versions of it – one received from their mother, and the other from their father. A large population, however, can have many more versions than just two. The amount of genetic variation in a population, then, is connected to the number of individuals in that population. At its most basic, every estimate of ancestral human population size uses present-day genetic variation to estimate how many ancestors are needed to transmit the observed level of variation to the present day. A large amount of present-day genetic variation requires more ancestors than does a small amount.
The advent of genome sequencing, as you might expect, has shed a great deal of light on how much genetic variation is present in modern human populations. One significant source of human genetic variation comes in the form of what are known as single nucleotide polymorphisms, or “SNPs” (pronounced “snips”). “Polymorphism” simply means “having many forms”. SNPs are single DNA letters that are variable among humans, and we have around 300,000 common SNPs in our genome of 3 billion DNA letters. In other words, the majority of our genomes are identical to each other, but a small number of DNA letter positions on our chromosomes are variable. Consider a short section of DNA sequence for six different individuals, with three variable positions:
For any one SNP position, there are a maximum of four possible versions (since there are four DNA letters). Once we consider a few SNPs linked together on the same chromosome, however, the number of possible combinations becomes very large. For example, for just the three SNPs shown above, there are 64 different possible combinations (4 x 4 x 4, or 4³). Twenty SNPs, on the other hand, would have 4²⁰ possible combinations, more than the number of people on the planet. For the six individuals above, we can see that there are five different combinations present. The most likely explanation for these five variants is that they were inherited from five different ancestors, and that persons 5 and 6 inherited their identical combination from the same ancestor. There are other, less likely possibilities, however: some of the combinations might result from new mutations, or from mixing and matching between the different SNPs. For example, person 4 and persons 5 and 6 differ by only one letter: person 4 has an “a” for SNP 1 where persons 5 and 6 have a “t”. One possibility that we need to account for is that person 4 might be descended from the same ancestor as persons 5 and 6, but that a new mutation from t → a occurred at the SNP 1 location. Another possibility is that there was recombination, through a process called “crossing over”, that placed an “a” into this position in person 4. So, when using SNP variation to count the number of likely ancestors, we need to factor in mutation and recombination rates, both of which we can measure directly in humans. In practice, mutation has only a small effect on SNP-based estimates of ancestral population size, since the mutation rate in humans is very, very low. Direct measurements of the rate have been done by sequencing the entire genomes of parents and offspring, and on average there are only about 100 – 150 new mutations every time we copy our genome of three billion DNA letters.
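The counting in this paragraph can be checked with a short Python sketch. The three-letter combinations below are made up for illustration, since the original six sequences are not reproduced here. They are chosen only to match the description: five distinct combinations, persons 5 and 6 identical, and person 4 differing from them at SNP 1 alone.

```python
def possible_combinations(n_snps, letters=4):
    """Number of possible combinations for n linked SNPs with four DNA letters."""
    return letters ** n_snps

def count_distinct(combinations):
    """Number of different SNP combinations actually present in a sample."""
    return len(set(combinations))

# Hypothetical combinations at SNP 1, SNP 2 and SNP 3 for six people.
example = {
    "person 1": "gga",
    "person 2": "cta",
    "person 3": "agt",
    "person 4": "acc",   # differs from persons 5 and 6 only at SNP 1
    "person 5": "tcc",
    "person 6": "tcc",
}

print(possible_combinations(3))           # 64
print(possible_combinations(20))          # 1099511627776
print(count_distinct(example.values()))   # 5
```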
The effects of recombination can also be minimized by choosing SNPs that are linked closely together on the same chromosome. SNPs that are closely linked together recombine only rarely, since there is so little space for crossing over to occur between them. While scientists factor in mutation and recombination rates, in practice they are not a major issue for SNP-based methods.
In practice, estimating population size from SNP variation is simply a matter of sequencing a large number of people from around the globe, cataloging their SNPs, and estimating how many ancestors they would need to have the SNP variation we see in the present day. As you might expect, different people groups have characteristic sets of SNP variants within them. This makes sense, of course, because the members of a given group are more closely related to each other than to members of other groups. Tallying up the number of ancestors using this method consistently returns a total minimum population size of about 10,000 individuals: approximately 8,000 ancestors are needed to explain SNP diversity in sub-Saharan Africa, and about 2,000 ancestors for everyone else. SNP diversity in humans is far too large to result from one ancestral couple at any time in the last 200,000 years – we descend from a population. These values are also in good agreement with older, cruder methods of estimating population size from other types of genetic variation, giving us increased confidence that they are reasonable.
Biological properties of constraint
We investigated the properties of genes and transcripts as a function of their tolerance to pLoF variation (LOEUF). First, we found that LOEUF correlates with the degree of connection of a gene in protein-interaction networks (r = −0.14; P = 1.7 × 10⁻⁵¹ after adjusting for gene length) (Fig. 4a) and functional characterization (Extended Data Fig. 8a). In addition, constrained genes are more likely to be ubiquitously expressed across 38 tissues in the Genotype-Tissue Expression (GTEx) project (Fig. 4b) (LOEUF r = −0.31; P < 1 × 10⁻¹⁰⁰) and have higher expression on average (LOEUF ρ = −0.28; P < 1 × 10⁻¹⁰⁰), consistent with previous results⁴. Although most results in this study are reported at the gene level, we have also extended our framework to compute LOEUF for all protein-coding transcripts, allowing us to explore the extent of differential constraint of transcripts within a given gene. In cases in which a gene contained transcripts with varying levels of constraint, we found that transcripts in the first LOEUF decile were more likely to be expressed across tissues than others in the same gene (n = 1,740 genes), even when adjusted for transcript length (Fig. 4c) (constrained transcripts are on average 6.34 transcripts per million higher; P = 2.2 × 10⁻¹⁴). Furthermore, we found that the most constrained transcript for each gene was typically the most highly expressed transcript in tissues with disease relevance²⁴ (Extended Data Fig. 8c), which supports the need for transcript-based variant interpretation, as explored in more depth in an accompanying manuscript¹⁵.
a, The mean number of protein–protein interactions is plotted as a function of LOEUF decile: more constrained genes have more interaction partners (LOEUF linear regression r = −0.14; P = 1.7 × 10⁻⁵¹). Error bars correspond to 95% confidence intervals. b, The number of tissues where a gene is expressed (transcripts per million > 0.3), binned by LOEUF decile, is shown as a violin plot with the mean number overlaid as points: more constrained genes are more likely to be expressed in several tissues (LOEUF linear regression r = −0.31; P < 1 × 10⁻¹⁰⁰). c, For 1,740 genes in which there exists at least one constrained and one unconstrained transcript, the proportion of expression derived from the constrained transcript is plotted as a histogram.
Finally, we investigated potential differences in LOEUF across human populations, restricting to the same sample size across all populations to remove bias due to differential power for variant discovery. As the smallest population in our exome dataset (African/African American) has only 8,128 individuals, our ability to detect constraint against pLoF variants for individual genes is limited. However, for well-powered genes (expected pLoF ≥ 10) (Supplementary Information), we observed a lower mean observed/expected ratio and LOEUF across genes among African/African American individuals, a population with a larger effective population size, compared with other populations (Extended Data Fig. 8d, e), consistent with the increased efficiency of selection in populations with larger effective population sizes²⁵,²⁶.
Genomic Data Science Working Group
The NHGRI Genomic Data Science Working Group is a subcommittee of the National Advisory Council for Human Genome Research (NACHGR). The working group was created in 2017 to facilitate a deeper engagement of the NACHGR in the numerous and increasingly complex issues at the interface between genomics and data science.
The variability of sedentary V̇O2,max
There is substantial evidence that V̇O2,max varies considerably among subjects who claim to abstain from systematic exercise training. Bouchard et al. (1998) found V̇O2,max to vary by more than twofold among 429 sedentary individuals from 86 families, as reported in their pioneering HERITAGE (HEalth, RIsk factors, exercise Training And GEnetics) family study. Bouchard et al. (1998) defined sedentary subjects as subjects who did not engage in more than one weekly exercise session of maximally 30 min duration at an energy expenditure of 7 METs for subjects ≥50 years and 8 METs for subjects <50 years.
This enormous interindividual difference in sedentary V̇O2,max begs the question as to the hereditary contribution to this phenotypic trait. Klissouras (1971) was the first to study the heritability of V̇O2,max in a systematic way, comparing pairs of monozygotic twins (MZ, n=15) with pairs of dizygotic twins (DZ, n=10). Klissouras (1971) found V̇O2,max to be virtually identical between MZ and DZ; however, when V̇O2,max values of twins were pairwise regressed, the correlation coefficient r for MZ was 0.91 while that for DZ was only 0.44. From this, Klissouras (1971) calculated the heritable component of V̇O2,max to be 0.93.
Since then there have been a number of twin-sibling studies, as reviewed by Schutte et al. (2016). These authors performed a sample size weighted meta-analysis on all heritability studies of maximal oxygen consumption in children, adolescents and young adults. They found that 59% (n=1088) and 72% (n=1004) of the variability of measured V̇O2,max or of V̇O2,max relative to body mass (V̇O2,max/Mb), respectively, could be explained by genetic influences. Overall, they concluded that innate factors determine more than 50% of the inter-individual differences in V̇O2,max in the sampled populations.
The heredity estimates for V̇O2,max reported by Schutte et al. (2016) are higher than those reported by Bouchard et al. (1998) for the HERITAGE family study. Bouchard et al. (1998) analyzed sedentary V̇O2,max data obtained from the members of 86 nuclear families using stepwise multiple regression procedures. An analysis of variance (ANOVA) was performed to verify the family aggregation of V̇O2,max by comparing between-family and within-family variance. The analysis showed this variance to be 2.6 to 2.9 times larger between than within families. Depending on adjustments (for sex, body mass, fat mass and fat-free mass), heritability estimates between 51% and 59% were calculated. Using appropriate regression models, Bouchard et al. (1998) were also able to calculate the maternal (mitochondrial) contribution to general heritability – they reported maternal contribution to be 29–36% (more than half of total heritability). Bouchard et al. (1998) denote these heritability estimates as ‘maximal’ as their approach did not allow for isolating and subtracting the contribution of familial environment to overall heritability.
In summary, the currently available data indicate a large inter-individual variability of sedentary V̇O2,max, spanning at least a twofold range between the lowest and the highest estimates. Twin-sibling studies and familial-resemblance studies both indicate that at least 50% of this variability is of genetic origin, with maternal, mitochondrial inheritance being more than half of the total inheritance.
Genome-wide association studies: how does genetics relate to common diseases?
The Human Genome Project made it possible to ask and address new types of scientific questions. One example of such an important question is determining which SNPs increase or decrease risk for a given disease (recall that SNPs are genetic bases which can differ between people). Before the HGP, if scientists wanted to answer this type of question, they could only realistically focus on a few small regions of the genome at a time. Now, it would theoretically be possible to sequence many people with and without a disease and systematically test each base in the genome, asking: is one version of a SNP more common in people who have the disease? This type of study design is called a genome-wide association study (GWAS) (Figure 3). One of the important considerations for GWAS is cost efficiency, as sequencing the entire genome is still far too expensive to perform on large numbers of people. Therefore, scientists often use a cheaper approach: selecting hundreds of thousands of known SNPs ahead of time and testing each individual’s genotype at only those SNPs.
Figure 3. Genome-wide association studies. This image depicts an overview of the genome-wide association study (GWAS) procedure, where scientists collect DNA information from patients and healthy controls and then systematically test for SNPs that are associated with having the disease of interest.
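The core question of a GWAS, whether one version of a SNP is more common in people who have the disease, can be sketched as a chi-square test on allele counts, repeated for each SNP. The example below is a simplified illustration using only the Python standard library. The SNP names and counts are hypothetical, and real studies use dedicated software and also correct for the very large number of SNPs tested.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are the alternate and reference alleles.
    Returns (chi-square statistic, p-value).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_totals = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts per SNP: (case_alt, case_ref, ctrl_alt, ctrl_ref)
snp_counts = {
    "snp_A": (60, 40, 40, 60),  # alternate allele enriched in cases
    "snp_B": (50, 50, 50, 50),  # no difference between cases and controls
}

results = {name: allelic_chi_square(*counts) for name, counts in snp_counts.items()}
for name, (chi2, p) in results.items():
    print(f"{name}: chi2={chi2:.2f}, p={p:.4f}")
```

In this toy data set only `snp_A` shows a difference in allele frequency between cases and controls; a genotyping-array study would run the same kind of test across hundreds of thousands of preselected SNPs.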
Scientists had previously been fairly successful at determining the genes that cause many rare and severe diseases, such as cystic fibrosis and sickle-cell anemia. For these types of diseases, often a single SNP with an extremely strong effect could be pinpointed (though it’s important to note that gene discovery does not immediately translate into therapeutic drug development – it is only the first step of a long and complex process). It seemed natural to hope that GWAS would prove similarly effective at determining the genetic basis for more common diseases, such as heart disease, diabetes, inflammatory bowel disease, and schizophrenia. In the first years of GWAS, however, it became apparent that matters would not be so simple: the findings suggested that a very large number of genes – for some traits, easily into the hundreds or perhaps even thousands – might have effects on a given disease. Moreover, these effects tended to be very small for each SNP (for example, a given SNP that affects risk for obesity is usually associated with gaining only a fraction of a pound).
This conceptual discovery has been an important advance in our understanding of human biology. In the context of drug development, this finding means that targeting a single gene with a drug may not cure all people with a particular disease; however, scientists are working to use information gained from GWAS to develop and improve therapeutic treatments.
Outstanding questions about the natural history of house mice.
Although house mice have been studied for more than a century, there are still important questions about the basic biology of wild house mice that remain largely unanswered. Answers to these questions would further strengthen the mouse as a model for research.
What is the nature and extent of variation in morphology, physiology, reproduction, and development among wild house mice that have adapted to live in different environments? Although house mice are known to occur in a wide variety of environments, we still know relatively little about their physiological ecology. For example, how do some house mice survive extreme cold, high elevations, or extremely arid regions? Further study of such populations would likely provide additional mouse models for important phenotypes.
What are the determinants of social structure in mice? Mice sometimes live in small demes and sometimes live in larger aggregations, but much remains to be learned about the causes of these differences.
Which pathogens and parasites are present in house mice from different areas? Infectious agents can be a powerful evolutionary force, yet we know little about natural infections of mice from different places. Pathogens are likely to vary between temperate and tropical areas, but this remains largely uninvestigated. Similarly, mice from different areas may have evolved resistance to different pathogens, but this too is mostly unstudied.
What determines the limits to the distribution of house mice? They are amazingly successful at colonizing new areas, but they are not found everywhere. How important is competition with native rodents in determining the distribution of house mice?
Which genes underlie adaptation? Our understanding of the genetic basis of adaptive differences is in its infancy. There have been some genome scans for selection, but few instances in which specific genes have been linked to specific phenotypes.
What is the structure of haplotype blocks in natural wild mouse populations? Understanding haplotype structure is important for characterizing and understanding the evolution of recombination and will also lay the groundwork for association studies using wild mice.