Terminology & Resources
Align/Map – This is the process of mapping, or aligning, DNA reads to where they match the reference genome best. This is how population genomics studies most often work with whole genome data.
Codon – 3 base pairs in a row that “encode” a specific amino acid. A series of codons makes a gene, which is the code of a protein, made of amino acids. Different proteins control how traits develop.
Depth Coverage (Coverage Depth on ActiveSite) – This is the number of DNA reads that align/map to a specific spot in the genome, and is usually recorded for each base pair of the genome. For example, 5X coverage at a base pair would mean that 5 DNA reads have mapped to that exact spot in an overlapping fashion. Average Depth Coverage is the calculation of the overall depth coverage throughout the genome. Depth coverage is never uniform; there will be areas with very low depth coverage and areas with high depth coverage. An average depth coverage of 10X does not mean that there is 10X coverage over the entire genome, but is merely an average of the highs and lows.
DNA Reads – These are the actual sequences of DNA produced as data, the computational code of the genome.
Fixed SNP/ Fixed Difference – this is a special kind of homozygous SNP that has spread throughout an entire population, setting it apart from other populations. The mutation arises and then spreads to the point that it is “fixed” in the genome – in other words you can sequence the DNA of every individual in a population and they will all carry the mutation. Fixed differences separate species, and are the basis for estimating the “genetic distance”, or genetic difference, between them.
Genetic Code – The genetic code is the basis for how DNA controls traits. The base pairs, or letters that make up DNA (A’s, T’s, C’s, and G’s), are organized in words called “codons”, made of 3 base pairs.
Heterozygous SNP – this is a SNP that is different between the two genotypes in an individual. One SNP mutation was inherited from the mother, while a different mutation was inherited from the father. These SNPs are easily discovered in an individual’s genome.
Homozygous SNP – this is a SNP that is the same on both genotypes, or the same from both parents. These can only be discovered when comparing two different individuals.
Non-synonymous SNP – This SNP causes a difference in the genetic code, which can sometimes make a big difference between two organisms. These are the mutations that evolution works with, the basis for selection.
Reference Genome – this is the complete genome code that is used as the “map” to align DNA reads to. The more similar a reference genome is to the individual’s genome being studied, the more DNA will accurately be mapped.
Sequence Coverage – refers to the amount of the reference genome that is mapped by DNA reads. In ActiveSite this is termed “total reference bases covered”. In our data, 90% of the domestic ferret reference genome is covered by black-footed ferret DNA reads, which means that ~2.1 billion base pairs of DNA have been used to generate the data in ActiveSite!
SNP – SNPs are the foundation of genomic studies. A single nucleotide polymorphism, or SNP, is a mutation in the DNA. A SNP can be different between two species, two populations, two individuals, or between the genotypes inherited from each parent in a single individual.
Synonymous SNP – A synonymous SNP does not change the genetic code. This is because more than one codon will code for the same amino acid, much like how many different words can have the same definition in language. Evolution does not shape how these mutations are spread throughout a population.
A Genomics Primer
SNPs are the foundation of genomic studies. SNP stands for single nucleotide polymorphism, which literally means a one (single) base pair (nucleotide) variation among many individuals (polymorphism). In lay terms, it is a mutation found between at least two individuals, and in genomics it is a mutation between two genotypes. A single individual carries two genotypes – one set of DNA (genotype) from the mother and one set of DNA (genotype) from the father. SNPs can therefore be analyzed within a single individual, or within a population of individuals. SNPs shared by more than one individual indicate levels of relatedness, while SNPs found only within one individual, referred to as “singletons”, indicate uniqueness. SNPs come in two forms within a genome – heterozygous and homozygous.
In Mendelian genetics the term allele refers to a variation of a gene – you may be familiar with discussions of genes for eye color and hair color and hearing the terms blue eye allele, brown eye allele, blonde hair allele, brown hair allele, etc. These alleles differ because of the mutations, or SNPs, that they carry. When analyzing genome sequences, a heterozygous SNP is a location in the genome at which the genotype from one parent carries a specific mutation and the genotype from the other parent carries a different mutation. A homozygous SNP is a location at which the genotypes from both parents carry the same mutation. Heterozygous SNPs can be identified in a single individual, while homozygous SNPs can only be identified by comparing individuals.
Fixed SNPs, or fixed differences, are a special type of homozygous SNP that is very important to evolution and population genetics. A fixed difference is a SNP that separates two groups of organisms, rather than two individuals. Fixed differences arise early in the evolution between two species, and then spread through the entire population of a species, becoming “fixed” in every individual. Fixed differences are what allow us to analyze the “genetic distance”, or evolutionary difference, between two species.
The way a genome controls traits in an organism is based on the genetic code. The genetic code is what translates a gene into a “gene product” – such as a trait. The genetic code is written in words called “codons”, which are 3 base pairs long. If you change one of the base pairs through mutation, you might intuitively expect the genetic code to change, but this is not always the case, because more than one codon can code for the same amino acid. Some mutations create changes, while others do not. SNPs that cause a difference in the genetic code are non-synonymous SNPs, while SNPs that do not cause any difference are termed synonymous SNPs.
When DNA is sequenced to decode the genome, the fragments of DNA sequence are referred to as reads rather than sequences, so as not to confuse the data with the process of DNA sequencing – though the two terms are used interchangeably. In population genomics studies, DNA reads are aligned, or mapped, to a reference genome to begin analysis.
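To make the distinction concrete, here is a minimal Python sketch that checks whether a single base change alters the amino acid a codon encodes. It assumes the standard genetic code; the function name classify_snp and the example codons are illustrative choices, not something taken from ActiveSite.

```python
from itertools import product

# Standard genetic code (DNA alphabet), listed in the conventional TCAG order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Return 'synonymous' or 'non-synonymous' for a one-base change.

    codon: three-letter DNA codon, e.g. "GAA"
    position: 0, 1, or 2 (which base of the codon is mutated)
    new_base: the base that replaces the original one
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# GAA and GAG both encode glutamate (E), so this change is silent.
print(classify_snp("GAA", 2, "G"))
# GAA encodes glutamate (E) but GTA encodes valine (V).
print(classify_snp("GAA", 1, "T"))
```

The sketch looks at one codon in isolation. A real analysis also needs to know which strand and reading frame the gene uses, which is not covered here.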
Alignment, or “Mapping”, is the process of placing a fragment of DNA sequence (read) where it best matches another individual’s genome – the reference genome. This is literally like taking a puzzle piece and finding where it matches the picture on the cover of the puzzle box (if the picture and the puzzle were the same size). It is best to use a reference genome that is identical to the genome in the study (like when putting a puzzle together), but if this cannot be obtained then the closest relative’s genome can work. For the black-footed ferret, the domestic ferret’s genome has been sequenced and the code has been assembled (pieced together). The domestic ferret genome is very similar to the black-footed ferret genome, and 90% of it has been “mapped” by aligning black-footed ferret DNA reads to it. This is called “sequence coverage” (in ActiveSite it is defined as “total reference bases covered”). This means that the black-footed ferret data accessible through this website represents 2.1 billion base pairs of DNA!
When DNA reads are mapped to a reference genome, a very important statistic for analysis is something called depth coverage (termed coverage depth on ActiveSite). Depth coverage refers to the number of DNA reads that map to the same location on the reference genome, and is analyzed in terms of every single base pair. For example – 5X depth coverage means that at a single base pair the code has been mapped by 5 DNA reads. Average depth coverage, or mean depth coverage, refers to the number of reads expected to cover any given base pair, averaged across the whole genome. An average depth coverage of 10X means that enough DNA reads were mapped to cover each base pair of the genome with 10 reads if they were spread evenly, but in reality some areas of the genome will not be mapped at all (0X coverage) and other areas could be mapped by many reads, as high as 100X coverage.
This is all due to how DNA is sampled and prepared for sequencing. When DNA is prepared for sequencing, it is isolated from cells in tissue – often hundreds or thousands of cells. This means there are hundreds or thousands of copies of the genome in a DNA sample. The DNA is fragmented for sequencing, and what is eventually sequenced is a random sample from this pool of DNA fragments. Some parts of the genome will be captured by more copies than others, which makes for an uneven distribution of mapped reads.
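The puzzle-piece idea can be sketched in a few lines of Python. This is only a toy illustration: it slides a read along a short reference and keeps the position with the fewest mismatched bases. The sequences and the function name best_match_position are made up for this example, and real alignment software uses far faster indexed methods and also handles insertions and deletions.

```python
def best_match_position(read, reference):
    """Slide the read along the reference and return (position, mismatches)
    for the placement with the fewest mismatched bases."""
    best_position, best_mismatches = None, None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best_mismatches is None or mismatches < best_mismatches:
            best_position, best_mismatches = start, mismatches
    return best_position, best_mismatches


reference = "ACGTACGGTCAATG"  # made-up reference sequence
read = "GGTGAA"               # made-up read that differs from the reference at one base

position, mismatches = best_match_position(read, reference)
print("best position:", position, "mismatches:", mismatches)
```

The single mismatch at the best position is exactly the kind of difference that would be examined as a possible SNP between the read and the reference.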
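As a rough sketch of how these statistics relate, the Python below counts depth at every base pair of a tiny made-up reference, then derives the average depth coverage and the sequence coverage from those counts. The read positions, the reference length, and the function name depth_per_base are assumptions chosen for illustration.

```python
def depth_per_base(reads, reference_length):
    """Count how many reads overlap each base pair of the reference.

    reads: list of (start, end) positions, with end not included
    """
    depth = [0] * reference_length
    for start, end in reads:
        for position in range(start, end):
            depth[position] += 1
    return depth


# Four made-up reads mapped to a made-up 10 base pair reference.
reads = [(0, 5), (2, 7), (3, 8), (3, 6)]
reference_length = 10

depth = depth_per_base(reads, reference_length)
average_depth = sum(depth) / reference_length
sequence_coverage = sum(1 for d in depth if d > 0) / reference_length

print("depth at each base pair:", depth)
print("average depth coverage:", average_depth)
print("sequence coverage:", sequence_coverage)
```

In this toy case the average depth is 1.8X, yet individual base pairs range from 0X to 4X, and only 80% of the reference is covered by at least one read.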
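A small simulation can illustrate why random sampling gives uneven depth. The sketch below drops reads at random positions on a made-up genome; the genome length, read length, number of reads, and random seed are all arbitrary assumptions, and real sequencing has additional sources of bias that this ignores.

```python
import random


def simulate_depth(genome_length, read_length, n_reads, seed=0):
    """Place reads at random start positions and count depth per base pair."""
    rng = random.Random(seed)
    depth = [0] * genome_length
    for _ in range(n_reads):
        # Choose a start so the whole read fits inside the genome.
        start = rng.randint(0, genome_length - read_length)
        for position in range(start, start + read_length):
            depth[position] += 1
    return depth


depth = simulate_depth(genome_length=1000, read_length=50, n_reads=200)
print("average depth:", sum(depth) / len(depth))
print("lowest depth:", min(depth), "highest depth:", max(depth))
```

With 200 reads of 50 base pairs on a 1,000 base pair genome, the average depth is 10X by construction, but the lowest and highest values show how far individual base pairs stray from that average.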
When analyzing heterozygous SNPs and other mutations in the genome, it is important that the mutation has a statistically significant depth coverage in order to be sure that the mutation seen in the DNA reads is genuine, and was not introduced by mistake during the sequencing process (errors are rare, but common enough to pose problems in research). This is why researchers typically reject any SNP that is found on fewer than 25% of the mapped reads. These SNPs are considered errors. A heterozygous SNP should be found on 50% of the mapped reads, but since coverage is not perfect, it is often more or less. To discover if a SNP is heterozygous in ActiveSite, or in your own aligned DNA data, the frequency range is generally accepted as 25-75%, while any SNP at higher than 75% is most likely a homozygous SNP.
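The thresholds described above can be written as a simple rule. This is a minimal Python sketch of that rule only, assuming you already know how many mapped reads at a position carry the SNP; the function name and labels are hypothetical, and it does not check whether the depth itself is high enough to trust.

```python
def classify_by_read_fraction(reads_with_snp, total_reads):
    """Label a SNP using the 25% and 75% read-fraction thresholds."""
    if total_reads <= 0:
        raise ValueError("total_reads must be greater than zero")
    fraction = reads_with_snp / total_reads
    if fraction < 0.25:
        return "likely error"
    if fraction <= 0.75:
        return "heterozygous"
    return "likely homozygous"


# Made-up read counts at three positions, each with 20 mapped reads.
print(classify_by_read_fraction(2, 20))   # SNP on 10% of reads
print(classify_by_read_fraction(9, 20))   # SNP on 45% of reads
print(classify_by_read_fraction(19, 20))  # SNP on 95% of reads
```

A position with very few mapped reads can land in any of these ranges by chance, which is why depth coverage matters before applying the rule.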