This lesson is being piloted (Beta version)

Variant Analysis Workshop: Glossary

Key Points

Intro to Variant Call Format
  • A VCF is a table with samples in columns and SNPs (or other variants) in rows.

  • FORMAT fields contain variant-by-sample data pertaining to genotype calls.

  • INFO fields contain statistics about each variant.
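To make the row and column layout concrete, here is a minimal plain-Python sketch that splits one VCF data line into its fixed columns, its INFO statistics, and its per-sample FORMAT values. The header and record below are made up for illustration, as are the sample names `sample1` and `sample2` and the helper `parse_record`. It is not how the Bioconductor tools in this workshop read a VCF.

```python
# Hypothetical, tab-separated VCF header and data line with two made-up samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2"
record = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;AF=0.25\tGT:DP\t0/1:14\t0/0:16"


def parse_record(header_line, data_line):
    """Split one VCF row into fixed columns, INFO statistics, and per-sample data."""
    columns = header_line.lstrip("#").split("\t")
    values = data_line.split("\t")

    # The first eight columns describe the variant itself (one row per variant).
    fixed = dict(zip(columns[:8], values[:8]))

    # INFO holds per-variant statistics as key=value pairs; flags have no value.
    info = {}
    if fixed["INFO"] != ".":
        for item in fixed["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True

    # FORMAT names the per-sample fields; each sample column supplies the values.
    format_keys = values[8].split(":")
    genotypes = {
        sample: dict(zip(format_keys, sample_value.split(":")))
        for sample, sample_value in zip(columns[9:], values[9:])
    }
    return fixed, info, genotypes


fixed, info, genotypes = parse_record(header, record)
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"])
print(info)
print(genotypes["sample1"])
```

Note that INFO yields one set of values for the whole row, while FORMAT yields one set of values per sample column.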

Bioconductor basics
  • FaFile creates a pointer to a reference genome file on your computer.

  • An index file allows quick access to specific information from large files.

  • GRanges stores positions within a genome for any type of feature (SNP, exon, etc.)

  • DNAStringSet stores DNA sequences.

  • SummarizedExperiment stores the results of a set of assays across a set of samples.

Importing a VCF into Bioconductor
  • Index the VCF file with indexTabix if you plan to import only certain ranges.

  • Use filterVcf to filter variants to a new file without importing data into R.

  • Use ScanVcfParam to specify which fields, samples, and genomic ranges you want to import.

Running statistics on SNP markers
  • The snpStats package can convert genotypes to numeric format and calculate statistics.
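As an illustration of what "numeric format" can mean, the sketch below converts VCF genotype strings into counts of non-reference alleles (0, 1, or 2 for a diploid), with missing calls kept as `None`. This is a common encoding written in plain Python under that assumption. The function `alt_allele_count` is hypothetical and does not reproduce the internal coding used by snpStats.

```python
import re


def alt_allele_count(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns None when any allele in the call is missing ('.').
    """
    alleles = re.split(r"[/|]", gt)  # '/' is unphased, '|' is phased
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


calls = ["0/0", "0/1", "1|1", "./."]
numeric = [alt_allele_count(gt) for gt in calls]
print(numeric)  # [0, 1, 2, None]
```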

Working with genome annotations
  • Genome annotations can either be stored as GRanges imported with rtracklayer, or as TxDb imported with GenomicFeatures.

  • Functions that find overlaps between GRanges objects can be used to identify genes near SNPs.

  • The predictCoding function in VariantAnnotation identifies amino acid changes caused by SNPs.
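The idea of finding genes near SNPs by overlapping ranges can be sketched in plain Python. This is only a conceptual illustration, assuming 1-based inclusive coordinates. The `Gene` type, the `genes_near` function, and the gene coordinates are all made up, and the linear scan is far simpler than what GRanges overlap functions do.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_near(snp_chrom, snp_pos, genes, max_gap=0):
    """Return names of genes on the same chromosome within max_gap bp of the SNP."""
    hits = []
    for gene in genes:
        same_chrom = gene.chrom == snp_chrom
        in_window = gene.start - max_gap <= snp_pos <= gene.end + max_gap
        if same_chrom and in_window:
            hits.append(gene.name)
    return hits


genes = [
    Gene("geneA", "chr1", 1000, 2000),
    Gene("geneB", "chr1", 5000, 6000),
    Gene("geneC", "chr2", 1500, 2500),
]

print(genes_near("chr1", 1500, genes))                # ['geneA']
print(genes_near("chr1", 4800, genes))                # []
print(genes_near("chr1", 4800, genes, max_gap=500))   # ['geneB']
```

Widening `max_gap` is what turns "SNPs inside genes" into "SNPs near genes".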

Glossary

Allele frequency
The frequency of an allele in a population. Out of all copies of a locus (in a diploid, two times the number of individuals), what proportion are the allele in question?
Alternative allele
The allele(s) that differ from the reference sequence at a given locus.
Annotation
The locations of genes, transcripts, exons, and CDS in the genome, as well as some metadata about genes.
Candidate gene
A gene hypothesized to impact a trait.
Causative SNP
A SNP that directly affects a trait, as opposed to one that is merely linked to the causative mutation.
CDS
Coding sequence. DNA sequence that can be directly translated to amino acid sequence.
FASTA
A common file format for DNA, RNA, and amino acid sequences. Reference genome sequences are commonly stored as FASTA.
GFF
General Feature Format. A type of file that lists the genome annotation. GFF3 and GTF are types of GFF.
GWAS
Genome-wide association study. Analysis that looks for associations between phenotypes and variants.
Hardy-Weinberg Equilibrium
A state in which genotype frequencies can be predicted from allele frequencies. It is expected when all individuals in a population mate at random with each other and forces such as selection, migration, and mutation are negligible.
Linkage disequilibrium
The extent to which a genotype at one locus can predict the genotype at another locus. Caused by physical linkage as well as population structure.
Minor allele frequency
The frequency of the less common allele. Ranges from 0 to 0.5 by definition.
Reference allele
The allele matching the reference sequence at a given locus.
Reference genome
A genome sequence and annotation for a particular species. Often, the reference is based on the genome of one individual.
SNP
Single nucleotide polymorphism. A position in the genome where a single base differs among individuals.
VCF
Variant Call Format. A flexible file format for storing the location of variants (SNPs and otherwise) in the genome, as well as genotypes of a set of samples for those variants.
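
The definitions of allele frequency, minor allele frequency, and Hardy-Weinberg Equilibrium above can be worked through numerically. The following is a minimal plain-Python sketch, assuming a biallelic SNP in a diploid population. The genotype counts (30, 50, 20) and the function names are made up for illustration.

```python
def allele_frequencies(n_ref_hom, n_het, n_alt_hom):
    """Return (reference, alternative) allele frequencies for a biallelic locus."""
    n_individuals = n_ref_hom + n_het + n_alt_hom
    n_copies = 2 * n_individuals  # diploid: two copies of the locus per individual
    ref_freq = (2 * n_ref_hom + n_het) / n_copies
    alt_freq = (2 * n_alt_hom + n_het) / n_copies
    return ref_freq, alt_freq


def hwe_expected_counts(ref_freq, alt_freq, n_individuals):
    """Expected genotype counts under Hardy-Weinberg Equilibrium: p^2, 2pq, q^2."""
    return (
        ref_freq ** 2 * n_individuals,
        2 * ref_freq * alt_freq * n_individuals,
        alt_freq ** 2 * n_individuals,
    )


# Hypothetical counts: 30 ref/ref, 50 ref/alt, and 20 alt/alt individuals.
observed = (30, 50, 20)
ref_freq, alt_freq = allele_frequencies(*observed)
maf = min(ref_freq, alt_freq)  # the minor allele is the less common one
expected = hwe_expected_counts(ref_freq, alt_freq, sum(observed))

print("reference allele frequency:", round(ref_freq, 4))
print("alternative allele frequency:", round(alt_freq, 4))
print("minor allele frequency:", round(maf, 4))
print("expected counts under HWE:", [round(x, 2) for x in expected])
```

With these counts there are 200 copies of the locus, 90 of which are the alternative allele, so its frequency is 90/200 = 0.45. Because 0.45 is below 0.5, it is also the minor allele frequency.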