A compendium of bioinformatics terminology commonly used across the API.

Short reads and BAM

High-throughput genome and transcriptome sequencing produces millions of short (50-200 nucleotide) sequences, usually referred to as reads. Reads can be produced from:

  1. Complete genomes. These reads can be used to piece together the full genome of an individual.
  2. Exomes. These reads are derived from just the protein-coding regions (exons) of the genome; in humans this reduces the sequence to be covered by more than 97%.
  3. Transcriptomes. Here, the RNA that gets transcribed from the genomic DNA is sequenced, representing only the genes that are active in the tissue that was sampled. Transcriptomes differ from tissue to tissue and can be used to determine differences between tumors and their surrounding tissues.

Reads are usually mapped to a reference genome, for example GRCh37 in humans.

These alignments can be displayed like so:

read1   chr1   234    10M     GACAGTCCCA
read2   chr14  1456   10M     AAAGATTGAC
read3   chrX   2837   7M2I1M  TGGGACTCTA

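In this format, the CIGAR string describes how the read aligns to the genome. read1 aligns to the genome over all 10 of its bases with no insertions or deletions: 10M. Note that M denotes an aligned base, which may be either a match or a mismatch to the reference, so 10M alone does not guarantee that the read is identical to the genome sequence. The first seven bases of read3 align to the genome (7M), followed by a 2-base insertion (2I) and one more aligned base (1M).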
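
The CIGAR strings above can be taken apart with a few lines of code. The following is a minimal sketch, not part of the API: the function names are hypothetical, and the sets of operations that consume read or reference bases follow the SAM specification. It shows that 7M2I1M covers 10 read bases but only 8 reference bases, because inserted bases have no counterpart in the reference.

```python
import re

# Operations that consume bases of the read or of the reference (SAM specification).
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '7M2I1M' into (length, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject input containing anything the pattern did not account for.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)


for cigar in ("10M", "7M2I1M"):
    print(cigar, parse_cigar(cigar), read_length(cigar), reference_span(cigar))
```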

The SAM/BAM format is a way of representing read data. It includes the fields shown above as well as information on read orientation, sequence quality, and optional fields. The format also allows for reads that do not align to the genome, by leaving the reference sequence ID, position, and CIGAR fields unset (in SAM text these are written as the placeholder values *, 0, and *).
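
To make the field layout concrete, here is a minimal sketch of reading one SAM alignment line in plain Python. It assumes the eleven mandatory tab-separated columns defined by the SAM specification; the SamRecord type and the function names are hypothetical, and the FLAG and MAPQ values and the unaligned read4 in the sample lines are made up for illustration. Real code would normally use an existing SAM/BAM library instead.

```python
from typing import NamedTuple, Tuple

UNMAPPED_FLAG = 0x4  # FLAG bit the SAM specification uses for "segment unmapped"


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags (orientation, mapped/unmapped, ...)
    rname: str   # reference sequence name, '*' if unset
    pos: int     # 1-based leftmost mapping position, 0 if unset
    mapq: int    # mapping quality
    cigar: str   # CIGAR string, '*' if unset
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities, '*' if not stored
    optional: Tuple[str, ...]  # any optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a '@' header line) into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, tuple(fields[11:]))


def is_unmapped(record: SamRecord) -> bool:
    """True when the unmapped bit is set in the FLAG field."""
    return bool(record.flag & UNMAPPED_FLAG)


# Illustrative lines only: FLAG/MAPQ values and read4 are invented for this sketch.
mapped = parse_sam_line("read1\t0\tchr1\t234\t60\t10M\t*\t0\t0\tGACAGTCCCA\t*")
unmapped = parse_sam_line("read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*")
print(mapped.rname, mapped.pos, mapped.cigar, is_unmapped(mapped))
print(unmapped.rname, unmapped.pos, unmapped.cigar, is_unmapped(unmapped))
```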

SAM is a human-readable text format; BAM is its compressed binary equivalent. The two formats can be readily converted to each other.

Genetic variants and VCF

Genetic variants are differences in the genome sequence from one individual to the next. Such variation can manifest at different scales, from small changes affecting just one or a few DNA base pairs, to copy number variations of whole exons or genes, to large structural variations affecting megabases or more. The GA4GH Variants schema focuses on small variants for now, because there’s less consensus on how to represent the larger kinds of variation.

Small genetic variants can be represented as edits to a Reference sequence: typically a tuple of (1) Reference sequence name, (2) starting position of the affected portion on the Reference sequence, (3) DNA sequence of the affected portion of the Reference, and (4) alternate DNA sequence found in place of the Reference sequence. Both the reference and alternate sequences are provided in order to represent sequence insertions and deletions (indels). A few examples:

20     14370   G    A
20     17330   TA   T
20     18302   TG   ACC

The first variant is a single-nucleotide substitution of G to A. The second variant is a deletion of an A at position 17,331. The third variant is a multi-nucleotide change to a lengthier sequence starting at position 18,302.
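
The tuple representation and the three cases just described can be sketched as follows. This is an illustration rather than the schema itself: the Variant class, its field names, and the classify function are hypothetical, and the categories are a deliberately coarse split based only on the lengths and shared leading bases of the two sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    reference_name: str  # (1) reference sequence name
    start: int           # (2) starting position, as written in the examples above
    ref: str             # (3) reference bases of the affected portion
    alt: str             # (4) alternate bases found in their place


def classify(variant: Variant) -> str:
    """Coarsely classify a small variant from its ref and alt sequences."""
    ref, alt = variant.ref, variant.alt
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"   # alt is ref with trailing bases removed
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"  # alt is ref with extra bases appended
    return "complex"        # any other multi-nucleotide change


examples = [
    Variant("20", 14370, "G", "A"),
    Variant("20", 17330, "TA", "T"),
    Variant("20", 18302, "TG", "ACC"),
]
for v in examples:
    print(v.reference_name, v.start, v.ref, v.alt, classify(v))
```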

Given a list of such variants, we can specify the genotype of one or more individuals with respect to each variant. The genotype of a diploid individual (for an autosomal variant) may take one of three distinct values: homozygous reference (0,0), heterozygous (0,1), or homozygous alternate (1,1). We can then present a matrix of genotypes, where the rows are variants as shown above, the columns are the individuals, and each entry is one of those three genotype calls (or marked missing):

CHROM  POS     REF  ALT   Alice   Bob
20     14370   G    A     (0,0)   (0,1)
20     17330   TA   T     (0,0)   (1,1)
20     18302   TG   ACC   (0,1)     -
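
One possible in-memory form of this matrix is sketched below. The layout (a dictionary keyed by variant tuple, with one call per individual and None for a missing call) and the function names are assumptions made for illustration, not something the schema prescribes.

```python
individuals = ["Alice", "Bob"]

# Rows are variants; each row holds one genotype call per individual.
# None stands for the missing call shown as "-" in the table.
genotypes = {
    ("20", 14370, "G", "A"): [(0, 0), (0, 1)],
    ("20", 17330, "TA", "T"): [(0, 0), (1, 1)],
    ("20", 18302, "TG", "ACC"): [(0, 1), None],
}


def describe(call):
    """Name the genotype of a diploid call at a variant with one alternate allele."""
    if call is None:
        return "missing"
    first, second = call
    if first != second:
        return "heterozygous"
    return "homozygous reference" if first == 0 else "homozygous alternate"


def alternate_allele_count(calls):
    """Count alternate alleles across all non-missing calls in one row."""
    return sum(1 for call in calls if call is not None
               for allele in call if allele != 0)


for variant, calls in genotypes.items():
    labels = {name: describe(call) for name, call in zip(individuals, calls)}
    print(variant, labels, alternate_allele_count(calls))
```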

(If the phase of an individual’s genotypes across several variant positions is known, then the heterozygous genotypes (0,1) and (1,0) may be considered distinct, where the order specifies which homologous chromosome possesses the alternate sequence.)

It’s possible to observe multiple different alternate sequences, or alleles, affecting the same portion of the reference. This can occur even within one individual, if their two homologous chromosomes contain different alternate sequences, and becomes somewhat common when representing variants observed across a population. To handle these cases, we allow a variant to specify multiple alternate alleles. For example:

20     19254   G    A,C,T
20     21672   AT   AC,TGA

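In this case each genotype entry is an index into the allele list (0 for the reference, then 1, 2, ... for the alternate alleles in the order listed), so genotypes can take values such as (0,3) or (1,2). This multi-allelic site model was refined and popularized by the 1000 Genomes Project’s Variant Call Format (VCF), upon which the Variants schema is based.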
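
Reading genotype values as indices into the allele list, with 0 for the reference and 1 upward for the alternates in the order given, a genotype can be translated back to sequences as in the sketch below. The function name is hypothetical and the comma-separated alternate list is split as it appears in the examples above.

```python
def genotype_sequences(ref, alts, genotype):
    """Map allele indices to sequences: 0 is ref, 1..n are the alternates in order."""
    alleles = [ref] + list(alts)
    for index in genotype:
        if not 0 <= index < len(alleles):
            raise ValueError(f"allele index {index} out of range for {alleles}")
    return tuple(alleles[index] for index in genotype)


# First multi-allelic example: 20  19254  G  A,C,T
alts = "A,C,T".split(",")
print(genotype_sequences("G", alts, (0, 3)))
print(genotype_sequences("G", alts, (1, 2)))

# Second example: 20  21672  AT  AC,TGA
print(genotype_sequences("AT", "AC,TGA".split(","), (1, 2)))
```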

There remain some outstanding challenges with this model of small variants. For example, the same edit to the reference sequence can be represented in multiple ways. There are also different ways to represent clusters of alleles that affect overlapping but non-equal portions of the reference. The GA4GH doesn’t yet prescribe resolutions to these ambiguities, and different conventions are used in practice.
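
As an illustration of the first ambiguity, the deletion written earlier as 20 17330 TA T could equally be written with an extra shared base on either side. The sketch below trims such redundant flanking bases so that the spellings compare equal. It is one possible convention, not a GA4GH rule: the function name is hypothetical, the flanking bases G and C are invented for the example, and full normalization (for instance left-aligning an indel inside a repeat) also needs the reference sequence, which is not handled here.

```python
def trim_shared_bases(pos, ref, alt):
    """Remove bases shared by ref and alt, keeping at least one base in each."""
    # Drop shared trailing bases first; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, moving the position forward each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Three spellings of the same one-base deletion (flanking G and C are assumed).
spellings = [
    (17330, "TA", "T"),
    (17329, "GTA", "GT"),
    (17330, "TAC", "TC"),
]
for spelling in spellings:
    print(spelling, "->", trim_shared_bases(*spelling))
```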

(TODO possible additional/advanced topics: homozygous ref vs. no-call; phasing and phase sets; genotype likelihoods; INFO, FORMAT, QUAL, FILTER)