variants

This file defines the objects used to represent variant calls, most importantly VariantSet, Variant, and Call. See {TODO: LINK TO VARIANTS OVERVIEW} for more information.
message VariantSetMetadata
Fields:
  • key (string) – The top-level key.
  • value (string) – The value field for simple metadata.
  • id (string) – User-provided ID field, not enforced by this API. Two or more pieces of structured metadata with identical id and key fields are considered equivalent. FIXME: If it’s not enforced, then why can’t it be null?
  • type (string) – The type of data.
  • number (string) – The number of values that can be included in a field described by this metadata.
  • description (string) – A textual description of this metadata.
  • attributes (Attributes) – A map of additional information about the metadata record.

This metadata represents VCF header information.
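
As a rough illustration of how a VCF header line could populate these fields, the sketch below parses one header line into a plain dictionary. The helper name `parse_header_line` and the exact mapping from VCF header keys (ID, Number, Type, Description) to the metadata fields are assumptions made for this example, not something this schema defines.

```python
import re

# Hypothetical helper, not part of the schema: the mapping of VCF header
# keys onto VariantSetMetadata fields is an assumption for illustration.
_STRUCTURED = re.compile(r'^##(?P<key>[^=]+)=<(?P<body>.*)>$')
# A value is either a double-quoted string (which may contain commas)
# or a run of characters up to the next comma.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_header_line(line):
    """Parse one '##' VCF header line into a metadata-like dict."""
    line = line.strip()
    if not line.startswith("##"):
        raise ValueError("not a VCF meta-information line: %r" % line)
    match = _STRUCTURED.match(line)
    if match is None:
        # Simple metadata such as ##fileformat=VCFv4.2
        key, _, value = line[2:].partition("=")
        return {"key": key, "value": value}
    pairs = {k: v.strip('"') for k, v in _PAIR.findall(match.group("body"))}
    return {
        "key": match.group("key"),
        "id": pairs.get("ID", ""),
        "number": pairs.get("Number", ""),
        "type": pairs.get("Type", ""),
        "description": pairs.get("Description", ""),
    }
```

Simple lines yield only key and value, while structured lines such as INFO or FORMAT definitions fill id, number, type, and description.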

message VariantSet
Fields:
  • id (string) – The variant set ID.
  • name (string) – The variant set name.
  • dataset_id (string) – The ID of the dataset this variant set belongs to.
  • reference_set_id (string) – The ID of the reference set that describes the sequences used by the variants in this set.
  • metadata (list of VariantSetMetadata) – Optional metadata associated with this variant set. This array can be used to store information about the variant set, such as information found in VCF header fields, that isn’t already available in first-class fields such as “name”.

A VariantSet is a collection of variants and variant calls intended to be analyzed together.

message CallSet
Fields:
  • id (string) – The call set ID.
  • name (string) – The call set name.
  • biosample_id (string) – The Biosample the call set data was generated from.
  • variant_set_ids (list of string) – The IDs of the variant sets this call set has calls in.
  • created (long) – The date this call set was created in milliseconds from the epoch.
  • updated (long) – The time at which this call set was last updated in milliseconds from the epoch.
  • attributes (Attributes) – A map of additional information about the Call Set.

A CallSet is a collection of calls that were generated by the same analysis of the same sample.
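
The created and updated fields are plain integers. A minimal sketch of converting them to and from timezone-aware datetimes follows; it assumes "the epoch" means the Unix epoch (1970-01-01T00:00:00Z), and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "the epoch" is the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_datetime(ms):
    """Convert milliseconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def datetime_to_millis(dt):
    """Convert an aware datetime to whole milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)
```

Using timedelta arithmetic rather than floating-point division keeps the round trip exact for integer millisecond values.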

message Call
Fields:
  • call_set_name (string) – The name of the call set this variant call belongs to. If this field is not present, the ordering of the call sets from a SearchCallSetsRequest over this VariantSet is guaranteed to match the ordering of the calls on this Variant. The number of results will also be the same.
  • call_set_id (string) –

    The ID of the call set this variant call belongs to.

    If this field is not present, the ordering of the call sets from a SearchCallSetsRequest over this VariantSet is guaranteed to match the ordering of the calls on this Variant. The number of results will also be the same.

  • genotype (ListValue) –

    The genotype of this variant call.

    A 0 value represents the reference allele of the associated Variant. Any other value is a 1-based index into the alternate alleles of the associated Variant.

    If a variant had a reference_bases field of “T”, an alternate_bases value of [“A”, “C”], and the genotype was [2, 1], that would mean the call represented the heterozygous value “CA” for this variant. If the genotype was instead [0, 1], the represented value would be “TA”. Ordering of the genotype values is important if the phaseset field is present. Missing genotypes may be indicated using the “dot annotation” [“.”, “.”], as specified in VCF 4.2; this is used, for example, for some types of structural variants.

  • phaseset (string) – If this field is populated, this variant call’s genotype ordering implies the phase of the bases and is consistent with any other variant calls on the same contig which have the same phaseset string.
  • genotype_likelihood (list of double) – The genotype likelihoods for this variant call. Each array entry represents how likely a specific genotype is for this call as log10(P(data | genotype)), analogous to the GL tag in the VCF spec. The value ordering is defined by the GL tag in the VCF spec.
  • attributes (Attributes) – A map of additional information about the Call.
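
The genotype indexing rule described above can be sketched as follows. The function name `genotype_to_alleles` is hypothetical, and representing a missing genotype entry as None in the result is an assumption of this example.

```python
def genotype_to_alleles(reference_bases, alternate_bases, genotype):
    """Resolve genotype indices to allele strings.

    Index 0 is the reference allele; any other value is a 1-based index
    into alternate_bases. Missing entries (".") are returned as None,
    which is a choice made for this sketch.
    """
    alleles = [reference_bases] + list(alternate_bases)
    resolved = []
    for entry in genotype:
        if entry is None or entry == ".":
            resolved.append(None)
            continue
        index = int(entry)  # tolerate numeric values such as 2.0
        if not 0 <= index < len(alleles):
            raise ValueError("genotype index out of range: %r" % entry)
        resolved.append(alleles[index])
    return resolved


# The examples from the genotype description above.
print(genotype_to_alleles("T", ["A", "C"], [2, 1]))  # ['C', 'A']
print(genotype_to_alleles("T", ["A", "C"], [0, 1]))  # ['T', 'A']
```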

A Call represents the determination of genotype with respect to a particular Variant.

It may include associated information such as quality and phasing. For example, a call might assign a probability of 0.32 to the occurrence of a SNP named rs1234 in a call set with the name NA12345.
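
To make the genotype_likelihood ordering concrete, the sketch below computes the position of a diploid genotype j/k in the array using the ordering given for the GL tag in the VCF specification, F(j/k) = k(k+1)/2 + j with j ≤ k. It also shows one possible way to turn the log10 likelihoods into normalized probabilities; that step assumes a flat prior over genotypes, which is an assumption of this example and not something the schema prescribes. Both function names are hypothetical.

```python
def gl_index(j, k):
    """Array index of the diploid genotype j/k under the VCF GL ordering."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def normalized_probabilities(genotype_likelihood):
    """Scale log10 likelihoods to probabilities that sum to 1.

    Assumes a flat prior over genotypes (an assumption of this sketch).
    Subtracting the maximum first avoids underflow for very negative values.
    """
    top = max(genotype_likelihood)
    weights = [10 ** (gl - top) for gl in genotype_likelihood]
    total = sum(weights)
    return [w / total for w in weights]
```

For one alternate allele, the genotypes 0/0, 0/1, and 1/1 map to indices 0, 1, and 2; a second alternate allele adds 0/2, 1/2, and 2/2 at indices 3, 4, and 5.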

message Variant
Fields:
  • id (string) – The variant ID.
  • variant_set_id (string) – The ID of the VariantSet this variant belongs to. This transitively defines the ReferenceSet against which the Variant is to be interpreted.
  • names (list of string) – Names for the variant, for example a RefSNP ID.
  • created (long) – The date this variant was created in milliseconds from the epoch.
  • updated (long) – The time at which this variant was last updated in milliseconds from the epoch.
  • reference_name (string) – The reference on which this variant occurs (e.g. chr20 or X).
  • start (long) – The start position at which this variant occurs (0-based). This corresponds to the first base of the string of reference bases. Genomic positions are non-negative integers less than the reference length. Variants spanning the join of circular genomes are represented as two variants, one on each side of the join (position 0).
  • end (long) – The end position (exclusive), resulting in a [start, end) closed-open interval. This is typically calculated as start plus the length of reference_bases.
  • reference_bases (string) – The reference bases for this variant. They start at the given start position.
  • alternate_bases (list of string) – The bases that appear instead of the reference bases. Multiple alternate alleles are possible.
  • attributes (Attributes) – A map of additional information about the Variant.
  • calls (list of Call) – The variant calls for this particular variant. Each one represents the determination of genotype with respect to this variant. Calls in this array are implicitly associated with this Variant.
  • variant_type (string) –

    The “variant_type” is used to denote e.g. structural variants. Examples:

    DUP : duplication of sequence following “start”; not necessarily in situ

    DEL : deletion of sequence following “start”

  • svlen (long) – Length of the structural variation, if the variant is labeled as such in variant_type. Based on the use in VCFv4.2.
  • cipos (list of sint32) – In the case of structural variants, the start and end of the variant may not be known with an exact base position. “cipos” provides an interval with high confidence for the start position. The interval is provided by 0 or 2 signed integers which are added to the start position. Based on the use in VCFv4.2. Example: [ -12000, 1000 ]
  • ciend (list of sint32) – Similar to “cipos”, but for the variant’s end position (which is derived from start + svlen). Example: [ -1000, 0 ]
  • filters_applied (boolean) – True if filters were applied for this variant. Corresponds to VCF column 7 “FILTER” holding any value other than the missing value.
  • filters_passed (boolean) – True if all filters for this variant passed. Corresponds to VCF column 7 “FILTER” holding the value PASS.
  • filters_failed (list of string) – Zero or more filters that failed for this variant. Corresponds to VCF column 7 “FILTER”, which is shared across all alleles in the same VCF record.
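
The coordinate conventions above can be illustrated with a short sketch: computing the exclusive end from start and reference_bases, and applying a cipos or ciend offset pair to a position. The function names are hypothetical, and treating an empty interval as an exact position is an assumption of this example.

```python
def variant_end(start, reference_bases):
    """Exclusive end of the [start, end) interval covered by the reference bases."""
    return start + len(reference_bases)


def confidence_bounds(position, offsets):
    """Apply a cipos/ciend-style pair of signed offsets to a position.

    offsets holds 0 or 2 integers. An empty list is treated here as
    "position known exactly", which is an assumption of this sketch.
    """
    if len(offsets) == 0:
        return (position, position)
    if len(offsets) != 2:
        raise ValueError("expected 0 or 2 offsets, got %d" % len(offsets))
    low, high = offsets
    return (position + low, position + high)


# A 4-base reference allele starting at 0-based position 100 covers [100, 104).
print(variant_end(100, "ACGT"))                    # 104
# The cipos example [-12000, 1000] applied to a start of 50000.
print(confidence_bounds(50000, [-12000, 1000]))    # (38000, 51000)
```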

A Variant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. Variants belong to a VariantSet. This is equivalent to a row in VCF.
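
Since a Variant corresponds to a VCF row, the three filter fields can be derived from the FILTER column of that row. The sketch below is one possible mapping, not a definitive implementation: the function name is hypothetical, and setting filters_passed to False when no filters were applied is an assumption. It relies on the VCF convention that failed filters are listed as a semicolon-separated string and that "." denotes a missing value.

```python
def filters_from_vcf(filter_column):
    """Map a VCF FILTER column value onto the three filter fields."""
    value = filter_column.strip()
    if value in ("", "."):
        # Missing value: no filters were applied. Reporting
        # filters_passed as False here is an assumption of this sketch.
        return {"filters_applied": False, "filters_passed": False,
                "filters_failed": []}
    if value == "PASS":
        return {"filters_applied": True, "filters_passed": True,
                "filters_failed": []}
    return {"filters_applied": True, "filters_passed": False,
            "filters_failed": value.split(";")}
```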