Metadata

This protocol defines metadata used in the other GA4GH protocols.

enum Strand
Symbols:NEG_STRAND|POS_STRAND

Indicates the DNA strand associated with a data item.

  • NEG_STRAND: The negative (-) strand.
  • POS_STRAND: The positive (+) strand.

record Position
Fields:
  • referenceName (string) – The name of the Reference on which the Position is located.
  • position (long) –
    The 0-based offset from the start of the forward strand for that Reference.
    Genomic positions are non-negative integers less than Reference length.
  • strand (Strand) – Strand the position is associated with.

A Position is an unoriented base in some Reference. It is represented by a Reference name and a 0-based base number on that Reference.
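
The coordinate rules above (0-based offsets, non-negative, strictly less than the Reference length) can be illustrated with a minimal Python sketch. The class layout, the snake_case field names, and the helpers `validate_position` and `from_one_based` are assumptions made for illustration; they are not part of the protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Strand(Enum):
    """Mirrors the Strand enum symbols from the schema."""
    NEG_STRAND = "NEG_STRAND"
    POS_STRAND = "POS_STRAND"


@dataclass(frozen=True)
class Position:
    """Hypothetical in-memory mirror of the Position record."""
    reference_name: str  # schema field: referenceName
    position: int        # schema field: position (0-based, forward strand)
    strand: Strand       # schema field: strand


def validate_position(pos: Position, reference_length: int) -> None:
    """Raise ValueError unless 0 <= pos.position < reference_length."""
    if not 0 <= pos.position < reference_length:
        raise ValueError(
            f"position {pos.position} is outside [0, {reference_length}) "
            f"on reference {pos.reference_name!r}"
        )


def from_one_based(reference_name: str, one_based: int, strand: Strand) -> Position:
    """Convert a 1-based coordinate (as used in SAM or VCF text) to a Position."""
    return Position(reference_name, one_based - 1, strand)


# The first base of a reference has offset 0, not 1.
first_base = from_one_based("chr1", 1, Strand.POS_STRAND)
validate_position(first_base, reference_length=1000)
```

Note that the offset is always counted from the start of the forward strand, so a Position on NEG_STRAND uses the same number as the base opposite it on POS_STRAND.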

record ExternalIdentifier
Fields:
  • database (string) –
    The source of the identifier.
    (e.g. Ensembl)
  • identifier (string) –
    The ID defined by the external database.
    (e.g. ENST00000000000)
  • version (string) –
    The version of the object or the database.
    (e.g. 78)

An identifier from a public database.
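
As a rough illustration, the record can be mirrored as a small Python dataclass and round-tripped through JSON using the example values given in the field descriptions. The JSON layout shown here is an assumption for illustration only; the protocol text above does not fix a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExternalIdentifier:
    """Hypothetical in-memory mirror of the ExternalIdentifier record."""
    database: str
    identifier: str
    version: str  # a string in the schema, even when it looks numeric


ext_id = ExternalIdentifier(
    database="Ensembl",
    identifier="ENST00000000000",
    version="78",
)

# Serialize to JSON and read the record back.
encoded = json.dumps(asdict(ext_id), sort_keys=True)
decoded = ExternalIdentifier(**json.loads(encoded))
```

All three fields are strings, so a version such as 78 should be written as "78" rather than as a JSON number.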

enum CigarOperation
Symbols:ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH

An enum for the different types of CIGAR alignment operations that exist. Used wherever CIGAR alignments are used. The different enumerated values have the following usage:

  • ALIGNMENT_MATCH: An alignment match indicates that a sequence can be aligned to the reference without evidence of an INDEL. Unlike the SEQUENCE_MATCH and SEQUENCE_MISMATCH operators, the ALIGNMENT_MATCH operator does not indicate whether the reference and read sequences are an exact match. This operator is equivalent to SAM’s ‘M’.
  • INSERT: The insert operator indicates that the read contains evidence of bases being inserted into the reference. This operator is equivalent to SAM’s ‘I’.
  • DELETE: The delete operator indicates that the read contains evidence of bases being deleted from the reference. This operator is equivalent to SAM’s ‘D’.
  • SKIP: The skip operator indicates that this read skips a long segment of the reference, but the bases have not been deleted. This operator is commonly used when working with RNA-seq data, where reads may skip long segments of the reference between exons. This operator is equivalent to SAM’s ‘N’.
  • CLIP_SOFT: The soft clip operator indicates that bases at the start/end of a read have not been considered during alignment. This may occur if the majority of a read maps, except for low quality bases at the start/end of a read. This operator is equivalent to SAM’s ‘S’. Bases that are soft clipped will still be stored in the read.
  • CLIP_HARD: The hard clip operator indicates that bases at the start/end of a read have been omitted from this alignment. This may occur if this linear alignment is part of a chimeric alignment, or if the read has been trimmed (e.g., during error correction, or to trim poly-A tails for RNA-seq). This operator is equivalent to SAM’s ‘H’.
  • PAD: The pad operator indicates that there is padding in an alignment. This operator is equivalent to SAM’s ‘P’.
  • SEQUENCE_MATCH: This operator indicates that this portion of the aligned sequence exactly matches the reference (i.e., all bases are equal to the reference bases). This operator is equivalent to SAM’s ‘=’.
  • SEQUENCE_MISMATCH: This operator indicates that this portion of the aligned sequence is an alignment match to the reference, but a sequence mismatch (i.e., the bases are not equal to the reference). This can indicate a SNP or a read error. This operator is equivalent to SAM’s ‘X’.

record CigarUnit
Fields:
  • operation (CigarOperation) – The operation type.
  • operationLength (long) – The number of bases that the operation runs for.
  • referenceSequence (null|string) –
    referenceSequence is only used at mismatches (SEQUENCE_MISMATCH)
    and deletions (DELETE). Filling this field replaces the MD tag. If the relevant information is not available, leave this field as null.

A structure for an instance of a CIGAR operation. FIXME: This belongs under Reads (only readAlignment refers to this)
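
To make the mapping between CigarOperation symbols and SAM's single-letter codes concrete, the following Python sketch parses a SAM CIGAR string into a list of CigarUnit-like objects and back. The class layout and the helper names `parse_cigar`, `to_sam_cigar`, and `reference_span` are hypothetical; only the enum symbols and their SAM equivalents come from the descriptions above. The sketch leaves `reference_sequence` as null (None), since a CIGAR string alone does not carry reference bases.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CigarOperation(Enum):
    """Enum symbols from the schema, valued by their SAM equivalents."""
    ALIGNMENT_MATCH = "M"
    INSERT = "I"
    DELETE = "D"
    SKIP = "N"
    CLIP_SOFT = "S"
    CLIP_HARD = "H"
    PAD = "P"
    SEQUENCE_MATCH = "="
    SEQUENCE_MISMATCH = "X"


@dataclass
class CigarUnit:
    """Hypothetical in-memory mirror of the CigarUnit record."""
    operation: CigarOperation
    operation_length: int                      # schema field: operationLength
    reference_sequence: Optional[str] = None   # schema field: referenceSequence


_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the reference, per the SAM specification.
_CONSUMES_REFERENCE = {
    CigarOperation.ALIGNMENT_MATCH,
    CigarOperation.DELETE,
    CigarOperation.SKIP,
    CigarOperation.SEQUENCE_MATCH,
    CigarOperation.SEQUENCE_MISMATCH,
}


def parse_cigar(cigar: str) -> List[CigarUnit]:
    """Parse a SAM CIGAR string such as '3S10M2I' into CigarUnits."""
    units = []
    consumed = 0
    for match in _CIGAR_TOKEN.finditer(cigar):
        if match.start() != consumed:  # unparseable text between tokens
            raise ValueError(f"invalid CIGAR string: {cigar!r}")
        length, code = match.groups()
        units.append(CigarUnit(CigarOperation(code), int(length)))
        consumed = match.end()
    if consumed != len(cigar):  # trailing text that is not a valid token
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return units


def to_sam_cigar(units: List[CigarUnit]) -> str:
    """Render CigarUnits back into a SAM CIGAR string."""
    return "".join(f"{u.operation_length}{u.operation.value}" for u in units)


def reference_span(units: List[CigarUnit]) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(
        u.operation_length for u in units if u.operation in _CONSUMES_REFERENCE
    )


units = parse_cigar("3S10M2I5D4=1X")
```

For this input, `reference_span(units)` counts the M, D, = and X runs (10 + 5 + 4 + 1 = 20 bases); soft clips and insertions consume read bases only. SAM's `*` placeholder for an unavailable CIGAR is rejected by this parser with a ValueError, and callers would need to handle it separately.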

record Experiment
Fields:
  • id (string) – The experiment UUID. This is globally unique.
  • name (null|string) – The name of the experiment.
  • description (null|string) – A description of the experiment.
  • recordCreateTime (string) –
    The time at which this record was created.
    Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  • recordUpdateTime (string) –
    The time at which this record was last updated.
    Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  • runTime (null|string) –
    The time at which this experiment was performed.
    Granularity here is variable (e.g. date only). Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42)
  • molecule (null|string) – The molecule examined in this experiment. (e.g. genomic DNA, total RNA)
  • strategy (null|string) –
    The experiment technique or strategy applied to the sample.
    (e.g. whole genome sequencing, RNA-seq, RIP-seq)
  • selection (null|string) –
    The method used to enrich the target. (e.g. immunoprecipitation, size
    fractionation, MNase digestion)
  • library (null|string) – The name of the library used as part of this experiment.
  • libraryLayout (null|string) – The configuration of sequenced reads. (e.g. Single or Paired)
  • instrumentModel (null|string) –
    The instrument model used as part of this experiment.
    This maps to sequencing technology in BAM.
  • instrumentDataFile (null|string) –
    The data file generated by the instrument.
    TODO: This isn’t actually a file is it? Should this be instrumentData instead?
  • sequencingCenter (null|string) – The sequencing center used as part of this experiment.
  • platformUnit (null|string) –
    The platform unit used as part of this experiment. This is a flowcell-barcode
    or slide unique identifier.
  • info (map<array<string>>) – A map of additional experiment information.

An experimental preparation of a sample.
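
The recordCreateTime and recordUpdateTime fields are strings rather than native timestamps, so clients have to format and parse them. The sketch below shows one way to do this with Python's standard library. It assumes timestamps are expressed in UTC with a trailing Z and millisecond precision, as in the example value 2015-02-10T00:03:42.123Z; the helper names `to_record_time` and `parse_record_time` are hypothetical.

```python
from datetime import datetime, timezone


def to_record_time(moment: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDTHH:MM:SS.SSSZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000  # truncate to millisecond precision
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def parse_record_time(text: str) -> datetime:
    """Parse a string such as 2015-02-10T00:03:42.123Z into an aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


created = datetime(2015, 2, 10, 0, 3, 42, 123000, tzinfo=timezone.utc)
record_create_time = to_record_time(created)
```

This sketch does not cover runTime, whose granularity is variable (for example, date only); a parser for that field would have to accept several ISO 8601 shapes.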

record Dataset
Fields:
  • id (string) – The dataset’s id, locally unique to the server instance.
  • name (null|string) – The name of the dataset.
  • description (null|string) – Additional, human-readable information on the dataset.

A Dataset is a collection of related data of multiple types. Data providers decide how to group data into datasets. See [Metadata API](../api/metadata.html) for a more detailed discussion.