Reads

This file defines the objects used to represent reads and alignments, most importantly ReadGroupSet, ReadGroup, and ReadAlignment. See {TODO: LINK TO READS OVERVIEW} for more information.

enum Strand
Symbols:NEG_STRAND|POS_STRAND

Indicates the DNA strand associated with some data item.

  • NEG_STRAND: The negative (-) strand.
  • POS_STRAND: The positive (+) strand.

record Position
Fields:
  • referenceName (string) – The name of the Reference on which the Position is located.
  • position (long) –
    The 0-based offset from the start of the forward strand for that Reference.
    Genomic positions are non-negative integers less than Reference length.
  • strand (Strand) – Strand the position is associated with.

A Position is an unoriented base in some Reference. A Position is represented by a Reference name, and a base number on that Reference (0-based).
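
As an illustration of the 0-based convention, the following minimal Python sketch checks that an offset is non-negative and less than the Reference length. The `Position` dataclass and the `is_valid_position` helper are hypothetical names introduced here and are not part of the schema; the Reference length is assumed to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical in-memory mirror of the Position record."""
    reference_name: str  # mirrors referenceName
    position: int        # 0-based offset from the start of the forward strand
    strand: str          # "POS_STRAND" or "NEG_STRAND"


def is_valid_position(pos: Position, reference_length: int) -> bool:
    """Return True if the offset is non-negative and less than the reference length."""
    return 0 <= pos.position < reference_length
```

With a Reference of length 100, offsets 0 through 99 are valid and offset 100 is not.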

record ExternalIdentifier
Fields:
  • database (string) –
    The source of the identifier.
    (e.g. Ensembl)
  • identifier (string) –
    The ID defined by the external database.
    (e.g. ENST00000000000)
  • version (string) –
    The version of the object or the database.
    (e.g. 78)

An identifier from a public database.

enum CigarOperation
Symbols:ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH

An enum for the different types of CIGAR alignment operations that exist. Used wherever CIGAR alignments are used. The different enumerated values have the following usage:

  • ALIGNMENT_MATCH: An alignment match indicates that a sequence can be aligned to the reference without evidence of an INDEL. Unlike the SEQUENCE_MATCH and SEQUENCE_MISMATCH operators, the ALIGNMENT_MATCH operator does not indicate whether the reference and read sequences are an exact match. This operator is equivalent to SAM’s ‘M’.
  • INSERT: The insert operator indicates that the read contains evidence of bases being inserted into the reference. This operator is equivalent to SAM’s ‘I’.
  • DELETE: The delete operator indicates that the read contains evidence of bases being deleted from the reference. This operator is equivalent to SAM’s ‘D’.
  • SKIP: The skip operator indicates that this read skips a long segment of the reference, but the bases have not been deleted. This operator is commonly used when working with RNA-seq data, where reads may skip long segments of the reference between exons. This operator is equivalent to SAM’s ‘N’.
  • CLIP_SOFT: The soft clip operator indicates that bases at the start/end of a read have not been considered during alignment. This may occur if the majority of a read maps, except for low quality bases at the start/end of a read. This operator is equivalent to SAM’s ‘S’. Bases that are soft clipped will still be stored in the read.
  • CLIP_HARD: The hard clip operator indicates that bases at the start/end of a read have been omitted from this alignment. This may occur if this linear alignment is part of a chimeric alignment, or if the read has been trimmed (e.g., during error correction, or to trim poly-A tails for RNA-seq). This operator is equivalent to SAM’s ‘H’.
  • PAD: The pad operator indicates that there is padding in an alignment. This operator is equivalent to SAM’s ‘P’.
  • SEQUENCE_MATCH: This operator indicates that this portion of the aligned sequence exactly matches the reference (i.e., all bases are equal to the reference bases). This operator is equivalent to SAM’s ‘=’.
  • SEQUENCE_MISMATCH: This operator indicates that this portion of the aligned sequence is an alignment match to the reference, but a sequence mismatch (i.e., the bases are not equal to the reference). This can indicate a SNP or a read error. This operator is equivalent to SAM’s ‘X’.

record CigarUnit
Fields:
  • operation (CigarOperation) – The operation type.
  • operationLength (long) – The number of bases that the operation runs for.
  • referenceSequence (null|string) –
    referenceSequence is only used at mismatches (SEQUENCE_MISMATCH)
    and deletions (DELETE). Filling this field replaces the MD tag. If the relevant information is not available, leave this field as null.

A structure for an instance of a CIGAR operation. FIXME: This belongs under Reads (only readAlignment refers to this)
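
To make the correspondence with SAM concrete, here is a minimal Python sketch that renders an array of CigarUnit records as a SAM CIGAR string and computes how many reference and read bases the alignment covers. CigarUnits are represented as plain dicts keyed by the schema field names, and the helper names are hypothetical. The sets of operations that consume reference or read bases are taken from the SAM specification, not from this schema.

```python
# CigarOperation symbol -> SAM character, as listed for the CigarOperation enum.
SAM_OP = {
    "ALIGNMENT_MATCH": "M", "INSERT": "I", "DELETE": "D", "SKIP": "N",
    "CLIP_SOFT": "S", "CLIP_HARD": "H", "PAD": "P",
    "SEQUENCE_MATCH": "=", "SEQUENCE_MISMATCH": "X",
}

# Per the SAM specification: M, D, N, = and X advance along the reference;
# M, I, S, = and X advance along the stored read sequence.
CONSUMES_REFERENCE = {"ALIGNMENT_MATCH", "DELETE", "SKIP",
                      "SEQUENCE_MATCH", "SEQUENCE_MISMATCH"}
CONSUMES_READ = {"ALIGNMENT_MATCH", "INSERT", "CLIP_SOFT",
                 "SEQUENCE_MATCH", "SEQUENCE_MISMATCH"}


def to_sam_cigar(units):
    """Render CigarUnit dicts as a SAM CIGAR string such as '5S10M'."""
    return "".join(f"{u['operationLength']}{SAM_OP[u['operation']]}" for u in units)


def reference_span(units):
    """Number of reference bases covered by the alignment."""
    return sum(u["operationLength"] for u in units
               if u["operation"] in CONSUMES_REFERENCE)


def read_length(units):
    """Number of stored read bases (hard-clipped bases are not stored)."""
    return sum(u["operationLength"] for u in units
               if u["operation"] in CONSUMES_READ)


example_cigar = [
    {"operation": "CLIP_SOFT", "operationLength": 5},
    {"operation": "ALIGNMENT_MATCH", "operationLength": 10},
    {"operation": "DELETE", "operationLength": 2},
    {"operation": "INSERT", "operationLength": 3},
    {"operation": "ALIGNMENT_MATCH", "operationLength": 4},
]
```

For `example_cigar`, `read_length` counts the soft-clipped bases because they remain stored in the read, whereas hard-clipped bases would be excluded.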

record Experiment
Fields:
  • id (string) – The experiment UUID. This is globally unique.
  • name (null|string) – The name of the experiment.
  • description (null|string) – A description of the experiment.
  • recordCreateTime (string) –
    The time at which this record was created.
    Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  • recordUpdateTime (string) –
    The time at which this record was last updated.
    Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
  • runTime (null|string) –
    The time at which this experiment was performed.
    Granularity here is variable (e.g. date only). Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42)
  • molecule (null|string) – The molecule examined in this experiment. (e.g. genomic DNA, total RNA)
  • strategy (null|string) –
    The experiment technique or strategy applied to the sample.
    (e.g. whole genome sequencing, RNA-seq, RIP-seq)
  • selection (null|string) –
    The method used to enrich the target. (e.g. immunoprecipitation, size
    fractionation, MNase digestion)
  • library (null|string) – The name of the library used as part of this experiment.
  • libraryLayout (null|string) – The configuration of sequenced reads. (e.g. Single or Paired)
  • instrumentModel (null|string) –
    The instrument model used as part of this experiment.
    This maps to sequencing technology in BAM.
  • instrumentDataFile (null|string) –
    The data file generated by the instrument.
    TODO: This isn’t actually a file is it? Should this be instrumentData instead?
  • sequencingCenter (null|string) – The sequencing center used as part of this experiment.
  • platformUnit (null|string) –
    The platform unit used as part of this experiment. This is a flowcell-barcode
    or slide unique identifier.
  • info (map<array<string>>) – A map of additional experiment information.

An experimental preparation of a sample.
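
The recordCreateTime and recordUpdateTime strings can be handled with the Python standard library. The sketch below is one possible approach and assumes the full millisecond form shown in the field examples, with the trailing `Z` denoting UTC. The function names are hypothetical, and the variable-granularity runTime field is not covered.

```python
from datetime import datetime, timezone

# strptime pattern for values such as 2015-02-10T00:03:42.123Z
RECORD_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_record_time(text: str) -> datetime:
    """Parse a record timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, RECORD_TIME_FORMAT).replace(tzinfo=timezone.utc)


def format_record_time(moment: datetime) -> str:
    """Format a timezone-aware datetime with millisecond precision and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
```

Formatting truncates to milliseconds, so a value parsed from the schema's format round-trips unchanged.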

record Dataset
Fields:
  • id (string) – The dataset’s id, locally unique to the server instance.
  • name (null|string) – The name of the dataset.
  • description (null|string) – Additional, human-readable information on the dataset.

A Dataset is a collection of related data of multiple types. Data providers decide how to group data into datasets. See [Metadata API](../api/metadata.html) for a more detailed discussion.

record Program
Fields:
  • commandLine (null|string) – The command line used to run this program.
  • id (null|string) – The user specified ID of the program.
  • name (null|string) – The name of the program.
  • prevProgramId (null|string) – The ID of the program run before this one.
  • version (null|string) – The version of the program run.

Program can be used to track the provenance of how read data was generated.
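
As a sketch of how prevProgramId can be used to reconstruct provenance, the following Python function orders Program records from the first program run to the last. It assumes a single linear chain in which exactly one program has a null prevProgramId and every program has an id. Programs are plain dicts keyed by the schema field names, and the function name and the example IDs are made up for illustration.

```python
def provenance_order(programs):
    """Return program IDs in run order by following prevProgramId links."""
    # Index each program by the ID of the program that ran before it.
    by_previous = {p.get("prevProgramId"): p for p in programs}
    ordered = []
    current = by_previous.get(None)  # the program with no predecessor
    # The length guard stops the walk if the links form a cycle.
    while current is not None and len(ordered) < len(programs):
        ordered.append(current["id"])
        current = by_previous.get(current["id"])
    return ordered


example_programs = [
    {"id": "step3", "prevProgramId": "step2"},
    {"id": "step1", "prevProgramId": None},
    {"id": "step2", "prevProgramId": "step1"},
]
```

Branching histories, where several programs share one predecessor, would need a different structure than this single-chain walk.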

record ReadStats
Fields:
  • alignedReadCount (null|long) – The number of aligned reads.
  • unalignedReadCount (null|long) – The number of unaligned reads.
  • baseCount (null|long) –
    The total number of bases.
    This is equivalent to the sum of alignedSequence.length for all reads.

ReadStats can be used to provide summary statistics about read data.
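
The relationship between these counts and the underlying reads can be sketched as follows. This is an illustrative Python helper, not a prescribed implementation: it takes a list of ReadAlignment records as plain dicts, treats a record with a null alignment as unaligned, and assumes one record per read, so secondary and supplementary records would be counted again.

```python
def summarize_reads(reads):
    """Compute ReadStats-style counts from a list of ReadAlignment dicts."""
    aligned = sum(1 for r in reads if r.get("alignment") is not None)
    # baseCount is the sum of alignedSequence lengths; a null sequence counts as 0.
    base_count = sum(len(r.get("alignedSequence") or "") for r in reads)
    return {
        "alignedReadCount": aligned,
        "unalignedReadCount": len(reads) - aligned,
        "baseCount": base_count,
    }
```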

record ReadGroup
Fields:
  • id (string) – The read group ID.
  • datasetId (null|string) – The ID of the dataset this read group belongs to.
  • name (null|string) – The read group name.
  • description (null|string) – The read group description.
  • sampleId (null|string) –
    The sample this read group’s data was generated from.
    Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the SM tag in a BAM file.
  • experiment (null|Experiment) – The experiment used to generate this read group.
  • predictedInsertSize (null|int) – The predicted insert size of this read group.
  • created (null|long) – The time at which this read group was created in milliseconds from the epoch.
  • updated (null|long) –
    The time at which this read group was last updated in milliseconds
    from the epoch.
  • stats (null|ReadStats) – Statistical data on reads in this read group.
  • programs (array<Program>) – The programs used to generate this read group.
  • referenceSetId (null|string) –
    The ID of the reference set to which the reads in this read group are aligned.
    Required if there are any read alignments.
  • info (map<array<string>>) – A map of additional read group information.

A ReadGroup is a set of reads derived from one physical sequencing process.
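
Unlike the ISO 8601 strings used by Experiment, the created and updated fields are integers counting milliseconds from the epoch. A minimal Python sketch of the conversion follows; it assumes the Unix epoch (1970-01-01T00:00:00Z), uses hypothetical function names, and relies on integer arithmetic to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed Unix epoch


def from_epoch_ms(milliseconds: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=milliseconds)


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)
```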

record ReadGroupSet
Fields:
  • id (string) – The read group set ID.
  • datasetId (null|string) – The ID of the dataset this read group set belongs to.
  • name (null|string) – The read group set name.
  • stats (null|ReadStats) – Statistical data on reads in this read group set.
  • readGroups (array<ReadGroup>) – The read groups in this set.

A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.

record LinearAlignment
Fields:
  • position (Position) – The position of this alignment.
  • mappingQuality (null|int) –
    The mapping quality of this alignment, meaning the likelihood that the read
    maps to this position.

    Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.

  • cigar (array<CigarUnit>) –
    Represents the local alignment of this sequence (alignment matches, indels, etc)
    versus the reference.

A linear alignment describes the alignment of a read to a Reference, using a position and CIGAR array.
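
The mappingQuality formula, -10 log10 Pr(mapping position is wrong), can be worked through in a few lines of Python. The function names below are hypothetical. Note that Python's built-in `round` resolves exact halves to the nearest even integer, which may differ from the rounding used by a particular aligner.

```python
import math


def mapping_quality(p_wrong: float) -> int:
    """Phred-scale the probability that the mapping position is wrong."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("p_wrong must be in the interval (0, 1]")
    return round(-10 * math.log10(p_wrong))


def error_probability(quality: int) -> float:
    """Invert the formula: the probability implied by a mapping quality."""
    return 10 ** (-quality / 10)
```

For example, an error probability of 0.001 corresponds to a mapping quality of 30, and a probability of 0.01 to a quality of 20.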

record Fragment
Fields:
  • id (string) – The fragment ID.

A fragment represents a contiguous stretch of a DNA or RNA molecule. Reads can be associated with a fragment to specify they derive from the same molecule. TODO: this Fragment object is essentially unused, and may be removed in a future PR.

record ReadAlignment
Fields:
  • id (null|string) –
    The read alignment ID. This ID is unique within the read group this
    alignment belongs to.

    For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.

  • readGroupId (string) –
    The ID of the read group this read belongs to.
    (Every read must belong to exactly one read group.)
  • fragmentId (string) –
    The fragment ID that this ReadAlignment belongs to.
    TODO: this is the only reference to the Fragment object, which may be removed in a future PR.
  • fragmentName (string) – The fragment name. Equivalent to QNAME (query template name) in SAM.
  • properPlacement (null|boolean) –
    Whether the orientation and the distance between reads from the fragment are
    consistent with the sequencing protocol (equivalent to SAM flag 0x2).
  • duplicateFragment (null|boolean) – Whether the fragment is a PCR or optical duplicate (SAM flag 0x400).
  • numberReads (null|int) – The number of reads in the fragment (extension to SAM flag 0x1).
  • fragmentLength (null|int) – The observed length of the fragment, equivalent to TLEN in SAM.
  • readNumber (null|int) –
    The read ordinal in the fragment, 0-based and less than numberReads. This
    field replaces SAM flags 0x40 and 0x80 and is intended to more cleanly represent multiple reads per fragment.
  • failedVendorQualityChecks (null|boolean) – Whether the read fails platform or vendor quality checks (SAM flag 0x200).
  • alignment (null|LinearAlignment) –
    The alignment for this alignment record. This field will be null if the read
    is unmapped.
  • secondaryAlignment (null|boolean) –
    Whether this alignment is secondary. Equivalent to SAM flag 0x100.
    A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome.

    By convention, each read has one and only one alignment where both secondaryAlignment and supplementaryAlignment are false.

  • supplementaryAlignment (null|boolean) –
    Whether this alignment is supplementary. Equivalent to SAM flag 0x800.
    Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores.

    In each linear alignment in a chimeric alignment, the read will be hard clipped. The alignedSequence and alignedQuality fields in the alignment record will only represent the bases for its respective linear alignment.

  • alignedSequence (null|string) –
    The bases of the read sequence contained in this alignment record (equivalent
    to SEQ in SAM).

    alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

  • alignedQuality (array<int>) –
    The quality of the read sequence contained in this alignment record
    (equivalent to QUAL in SAM).

    alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

  • nextMatePosition (null|Position) –
    The mapping of the primary alignment of the (readNumber+1)%numberReads
    read in the fragment. It replaces mate position and mate strand in SAM.
  • info (map<array<string>>) – A map of additional read alignment information.

Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.
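
Because several ReadAlignment fields are described in terms of SAM flags, the mapping can be sketched as a Python function that derives a SAM FLAG value from a ReadAlignment dict. This is an illustrative subset under stated assumptions: the bit meanings for 0x4, 0x10, and 0x20 come from the SAM specification, not from this schema; the mate-unmapped bit (0x8) is not derived because a null nextMatePosition is ambiguous; and both function names are hypothetical.

```python
def sam_flag(read: dict) -> int:
    """Derive a SAM FLAG value from a ReadAlignment dict (subset of bits)."""
    flag = 0
    number_reads = read.get("numberReads") or 1
    if number_reads > 1:
        flag |= 0x1  # template has multiple reads
        number = read.get("readNumber")
        if number is not None:
            if number == 0:
                flag |= 0x40  # first read in the fragment
            elif number == number_reads - 1:
                flag |= 0x80  # last read in the fragment
            else:
                flag |= 0x40 | 0x80  # SAM convention for a middle read
    if read.get("properPlacement"):
        flag |= 0x2
    alignment = read.get("alignment")
    if alignment is None:
        flag |= 0x4  # unmapped: the alignment field is null
    elif alignment["position"]["strand"] == "NEG_STRAND":
        flag |= 0x10  # read aligned to the negative strand
    mate = read.get("nextMatePosition")
    if mate is not None and mate["strand"] == "NEG_STRAND":
        flag |= 0x20  # next read aligned to the negative strand
    if read.get("secondaryAlignment"):
        flag |= 0x100
    if read.get("failedVendorQualityChecks"):
        flag |= 0x200
    if read.get("duplicateFragment"):
        flag |= 0x400
    if read.get("supplementaryAlignment"):
        flag |= 0x800
    return flag


def next_read_number(read_number: int, number_reads: int) -> int:
    """Ordinal of the read whose primary alignment nextMatePosition describes."""
    return (read_number + 1) % number_reads
```

For a properly placed pair, the first read on the negative strand with its mate on the positive strand yields 83, and the corresponding second read yields 163.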