Reads¶
This file defines the objects used to represent reads and alignments, most importantly ReadGroupSet, ReadGroup, and ReadAlignment. See {TODO: LINK TO READS OVERVIEW} for more information.
-
enum
Strand
¶ Symbols: NEG_STRAND|POS_STRAND Indicates the DNA strand associated with some data item. * NEG_STRAND: The negative (-) strand. * POS_STRAND: The positive (+) strand.
-
record
Position
¶ Fields: - referenceName (string) – The name of the Reference on which the Position is located.
- position (long) –
- The 0-based offset from the start of the forward strand for that Reference.
- Genomic positions are non-negative integers less than Reference length.
- strand (Strand) – Strand the position is associated with.
A Position is an unoriented base in some Reference. A Position is represented by a Reference name, and a base number on that Reference (0-based).
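The range constraint on position can be checked with a small helper. The following is a minimal sketch only: the Python dataclass, the `is_valid_position` helper, and the `reference_length` argument are illustrative assumptions, not part of the schema.

```python
from dataclasses import dataclass
from enum import Enum


class Strand(Enum):
    NEG_STRAND = "NEG_STRAND"
    POS_STRAND = "POS_STRAND"


@dataclass(frozen=True)
class Position:
    # Field names mirror the schema; the dataclass itself is illustrative.
    referenceName: str
    position: int
    strand: Strand


def is_valid_position(pos: Position, reference_length: int) -> bool:
    """Return True if the 0-based offset lies in [0, reference_length).

    reference_length is assumed to be supplied by the caller, for example
    from the matching Reference record.
    """
    return 0 <= pos.position < reference_length
```

Because positions are 0-based, the last valid offset on a Reference of length N is N - 1.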
-
record
ExternalIdentifier
¶ Fields: - database (string) –
- The source of the identifier.
- (e.g. Ensembl)
- identifier (string) –
- The ID defined by the external database.
- (e.g. ENST00000000000)
- version (string) –
- The version of the object or the database
- (e.g. 78)
Identifier from a public database
-
enum
CigarOperation
¶ Symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH An enum for the different types of CIGAR alignment operations that exist. Used wherever CIGAR alignments are used. The different enumerated values have the following usage:
- ALIGNMENT_MATCH: An alignment match indicates that a sequence can be aligned to the reference without evidence of an INDEL. Unlike the SEQUENCE_MATCH and SEQUENCE_MISMATCH operators, the ALIGNMENT_MATCH operator does not indicate whether the reference and read sequences are an exact match. This operator is equivalent to SAM’s M.
- INSERT: The insert operator indicates that the read contains evidence of bases being inserted into the reference. This operator is equivalent to SAM’s I.
- DELETE: The delete operator indicates that the read contains evidence of bases being deleted from the reference. This operator is equivalent to SAM’s D.
- SKIP: The skip operator indicates that this read skips a long segment of the reference, but the bases have not been deleted. This operator is commonly used when working with RNA-seq data, where reads may skip long segments of the reference between exons. This operator is equivalent to SAM’s ‘N’.
- CLIP_SOFT: The soft clip operator indicates that bases at the start/end of a read have not been considered during alignment. This may occur if the majority of a read maps, except for low quality bases at the start/end of a read. This operator is equivalent to SAM’s ‘S’. Bases that are soft clipped will still be stored in the read.
- CLIP_HARD: The hard clip operator indicates that bases at the start/end of a read have been omitted from this alignment. This may occur if this linear alignment is part of a chimeric alignment, or if the read has been trimmed (e.g., during error correction, or to trim poly-A tails for RNA-seq). This operator is equivalent to SAM’s ‘H’.
- PAD: The pad operator indicates that there is padding in an alignment. This operator is equivalent to SAM’s ‘P’.
- SEQUENCE_MATCH: This operator indicates that this portion of the aligned sequence exactly matches the reference (e.g., all bases are equal to the reference bases). This operator is equivalent to SAM’s ‘=’.
- SEQUENCE_MISMATCH: This operator indicates that this portion of the aligned sequence is an alignment match to the reference, but a sequence mismatch (e.g., the bases are not equal to the reference). This can indicate a SNP or a read error. This operator is equivalent to SAM’s ‘X’.
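The SAM equivalences listed above can be collected into a lookup table. This is a sketch under stated assumptions: the `SAM_CODE` table and the `to_sam_cigar` helper are hypothetical names, and operations are passed as plain (operation, operationLength) tuples rather than CigarUnit records.

```python
# Enum symbol -> SAM CIGAR character, as listed in the descriptions above.
SAM_CODE = {
    "ALIGNMENT_MATCH": "M",
    "INSERT": "I",
    "DELETE": "D",
    "SKIP": "N",
    "CLIP_SOFT": "S",
    "CLIP_HARD": "H",
    "PAD": "P",
    "SEQUENCE_MATCH": "=",
    "SEQUENCE_MISMATCH": "X",
}


def to_sam_cigar(units):
    """Render (operation, operationLength) pairs as a SAM CIGAR string.

    Hypothetical helper; raises KeyError for an unknown operation symbol.
    """
    return "".join(f"{length}{SAM_CODE[op]}" for op, length in units)
```

For example, `to_sam_cigar([("CLIP_SOFT", 5), ("ALIGNMENT_MATCH", 90)])` yields the string `5S90M`.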
-
record
CigarUnit
¶ Fields: - operation (CigarOperation) – The operation type.
- operationLength (long) – The number of bases that the operation runs for.
- referenceSequence (null|string) –
- referenceSequence is only used at mismatches (SEQUENCE_MISMATCH)
- and deletions (DELETE). Filling this field replaces the MD tag. If the relevant information is not available, leave this field as null.
A structure for an instance of a CIGAR operation. FIXME: This belongs under Reads (only readAlignment refers to this)
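To illustrate how a list of CigarUnit records relates to a SAM CIGAR string, the sketch below parses a CIGAR string into units and derives two lengths from them. The `parse_cigar`, `reference_span`, and `stored_read_length` helpers are hypothetical; the sets of operations that consume reference or read bases follow the SAM convention and are not stated in this schema.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

SAM_TO_OPERATION = {
    "M": "ALIGNMENT_MATCH", "I": "INSERT", "D": "DELETE", "N": "SKIP",
    "S": "CLIP_SOFT", "H": "CLIP_HARD", "P": "PAD",
    "=": "SEQUENCE_MATCH", "X": "SEQUENCE_MISMATCH",
}

# SAM convention (assumed here): which operations advance along the
# reference, and which correspond to bases stored with the read.
CONSUMES_REFERENCE = {"ALIGNMENT_MATCH", "DELETE", "SKIP",
                      "SEQUENCE_MATCH", "SEQUENCE_MISMATCH"}
CONSUMES_READ = {"ALIGNMENT_MATCH", "INSERT", "CLIP_SOFT",
                 "SEQUENCE_MATCH", "SEQUENCE_MISMATCH"}


@dataclass
class CigarUnit:
    operation: str
    operationLength: int
    referenceSequence: Optional[str] = None  # left null when unknown


def parse_cigar(cigar: str) -> List[CigarUnit]:
    """Split a SAM CIGAR string into CigarUnit records (no validation)."""
    return [CigarUnit(SAM_TO_OPERATION[code], int(length))
            for length, code in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(units: List[CigarUnit]) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(u.operationLength for u in units
               if u.operation in CONSUMES_REFERENCE)


def stored_read_length(units: List[CigarUnit]) -> int:
    """Number of read bases kept in the record; hard clips are excluded."""
    return sum(u.operationLength for u in units
               if u.operation in CONSUMES_READ)
```

Hard-clipped bases count toward neither length, which is why a hard-clipped record stores fewer bases than the full read.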
-
record
Experiment
¶ Fields: - id (string) – The experiment UUID. This is globally unique.
- name (null|string) – The name of the experiment.
- description (null|string) – A description of the experiment.
- recordCreateTime (string) –
- The time at which this record was created.
- Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
- recordUpdateTime (string) –
- The time at which this record was last updated.
- Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
- runTime (null|string) –
- The time at which this experiment was performed.
- Granularity here is variable (e.g. date only). Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42)
- molecule (null|string) – The molecule examined in this experiment. (e.g. genomic DNA, total RNA)
- strategy (null|string) –
- The experiment technique or strategy applied to the sample.
- (e.g. whole genome sequencing, RNA-seq, RIP-seq)
- selection (null|string) –
- The method used to enrich the target. (e.g. immunoprecipitation, size
- fractionation, MNase digestion)
- library (null|string) – The name of the library used as part of this experiment.
- libraryLayout (null|string) – The configuration of sequenced reads. (e.g. Single or Paired)
- instrumentModel (null|string) –
- The instrument model used as part of this experiment.
- This maps to sequencing technology in BAM.
- instrumentDataFile (null|string) –
- The data file generated by the instrument.
- TODO: This isn’t actually a file is it? Should this be instrumentData instead?
- sequencingCenter (null|string) – The sequencing center used as part of this experiment.
- platformUnit (null|string) –
- The platform unit used as part of this experiment. This is a flowcell-barcode
- or slide unique identifier.
- info (map<array<string>>) – A map of additional experiment information.
An experimental preparation of a sample.
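The timestamp format used by recordCreateTime and recordUpdateTime can be produced as sketched below. The `to_record_timestamp` helper is hypothetical, and it assumes timestamps are written in UTC with a trailing `Z`, as in the example value; the format string itself does not state a time zone.

```python
from datetime import datetime, timezone


def to_record_timestamp(dt: datetime) -> str:
    """Format a datetime as YYYY-MM-DDTHH:MM:SS.SSSZ.

    Assumption: values are normalized to UTC, and sub-millisecond
    precision is truncated rather than rounded.
    """
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
```

Passing a timezone-aware datetime is the safer choice, since `astimezone` treats a naive datetime as local time.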
-
record
Dataset
¶ Fields: - id (string) – The dataset’s id, locally unique to the server instance.
- name (null|string) – The name of the dataset.
- description (null|string) – Additional, human-readable information on the dataset.
A Dataset is a collection of related data of multiple types. Data providers decide how to group data into datasets. See [Metadata API](../api/metadata.html) for a more detailed discussion.
-
record
Program
¶ Fields: - commandLine (null|string) – The command line used to run this program.
- id (null|string) – The user specified ID of the program.
- name (null|string) – The name of the program.
- prevProgramId (null|string) – The ID of the program run before this one.
- version (null|string) – The version of the program run.
Program can be used to track the provenance of how read data was generated.
-
record
ReadStats
¶ Fields: - alignedReadCount (null|long) – The number of aligned reads.
- unalignedReadCount (null|long) – The number of unaligned reads.
- baseCount (null|long) –
- The total number of bases.
- This is equivalent to the sum of alignedSequence.length for all reads.
ReadStats can be used to provide summary statistics about read data.
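As an illustration, the three counts could be derived from a collection of read alignment records as sketched below. The `compute_read_stats` helper is hypothetical. It assumes each record is a dict shaped like a ReadAlignment and counts every record once; how secondary and supplementary alignments should be counted is not specified here.

```python
def compute_read_stats(reads):
    """Summarize dict-shaped ReadAlignment records into ReadStats fields.

    Assumptions: a record with a null "alignment" is unaligned, and every
    record contributes len(alignedSequence) bases (0 if it is null).
    """
    aligned = unaligned = bases = 0
    for read in reads:
        if read.get("alignment") is None:
            unaligned += 1
        else:
            aligned += 1
        bases += len(read.get("alignedSequence") or "")
    return {
        "alignedReadCount": aligned,
        "unalignedReadCount": unaligned,
        "baseCount": bases,
    }
```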
-
record
ReadGroup
¶ Fields: - id (string) – The read group ID.
- datasetId (null|string) – The ID of the dataset this read group belongs to.
- name (null|string) – The read group name.
- description (null|string) – The read group description.
- sampleId (null|string) –
- The sample this read group’s data was generated from.
- Note: the current API does not have a rigorous definition of sample. Therefore, this field actually contains an arbitrary string, typically corresponding to the SM tag in a BAM file.
- experiment (null|Experiment) – The experiment used to generate this read group.
- predictedInsertSize (null|int) – The predicted insert size of this read group.
- created (null|long) – The time at which this read group was created in milliseconds from the epoch.
- updated (null|long) –
- The time at which this read group was last updated in milliseconds
- from the epoch.
- stats (null|ReadStats) – Statistical data on reads in this read group.
- programs (array<Program>) – The programs used to generate this read group.
- referenceSetId (null|string) –
- The ID of the reference set to which the reads in this read group are aligned.
- Required if there are any read alignments.
- info (map<array<string>>) – A map of additional read group information.
A ReadGroup is a set of reads derived from one physical sequencing process.
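The created and updated fields hold milliseconds from the epoch. The sketch below converts between such values and Python datetimes using integer arithmetic, which avoids floating-point rounding. Both helper names are hypothetical, and the epoch is assumed to be the Unix epoch (1970-01-01T00:00:00 UTC).

```python
from datetime import datetime, timedelta, timezone

# Assumption: "the epoch" means the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_datetime(ms: int) -> datetime:
    """Convert milliseconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def datetime_to_millis(dt: datetime) -> int:
    """Convert an aware datetime to whole milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)
```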
-
record
ReadGroupSet
¶ Fields: - id (string) – The read group set ID.
- datasetId (null|string) – The ID of the dataset this read group set belongs to.
- name (null|string) – The read group set name.
- stats (null|ReadStats) – Statistical data on reads in this read group set.
- readGroups (array<ReadGroup>) – The read groups in this set.
A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.
-
record
LinearAlignment
¶ Fields: - position (Position) – The position of this alignment.
- mappingQuality (null|int) –
- The mapping quality of this alignment, meaning the likelihood that the read
- maps to this position.
Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.
- cigar (array<CigarUnit>) –
- Represents the local alignment of this sequence (alignment matches, indels, etc)
- versus the reference.
A linear alignment describes the alignment of a read to a Reference, using a position and CIGAR array.
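The mappingQuality definition, -10 log10 Pr(mapping position is wrong) rounded to the nearest integer, can be worked through as sketched below. Both function names are hypothetical, and the inverse function only approximates the original probability because of the rounding.

```python
import math


def mapping_quality(p_wrong: float) -> int:
    """Phred-scaled mapping quality for a given error probability."""
    if not 0 < p_wrong <= 1:
        raise ValueError("p_wrong must be in (0, 1]")
    return round(-10 * math.log10(p_wrong))


def error_probability(mapq: int) -> float:
    """Approximate probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)
```

Under this scale, a 1 in 1000 chance of a wrong mapping position corresponds to a mapping quality of 30.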
-
record
Fragment
¶ Fields: - id (string) – The fragment ID.
A fragment represents a contiguous stretch of a DNA or RNA molecule. Reads can be associated with a fragment to specify they derive from the same molecule. TODO: this Fragment object is essentially unused, and may be removed in a future PR.
-
record
ReadAlignment
¶ Fields: - id (null|string) –
- The read alignment ID. This ID is unique within the read group this
- alignment belongs to.
For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.
- readGroupId (string) –
- The ID of the read group this read belongs to.
- (Every read must belong to exactly one read group.)
- fragmentId (string) –
- The fragment ID that this ReadAlignment belongs to.
- TODO: this is the only reference to the Fragment object, which may be removed in a future PR.
- fragmentName (string) – The fragment name. Equivalent to QNAME (query template name) in SAM.
- properPlacement (null|boolean) –
- The orientation and the distance between reads from the fragment are
- consistent with the sequencing protocol (equivalent to SAM flag 0x2)
- duplicateFragment (null|boolean) – The fragment is a PCR or optical duplicate (SAM flag 0x400).
- numberReads (null|int) – The number of reads in the fragment (extension to SAM flag 0x1)
- fragmentLength (null|int) – The observed length of the fragment, equivalent to TLEN in SAM.
- readNumber (null|int) –
- The read ordinal in the fragment, 0-based and less than numberReads. This
- field replaces SAM flags 0x40 and 0x80 and is intended to more cleanly represent multiple reads per fragment.
- failedVendorQualityChecks (null|boolean) – The read fails platform or vendor quality checks (SAM flag 0x200).
- alignment (null|LinearAlignment) –
- The alignment for this alignment record. This field will be null if the read
- is unmapped.
- secondaryAlignment (null|boolean) –
- Whether this alignment is secondary. Equivalent to SAM flag 0x100.
- A secondary alignment represents an alternative to the primary alignment
for this read. Aligners may return secondary alignments if a read can map
ambiguously to multiple coordinates in the genome.
By convention, each read has one and only one alignment where both secondaryAlignment and supplementaryAlignment are false.
- supplementaryAlignment (null|boolean) –
- Whether this alignment is supplementary. Equivalent to SAM flag 0x800.
- Supplementary alignments are used in the representation of a chimeric
alignment. In a chimeric alignment, a read is split into multiple
linear alignments that map to different reference contigs. The first
linear alignment in the read will be designated as the representative alignment;
the remaining linear alignments will be designated as supplementary alignments.
These alignments may have different mapping quality scores.
In each linear alignment in a chimeric alignment, the read will be hard clipped. The alignedSequence and alignedQuality fields in the alignment record will only represent the bases for its respective linear alignment.
- alignedSequence (null|string) –
- The bases of the read sequence contained in this alignment record (equivalent
- to SEQ in SAM).
alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
- alignedQuality (array<int>) –
- The quality of the read sequence contained in this alignment record
- (equivalent to QUAL in SAM).
alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.
- nextMatePosition (null|Position) –
- The mapping of the primary alignment of the (readNumber+1)%numberReads
- read in the fragment. It replaces mate position and mate strand in SAM.
- info (map<array<string>>) – A map of additional read alignment information.
Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.
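Several ReadAlignment fields are described in terms of SAM flag bits. The sketch below shows one way a converter might derive them from a SAM FLAG value. Both helpers are hypothetical. The sketch assumes a paired read has exactly two reads per fragment, although numberReads is described as an extension that may allow more.

```python
def flags_to_fields(flag: int) -> dict:
    """Derive flag-based ReadAlignment fields from a SAM FLAG integer."""
    paired = bool(flag & 0x1)
    first = bool(flag & 0x40)
    last = bool(flag & 0x80)
    if not paired:
        read_number = 0                   # single read in the fragment
    elif first != last:
        read_number = 0 if first else 1   # exactly one of 0x40/0x80 set
    else:
        read_number = None                # ordinal cannot be determined
    return {
        "properPlacement": bool(flag & 0x2),
        "duplicateFragment": bool(flag & 0x400),
        "numberReads": 2 if paired else 1,  # assumption: pairs only
        "readNumber": read_number,
        "failedVendorQualityChecks": bool(flag & 0x200),
        "secondaryAlignment": bool(flag & 0x100),
        "supplementaryAlignment": bool(flag & 0x800),
    }


def next_mate_number(read_number: int, number_reads: int) -> int:
    """Ordinal of the read that nextMatePosition refers to."""
    return (read_number + 1) % number_reads
```

With two reads per fragment, `next_mate_number` maps read 0 to read 1 and read 1 back to read 0, matching the usual SAM mate relationship.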