reads

This file defines the objects used to represent reads and alignments, most importantly ReadGroupSet, ReadGroup, and ReadAlignment. See {TODO: LINK TO READS OVERVIEW} for more information.

NB: All read groups in a ReadGroupSet must be mapped to the same reference set.

message ReadStats
Fields:
  • aligned_read_count (long) – The number of aligned reads.
  • unaligned_read_count (long) – The number of unaligned reads.
  • base_count (long) – The total number of bases. This is equivalent to the sum of alignedSequence.length for all reads.

ReadStats can be used to provide summary statistics about read data.
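
As an illustration of how these three counters relate to the underlying reads, the following is a minimal sketch rather than part of the schema. The `Read` class and `compute_read_stats` function are hypothetical; the sketch assumes each record is counted once and that a read is "aligned" when its alignment field is set.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Read:
    """Hypothetical stand-in for a ReadAlignment; only the fields needed here."""
    aligned_sequence: str
    is_aligned: bool  # assumption: True when the `alignment` field is set


@dataclass
class ReadStats:
    aligned_read_count: int = 0
    unaligned_read_count: int = 0
    base_count: int = 0


def compute_read_stats(reads: Iterable[Read]) -> ReadStats:
    """Tally aligned/unaligned reads and sum the aligned sequence lengths."""
    stats = ReadStats()
    for read in reads:
        if read.is_aligned:
            stats.aligned_read_count += 1
        else:
            stats.unaligned_read_count += 1
        # base_count is the sum of alignedSequence lengths over all reads.
        stats.base_count += len(read.aligned_sequence)
    return stats
```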

message ReadGroup
Fields:
  • id (string) – The read group ID.
  • dataset_id (string) – The ID of the dataset this read group belongs to.
  • name (string) – The read group name.
  • description (string) – The read group description.
  • sample_name (string) – A name for the sample this read group’s data were generated from. This field contains an arbitrary string, typically corresponding to the SM tag in a BAM file.
  • biosample_id (string) – The ID of the Biosample this read group’s data were generated from.
  • experiment (Experiment) – The experiment used to generate this read group.
  • predicted_insert_size (integer) – The predicted insert size of this read group.
  • created (long) – The time at which this read group was created in milliseconds from the epoch.
  • updated (long) – The time at which this read group was last updated in milliseconds from the epoch.
  • stats (ReadStats) – Statistical data on reads in this read group.
  • programs (list of Program) – Program can be used to track the provenance of how read data was generated.
  • reference_set_id (string) – The ID of the reference set to which the reads in this read group are aligned. Required if there are any read alignments.
  • attributes (Attributes) – A map of additional information about the Read Group.

A ReadGroup is a set of reads derived from one physical sequencing process.

message ReadGroupSet
Fields:
  • id (string) – The read group set ID.
  • dataset_id (string) – The ID of the dataset this read group set belongs to.
  • name (string) – The read group set name.
  • stats (ReadStats) – Statistical data on reads in this read group set.
  • read_groups (list of ReadGroup) – The read groups in this set.
  • attributes (Attributes) – A map of additional information about the Read Group Set.

A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.
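
The requirement that every read group in a set maps to the same reference set can be checked with a few lines of client code. The sketch below is one possible check, not an API defined by this file; `common_reference_set_id` is a hypothetical helper, and it assumes that read groups without alignments may leave reference_set_id unset.

```python
from typing import Iterable, Optional


def common_reference_set_id(reference_set_ids: Iterable[Optional[str]]) -> Optional[str]:
    """Return the one reference set ID shared by all read groups.

    Unset IDs (None or empty) are ignored, on the assumption that a read
    group with no alignments need not name a reference set. Returns None
    if no read group names one; raises ValueError if they disagree.
    """
    distinct = {rid for rid in reference_set_ids if rid}
    if len(distinct) > 1:
        raise ValueError(
            f"read groups map to multiple reference sets: {sorted(distinct)}"
        )
    return next(iter(distinct), None)
```

A caller would pass the reference_set_id of each entry in read_groups, for example `common_reference_set_id(rg.reference_set_id for rg in read_group_set.read_groups)`.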

message LinearAlignment
Fields:
  • position (Position) – The position of this alignment.
  • mapping_quality (integer) –

    The mapping quality of this alignment, meaning the likelihood that the read maps to this position.

    Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the nearest integer.

  • cigar (list of CigarUnit) – Represents the local alignment of this sequence (alignment matches, indels, etc) versus the reference.

A linear alignment describes the alignment of a read to a Reference, using a position and CIGAR array.
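
The mapping quality formula above can be written out directly. This is a small sketch for illustration; the function names are hypothetical, and Python's built-in `round` resolves exact ties to the nearest even integer, which is assumed to be acceptable here.

```python
import math


def mapping_quality(p_wrong: float) -> int:
    """Phred-scaled mapping quality: -10 * log10(Pr(position is wrong))."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("p_wrong must be in the interval (0, 1]")
    return round(-10 * math.log10(p_wrong))


def error_probability(mapq: int) -> float:
    """Inverse of mapping_quality, up to the rounding applied there."""
    return 10 ** (-mapq / 10)
```

For example, a 1 in 1000 chance that the mapping position is wrong corresponds to a mapping quality of 30.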

message ReadAlignment
Fields:
  • id (string) –

    The read alignment ID. This ID is unique within the read group this alignment belongs to.

    For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.

  • read_group_id (string) – The ID of the read group this read belongs to. (Every read must belong to exactly one read group.)
  • fragment_name (string) – The fragment name. Equivalent to QNAME (query template name) in SAM.
  • improper_placement (boolean) – The orientation and the distance between reads from the fragment are inconsistent with the sequencing protocol (inverse of SAM flag 0x2).
  • duplicate_fragment (boolean) – The fragment is a PCR or optical duplicate (SAM flag 0x400).
  • number_reads (integer) – The number of reads in the fragment (extension to SAM flag 0x1).
  • fragment_length (integer) – The observed length of the fragment, equivalent to TLEN in SAM.
  • read_number (integer) – The read ordinal in the fragment, 0-based and less than numberReads. This field replaces SAM flags 0x40 and 0x80 and is intended to more cleanly represent multiple reads per fragment.
  • failed_vendor_quality_checks (boolean) – The read fails platform or vendor quality checks (SAM flag 0x200).
  • alignment (LinearAlignment) – The linear alignment for this read. This field is null if the read is unmapped.
  • secondary_alignment (boolean) –

    Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome.

    By convention, each read has one and only one alignment where both secondaryAlignment and supplementaryAlignment are false.

  • supplementary_alignment (boolean) –

    Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores.

    In each linear alignment in a chimeric alignment, the read will be hard clipped. The alignedSequence and alignedQuality fields in the alignment message will only represent the bases for its respective linear alignment.

  • aligned_sequence (string) –

    The bases of the read sequence contained in this alignment record (equivalent to SEQ in SAM).

    alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

  • aligned_quality (list of integer) –

    The quality of the read sequence contained in this alignment message (equivalent to QUAL in SAM).

    alignedSequence and alignedQuality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin/end with a hard clip operator that will indicate the length of the excised sequence.

  • next_mate_position (Position) – The mapping of the primary alignment of the (readNumber+1)%numberReads read in the fragment. It replaces mate position and mate strand in SAM.
  • attributes (Attributes) – A map of additional information about the Alignment.

Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.
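
Because several fields are described in terms of SAM flag bits, a converter can reassemble a SAM FLAG value from them. The sketch below is one possible mapping under stated assumptions, not a conversion defined by this file: `ReadFlags`, `sam_flag`, and `next_mate_index` are hypothetical names, the handling of 0x2, 0x40, and 0x80 for fragments with more than one read is an assumption, and the strand and mate bits (0x8, 0x10, 0x20) are omitted because they depend on Position, which is not described here.

```python
from dataclasses import dataclass


@dataclass
class ReadFlags:
    """Hypothetical subset of ReadAlignment fields that map onto SAM flag bits."""
    number_reads: int = 1
    read_number: int = 0
    improper_placement: bool = False
    duplicate_fragment: bool = False
    failed_vendor_quality_checks: bool = False
    secondary_alignment: bool = False
    supplementary_alignment: bool = False
    is_mapped: bool = True  # assumption: False when `alignment` is null


def sam_flag(read: ReadFlags) -> int:
    """Build a partial SAM FLAG value from the boolean and ordinal fields."""
    flag = 0
    if read.number_reads > 1:
        flag |= 0x1  # fragment has multiple reads
        if not read.improper_placement:
            flag |= 0x2  # improper_placement is the inverse of 0x2
        if read.read_number == 0:
            flag |= 0x40  # first read in the fragment
        if read.read_number == read.number_reads - 1:
            flag |= 0x80  # last read in the fragment
    if not read.is_mapped:
        flag |= 0x4
    if read.secondary_alignment:
        flag |= 0x100
    if read.failed_vendor_quality_checks:
        flag |= 0x200
    if read.duplicate_fragment:
        flag |= 0x400
    if read.supplementary_alignment:
        flag |= 0x800
    return flag


def next_mate_index(read_number: int, number_reads: int) -> int:
    """Ordinal of the read whose position next_mate_position refers to."""
    return (read_number + 1) % number_reads
```

A real converter may treat edge cases differently, for example by clearing 0x2 when the read is unmapped.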

message CigarUnit
Fields:
  • operation (Operation) – The type of CIGAR operation.
  • operation_length (long) – The number of genomic bases that the operation runs for. Required.
  • reference_sequence (string) – referenceSequence is only used at mismatches (SEQUENCE_MISMATCH) and deletions (DELETE). Filling this field replaces SAM’s MD tag. If the relevant information is not available, this field is unset.

A single CIGAR operation.

enum Operation
Symbols: OPERATION_UNSPECIFIED | ALIGNMENT_MATCH | INSERT | DELETE | SKIP | CLIP_SOFT | CLIP_HARD | PAD | SEQUENCE_MATCH | SEQUENCE_MISMATCH
  • OPERATION_UNSPECIFIED:
  • ALIGNMENT_MATCH:
  • INSERT:
  • DELETE:
  • SKIP:
  • CLIP_SOFT:
  • CLIP_HARD:
  • PAD:
  • SEQUENCE_MATCH:
  • SEQUENCE_MISMATCH:
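
For readers familiar with SAM, the symbols above correspond to the standard CIGAR operation characters. The following sketch shows that correspondence and how a list of CigarUnit values could be rendered or summarized. The helper names are hypothetical, each unit is represented as an (operation, operation_length) tuple for brevity, and the set of reference-consuming operations follows the SAM specification rather than anything stated in this file.

```python
from typing import Iterable, Tuple

CigarUnit = Tuple[str, int]  # (operation, operation_length)

# SAM CIGAR character for each Operation symbol.
# OPERATION_UNSPECIFIED has no SAM equivalent and is deliberately absent.
SAM_CIGAR_CHAR = {
    "ALIGNMENT_MATCH": "M",
    "INSERT": "I",
    "DELETE": "D",
    "SKIP": "N",
    "CLIP_SOFT": "S",
    "CLIP_HARD": "H",
    "PAD": "P",
    "SEQUENCE_MATCH": "=",
    "SEQUENCE_MISMATCH": "X",
}

# Operations that advance the position on the reference (per the SAM spec).
CONSUMES_REFERENCE = {
    "ALIGNMENT_MATCH", "DELETE", "SKIP", "SEQUENCE_MATCH", "SEQUENCE_MISMATCH",
}


def cigar_string(units: Iterable[CigarUnit]) -> str:
    """Render units as a SAM CIGAR string; raises KeyError on unknown ops."""
    return "".join(f"{length}{SAM_CIGAR_CHAR[op]}" for op, length in units)


def reference_span(units: Iterable[CigarUnit]) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(length for op, length in units if op in CONSUMES_REFERENCE)


def hard_clipped_length(units: Iterable[CigarUnit]) -> int:
    """Number of read bases excised by hard clipping."""
    return sum(length for op, length in units if op == "CLIP_HARD")
```

The hard-clipped length is the amount by which alignedSequence is shorter than the full read when an alignment has been hard clipped.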