This protocol defines the associations between genotype and phenotype (G2P). Associations can be made as a result of literature curation, computational modeling, inference, etc., and modeled and shared using this schema.

Here, we follow the dogma of: Genotype + Environment = Phenotype

Here, a G2P association is between the G(enotype), in the context of some E(nvironment), and the P(henotype) it gives rise to. These associations carry further evidence, provenance, and attribution. We leverage the GenomicFeature in the sequenceAnnotation schema because it can accommodate any genomic feature, from a single nucleotide variation (SNV) up through a gene or a complex rearrangement. Each can be modeled as a genomic feature and linked to a phenotype. Collections of these features can represent a genotype at different levels of completeness. Therefore, we can represent single allelic variation, allelic complement, and multiple variants in a genotype, which can each or collectively be associated with a phenotype. To enable standardized integration, this schema relies heavily on OntologyTerms for typing phenotypes, genomic features, and levels of evidence. Suggested ontologies to leverage include (with browser links):

message PhenotypeAssociationSet
  • id (string) – The phenotype association set ID.
  • name (string) – The phenotype association set name.
  • dataset_id (string) – The ID of the dataset this phenotype association set belongs to.
  • info (map< string , ListValue >) – Optional additional information for this phenotype association set.

The top-level container for phenotype association data.
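
As an illustration of the field list above, here is a minimal Python sketch that mirrors PhenotypeAssociationSet as a plain dataclass. It is not generated from the schema: the `map< string , ListValue >` type is approximated as a dictionary of lists, and every identifier value in the example is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PhenotypeAssociationSet:
    """Plain-Python mirror of the documented fields (a sketch, not generated code)."""

    id: str
    name: str
    dataset_id: str
    # Approximates map<string, ListValue>: each key maps to a list of values.
    info: Dict[str, List[Any]] = field(default_factory=dict)


# Hypothetical identifiers, for illustration only.
example_set = PhenotypeAssociationSet(
    id="pas-001",
    name="example-g2p-set",
    dataset_id="dataset-001",
    info={"curator": ["example-curation-team"]},
)
```

Associations refer back to this container through their phenotype_association_set_id field, so the set itself only needs to carry identifying metadata.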

message EnvironmentalContext

The context in which a genotype gives rise to a phenotype. This is fairly open-ended; as a stub, it is currently a simple ontology term. Examples include a controlled term for a drug, an instance of a complex environment including temperature and air quality, or the anatomical environment (gut vs. tissue type vs. whole organism).

message PhenotypeInstance

An association to a phenotype and related information. This record is intended primarily to be used in conjunction with variants, but it can also be composed with other kinds of entities, such as diseases.

message Evidence
  • evidence_type (OntologyTerm) – The type of evidence. A term from ECO or OBI is recommended.
  • description (string) – A textual description of the evidence. This is used to complement the structured description in the evidence_type field.
  • info (map< string , ListValue >) – Additional annotation data in key-value pairs.

Evidence for the phenotype association. This is also a stub for further expansion. We should consider moving this into its own schema.
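
The Evidence fields above could be populated along the following lines. This is a sketch under stated assumptions: the OntologyTerm shape (`term_id` and `term`) is assumed here because OntologyTerm is not defined in this section, and the ECO root term is used purely as a placeholder rather than as a recommended evidence type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class OntologyTerm:
    # Assumed shape: OntologyTerm is not defined in this section.
    term_id: str
    term: str


@dataclass
class Evidence:
    """Plain-Python mirror of the documented Evidence fields (a sketch)."""

    evidence_type: OntologyTerm
    description: str = ""
    # Approximates map<string, ListValue>.
    info: Dict[str, List[Any]] = field(default_factory=dict)


# Placeholder values: ECO:0000000 is the root "evidence" term, used only
# to show where a more specific ECO or OBI term would go.
example_evidence = Evidence(
    evidence_type=OntologyTerm(term_id="ECO:0000000", term="evidence"),
    description="Free-text note complementing the structured evidence_type.",
)
```

Keeping the structured term and the free-text description separate lets clients filter on evidence_type while still showing curators' notes.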

message FeaturePhenotypeAssociation
  • id (string) – A unique identifier for the association.
  • phenotype_association_set_id (string) – The ID of the PhenotypeAssociationSet this FeaturePhenotypeAssociation belongs to.
  • feature_ids (string) – The set of features of the organism that bears the phenotype. This could be as complete as a full complement of variants, or as minimal as the confirmed variants that are known to be causal for the annotated phenotype. Examples of features include variations at the nucleotide level, large rearrangements at the chromosome level, or relevant epigenetic markers. Relevant genomic feature types are suggested to be those typed in the Sequence Ontology (SO). The feature set may contain as few as one item, but must not be null.
  • evidence (list of Evidence) – The evidence for this specific instance of association between the features and the phenotype.
  • phenotype (PhenotypeInstance) – The phenotypic component of this association.
  • description (string) – A textual description of the association.
  • environmental_contexts (list of EnvironmentalContext) – The context in which the phenotype arises. Multiple contexts can be specified; they are assumed to all hold together.
  • info (map< string , ListValue >) – Additional annotation data in key-value pairs.

An association between one or more genomic features and a phenotype. Because each association is its own instance, the same feature can be linked to the same phenotype multiple times, with each instance potentially bearing a different level of confidence, for example when the associations result from alternative experiments and analyses.
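
To make the "multiple times" point concrete, the sketch below links one feature to one phenotype through two separate association instances and then groups associations by feature. It makes several simplifying assumptions: the phenotype and evidence fields are reduced to plain strings as stand-ins for PhenotypeInstance and Evidence, feature_ids is treated as a list with at least one entry, `group_by_feature` is a hypothetical helper, and all identifiers are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeaturePhenotypeAssociation:
    id: str
    phenotype_association_set_id: str
    feature_ids: List[str]
    # Stand-ins: plain strings replace PhenotypeInstance and Evidence here.
    phenotype: str
    evidence: List[str] = field(default_factory=list)
    description: str = ""

    def __post_init__(self) -> None:
        # Assumed reading of the field note: at least one feature is required.
        if not self.feature_ids:
            raise ValueError("feature_ids must contain at least one feature ID")


def group_by_feature(
    associations: List[FeaturePhenotypeAssociation],
) -> Dict[str, List[FeaturePhenotypeAssociation]]:
    """Hypothetical helper: index associations by each feature they mention."""
    grouped: Dict[str, List[FeaturePhenotypeAssociation]] = defaultdict(list)
    for assoc in associations:
        for feature_id in assoc.feature_ids:
            grouped[feature_id].append(assoc)
    return dict(grouped)


# Two independent assertions of the same feature-phenotype link,
# each backed by a different (invented) line of evidence.
associations = [
    FeaturePhenotypeAssociation(
        id="assoc-1",
        phenotype_association_set_id="pas-001",
        feature_ids=["feature-1"],
        phenotype="phenotype-A",
        evidence=["literature curation"],
    ),
    FeaturePhenotypeAssociation(
        id="assoc-2",
        phenotype_association_set_id="pas-001",
        feature_ids=["feature-1", "feature-2"],
        phenotype="phenotype-A",
        evidence=["computational inference"],
    ),
]

by_feature = group_by_feature(associations)
```

Here `by_feature["feature-1"]` holds both associations, so a client can compare their evidence side by side instead of collapsing them into a single record.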