Genotype To Phenotype API

Summary

This API allows users to search for genotype-phenotype associations in a GA4GH datastore. The user can search for associations by building queries composed of features, phenotypes, and/or evidence terms. The API is designed to accommodate search terms specified as a string, an external identifier, an ontology identifier, or an ‘entity’ (see the Data Model section). These terms are combined as a logical AND (feature && phenotype && evidence). This flexibility in the schema allows a variety of data to be stored in the database and allows users to express a wide range of queries.

Users will receive an array of associations as a response. Associations contain description and environment fields in addition to the relevant feature, phenotype, and evidence fields for that instance of association.
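
The AND semantics described above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical `Association` record with plain string fields and treats a term left as `None` as "no constraint"; the real schema uses richer types, and the sample records are invented toy data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Association:
    # Hypothetical, simplified record; not the actual GA4GH message.
    feature: str
    phenotype: str
    evidence: str
    description: str = ""


def search_associations(
    store: List[Association],
    feature: Optional[str] = None,
    phenotype: Optional[str] = None,
    evidence: Optional[str] = None,
) -> List[Association]:
    """Return associations that match every supplied term (logical AND)."""

    def matches(term: Optional[str], value: str) -> bool:
        # A term left as None places no constraint on that field.
        return term is None or term == value

    return [
        a for a in store
        if matches(feature, a.feature)
        and matches(phenotype, a.phenotype)
        and matches(evidence, a.evidence)
    ]


# Invented toy records, for illustration only.
store = [
    Association("KIT", "GIST", "ECO_0000180"),
    Association("KIT", "GIST", "EVIDENCE_X"),
    Association("GENE_X", "GIST", "ECO_0000180"),
    Association("KIT", "PHENOTYPE_X", "ECO_0000180"),
]

narrow = search_associations(store, feature="KIT", phenotype="GIST", evidence="ECO_0000180")
broad = search_associations(store, phenotype="GIST")
```

Here `narrow` holds only the record matching all three terms, while `broad` holds every record whose phenotype is GIST, because the feature and evidence terms were omitted.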

Multiple server collation - Background

G2P servers are planned to be implemented in three different contexts:

  • As a wrapper around standalone local G2P “knowledge bases” (e.g. Monarch, CIViC, etc.). An important consideration is that the G2P API needs to function independently of other parts of the API and separately from any specific omics dataset. Often, these databases are not curated with complete Feature fields (referenceName, start, end, strand).
  • Coupled with sequence annotation and GA4GH datasets. Clients will want implementation-specific featureId/genotypeId values to match and integrate with the rest of the APIs.
  • Operating in concert with other instances of G2P servers, where the client’s loosely federated query is supported by heterogeneous servers. Challenges: normalizing API behavior across implementations (the featureId for a given region differs per implementation).

Approach

We based our original work on the model captured in the ga4gh/ga4gh-schemas commit of Jul 30, 2015. This version of the schema predates the separation of the genotype to phenotype files from baseline. After a review of the schemas and code, the team had feedback about separation of responsibility in the original API. The API was refactored to separate the searches for genotype, phenotype, feature and associations.

Data Model

The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.


Intent: The GA4GH Ontology schema provides structures for unambiguous references to ontological concepts and/or controlled vocabularies within Protocol Buffers. The structures provided are not intended for de novo modeling of ontologies, or representing complete ontologies within Protocol Buffers. References to e.g. classes from external ontologies or controlled vocabularies should be interpreted only in their original context i.e. the source ontology.

Due to the flexibility of the data model, users have a number of options for specifying each query term: feature, phenotype, and evidence.

API

The G2P schemas define several endpoints broken into two entity searches and an association search.

A feature or phenotype can potentially be represented, in increasing specificity, as a string, an ontology identifier, an external identifier, or a feature ‘entity’. One criticism of the previous API was that it was overloaded, violating the design goal of separation of concerns. Specifically, it combined the search for evidence with the searches for features and genotypes.

The refactored API moves search, alias matching and external identifier lookup to dedicated endpoints. To separate concerns, a client performs the queries for evidence in two steps: first find the desired entities, and then use those entity identifiers to narrow the search for evidence.

Additionally the API supports two implementation styles: integrated and standalone.


Entity Searches

  • /features/search
    • Given a SearchFeaturesRequest, return matching features in the current 'omics dataset. Intended for sequence annotation and GA4GH datasets.
  • /phenotypes/search
    • Given a SearchPhenotypesRequest, return matching phenotypes in the current G2P dataset.

Usage

  1. As a GA4GH client, use entity queries for the genotypes and phenotypes you are interested in.
  2. Create an association search using the entity identifiers from step 1.
  3. Repeat 1-2 as necessary, collating responses on the client.
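
The three steps above can be sketched as follows. This is a minimal illustration, not the server's implementation: the three `search_*` functions are in-memory stand-ins for the endpoints, and the record fields (`id`, `name`, `description`, `feature_ids`, `phenotype_ids`) and all identifiers are assumptions made for the example.

```python
from typing import Dict, List

# Toy in-memory stand-ins for server data; every identifier here is invented.
FEATURES = [{"id": "f1", "name": "KIT"}, {"id": "f2", "name": "GENE_X"}]
PHENOTYPES = [{"id": "p1", "description": "GIST"}]
ASSOCIATIONS = [
    {"id": "a1", "feature_ids": ["f1"], "phenotype_ids": ["p1"]},
    {"id": "a2", "feature_ids": ["f2"], "phenotype_ids": ["p1"]},
]


def search_features(name: str) -> List[Dict]:
    """Stand-in for /features/search."""
    return [f for f in FEATURES if f["name"] == name]


def search_phenotypes(description: str) -> List[Dict]:
    """Stand-in for /phenotypes/search."""
    return [p for p in PHENOTYPES if p["description"] == description]


def search_associations(feature_ids: List[str], phenotype_ids: List[str]) -> List[Dict]:
    """Stand-in for /featurephenotypeassociations/search."""
    return [
        a for a in ASSOCIATIONS
        if set(a["feature_ids"]) & set(feature_ids)
        and set(a["phenotype_ids"]) & set(phenotype_ids)
    ]


def find_associations(gene: str, disease: str) -> List[Dict]:
    # Step 1: resolve the search terms to entity identifiers.
    feature_ids = [f["id"] for f in search_features(gene)]
    phenotype_ids = [p["id"] for p in search_phenotypes(disease)]
    # Step 2: use those identifiers to narrow the association search.
    return search_associations(feature_ids, phenotype_ids)


# Step 3: repeat for each pair of interest and collate on the client.
collated: List[Dict] = []
for gene, disease in [("KIT", "GIST"), ("GENE_X", "GIST")]:
    collated.extend(find_associations(gene, disease))
```

Keeping entity resolution separate from the association search means the client, not the association endpoint, decides how names and aliases map to identifiers.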

Many types rely heavily on the concept of an OntologyTerm (see end of document for discussion on usage of OntologyTerms).

Implementation


Source Code

  • Front End ‘/features/search’, ‘/datasets/<datasetId>/features/search’, ‘/phenotypes/search’, ‘/featurephenotypeassociations/search’
  • Back End ‘runSearchFeatures’, ‘runSearchGenotypePhenotypes’, ‘runSearchPhenotypes’, ‘runSearchGenotypes’
  • Datamodel ‘getAssociations’, ‘getAssociations’ (Features)

Tests

Help Wanted: Any or all use cases and scenarios

Acceptance

  • Submittal of 3 simultaneous pull-requests for server, schema and compliance repositories
  • 2 +1s for each repository from outside the development team
  • Additional 3 day review for schemas

API Details and Examples



/features/search

See sequence annotations documentation.


Use cases

  1. As a clinician or a genomics researcher, I may have a patient with Gastrointestinal stromal tumor, GIST, and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query to /featurephenotypeassociations/search with GIST as the phenotype, KIT as the feature, and clinical study evidence <http://purl.obolibrary.org/obo/ECO_0000180> as the evidence.

    In response, I will receive back a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. URIs in the associations field could, hypothetically, be followed to discover that GIST patients with wild-type *KIT* have decreased sensitivity to therapy with imatinib <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2651076/>.

    If I left both the feature and evidence fields as null, I would receive back all associations which involve GIST as a phenotype.

  2. As a non-Hodgkin’s lymphoma researcher, I may know that the gene CD20 has abnormal expression in Hodgkin's lymphoma <http://purl.obolibrary.org/obo/DOID_8567>. I might be interested in knowing whether CD20 also has abnormal expression in non-Hodgkin’s lymphoma <http://purl.obolibrary.org/obo/DOID_0060060>. Therefore I could perform a query with CD20 as a feature, non-Hodgkin’s lymphoma as a phenotype, and RNA sequencing <http://purl.obolibrary.org/obo/OBI_0001177> as the evidence type.

  3. As a genetic counselor, I may be wondering if a mutation in one of my clients’ genes has ever been associated with a disease. I could then do a query based on the gene name as the feature and disease <http://purl.obolibrary.org/obo/DOID_4> as the phenotype.

For specifics of the JSON representations, please see the server <https://github.com/ga4gh/ga4gh-server> and compliance <https://github.com/ga4gh/compliance> repositories.

Ontologies

Usage: Multiple ontology terms can be supplied, e.g. to describe a series of phenotypes for a specific sample. The OntologyTerm message is not intended to model relationships between terms, or to provide mappings between ontologies for the same concept. Should an OntologyTerm be unavailable, or terms be unmapped, an ‘annotation’ can be provided, which can later be mapped to an ontology term using a service designed for this. Using OntologyTerm is preferred to using Annotation, though annotations can be supplied with related ontology terms if desired. A use case could be when a free text annotation is very specific and a more general OntologyTerm is supplied.
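
The preference for OntologyTerm over a free text annotation could be modelled along these lines. This is a hedged sketch: the class and field names (`term_id`, `label`, `ontology_terms`, `annotation`, `needs_mapping`) are hypothetical and do not reproduce the actual Protocol Buffers messages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class OntologyTerm:
    # Hypothetical fields: an unambiguous reference plus a readable label.
    term_id: str
    label: str = ""


@dataclass
class PhenotypeDescription:
    # Zero or more ontology terms; several may describe one sample.
    ontology_terms: List[OntologyTerm] = field(default_factory=list)
    # Free text, used alone only when no term is available or mapped.
    annotation: Optional[str] = None

    def needs_mapping(self) -> bool:
        """True when only a free text annotation is present."""
        return not self.ontology_terms and self.annotation is not None


# A general ontology term accompanied by a more specific free text note.
mapped = PhenotypeDescription(
    ontology_terms=[
        OntologyTerm("http://purl.obolibrary.org/obo/DOID_8567", "Hodgkin's lymphoma")
    ],
    annotation="free text note that is more specific than the term",
)

# No term available yet; a mapping service could resolve this later.
unmapped = PhenotypeDescription(annotation="free text phenotype description")
```

A record for which `needs_mapping()` returns true is a candidate for the later mapping step mentioned above.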

Read more about Ontology Terms


Directions for future capabilities.

Flexible representation of Feature

  • Q: I need to look up a Feature by proteinName or other external id. How do I look them up?
    Currently, sequence annotation’s features/search supports search by name or location. Future versions should implement lookup by alias.
  • Q: I have results from multiple G2P Servers. How do I collate them across datasets and implementations?
    This is a subject for investigation as we create a federation of G2P servers. The responsibility for collating features and associations across servers currently rests with the client. One strategy might be to use HGVS DNA annotation as a neutral identifier for a feature.
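
The collation strategy suggested in the last answer might look like the sketch below. It assumes each association already carries an `hgvs` field holding a server-neutral expression; that field name, the server names, and the placeholder values are all invented for illustration, and no current implementation is implied.

```python
from collections import defaultdict
from typing import Dict, List


def collate_by_hgvs(responses: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    """Group associations from several servers by a neutral HGVS key.

    `responses` maps a server name to the associations it returned.
    Server-specific feature ids are kept but are not used for grouping.
    """
    merged: Dict[str, List[dict]] = defaultdict(list)
    for server, associations in responses.items():
        for assoc in associations:
            # Record which server supplied each association.
            merged[assoc["hgvs"]].append({**assoc, "server": server})
    return dict(merged)


# Placeholder data: the same feature has a different id on each server.
responses = {
    "server_a": [
        {"feature_id": "a-17", "hgvs": "hgvs-expression-A", "phenotype": "GIST"}
    ],
    "server_b": [
        {"feature_id": "b-03", "hgvs": "hgvs-expression-A", "phenotype": "GIST"},
        {"feature_id": "b-09", "hgvs": "hgvs-expression-B", "phenotype": "GIST"},
    ],
}

merged = collate_by_hgvs(responses)
```

Grouping on the neutral key sidesteps the problem that the featureId for a given region differs per implementation.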

Expanding scope to entities other than Feature

Consider instead a PhenotypeAssociation, which has a wider scope: the objects it connects and the evidence type determine the meaning of the association.
