biolink/biolink-model

Name: biolink-model

Owner: biolink

Description: Schema and generated objects for biolink data model and upper ontology

Created: 2017-12-04 04:45:18.0

Updated: 2018-01-09 18:20:26.0

Pushed: 2018-01-10 20:31:26.0

Homepage:

Size: 34461

Language: Java

README

biolink-models

A high-level datamodel of biological entities (genes, diseases, phenotypes, pathways, individuals, substances, etc.) and their associations.

The immediate goal is to provide a reference datamodel that is independent of storage technology (Solr, Neo4j, CSV files, etc.). This reference datamodel can be used for a variety of purposes:

The specification of the reference biolink model is a single YAML file following a custom meta-model. The basic elements of the YAML are:

Organization

The datamodel source is biolink-model.yaml, a YAML file intended to be relatively simple to view and edit in its native form.

The YAML definition is currently used to derive:

We leverage existing frameworks where possible. For example, JSON Schema allows code generation for other languages.

TODO:

Additionally, this repo contains the metamodel's definition of itself in YAML, together with code for working with datamodels. In theory this could be used in other domains, but there is no plan for this at the moment.

Metamodel

See metamodel for details of the metamodel.

Usage in existing projects
Case study: gene expression in Monarch

Currently this is documented in the ingest artefacts repo, using non-computable cmap images:

bgee model

It is also documented by the gene-anatomy Cypher query, which maps graphs conforming to the pattern to denormalized tuples for indexing in Solr.

In the biolink model, this is represented explicitly by the gene to expression site association class definition:

name: gene to expression site association
is_a: association
description: >-
  An association between a gene and an expression site, possibly qualified by stage/timing info
see_also: "https://github.com/monarch-initiative/ingest-artifacts/tree/master/sources/BGee"
slot_usage:
  - slot: subject
    type: gene or gene product
    description: "gene or gene product that is expressed in the site"
  - slot: object
    type: anatomical entity
    description: "location in which the gene is expressed"
    subclass_of: UBERON:0001062
    examples:
      - value: UBERON:0002037
        description: cerebellum
  - slot: relation
    description: "expression relationship"
    subproperty_of: "RO:0002206"
  - slot: stage
    type: developmental stage
    description: "stage at which the gene is expressed in the site"
    examples:
      - value: UBERON:0000069
        description: larval stage
  - slot: quantifier
    description: >-
      can be used to indicate magnitude, or also ranking
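
To illustrate how a definition like the one above can be consumed, here is a minimal Python sketch. It assumes the YAML has already been parsed into a plain dictionary (a YAML library would produce an equivalent structure). The dictionary literal copies only part of the definition, and the helper name `slot_constraints` is hypothetical, not part of the biolink-model code.

```python
# Parsed form of part of the class definition above, hand-copied for
# illustration; a YAML parser would yield an equivalent structure.
gene_to_expression_site = {
    "name": "gene to expression site association",
    "is_a": "association",
    "slot_usage": [
        {"slot": "subject", "type": "gene or gene product"},
        {"slot": "object", "type": "anatomical entity",
         "subclass_of": "UBERON:0001062"},
        {"slot": "relation", "subproperty_of": "RO:0002206"},
        {"slot": "stage", "type": "developmental stage"},
        {"slot": "quantifier"},
    ],
}


def slot_constraints(class_def):
    """Map each slot name to the type given in slot_usage (None if absent)."""
    return {usage["slot"]: usage.get("type") for usage in class_def["slot_usage"]}


constraints = slot_constraints(gene_to_expression_site)
for slot, slot_type in constraints.items():
    # Slots without a "type" key carry no type override in this class.
    print(f"{slot}: {slot_type or '(no type override)'}")
```

The sketch only reads the `type` key of each `slot_usage` entry; other keys such as `subclass_of` or `subproperty_of` would need their own handling.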

This is used to generate various artefacts such as

Auto-generated image:

img

type GeneToExpressionSiteAssociation {
  qualifiers: [String]
  stageQualifier: LifeStage
  objectExtensions: [PropertyValuePair]
  hasEvidence: String
  publications: [Publication]
  object: AnatomicalEntity!
  hasEvidenceType: EvidenceType
  hasEvidenceGraph: String
  providedBy: Provider
  label: String
  relation: String!
  negated: String
  subject: GeneOrGeneProduct!
  id: String!
  quantifierQualifier: String
  associationType: String
  subjectExtensions: [PropertyValuePair]
}
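
The generated names follow a visible pattern: the space-separated class name becomes a CamelCase type name, and the fields are camelCase. The Python sketch below shows one way such a mapping could be written. It is an assumption made for illustration, not the repository's actual generator; the function names `class_name`, `field_name` and `render_graphql_type` are hypothetical, and it assumes slot names are space-separated like class names.

```python
def _upper_first(word):
    """Upper-case the first character and leave the rest untouched."""
    return word[:1].upper() + word[1:]


def class_name(model_name):
    """'gene to expression site association' -> 'GeneToExpressionSiteAssociation'."""
    return "".join(_upper_first(word) for word in model_name.split())


def field_name(slot_name):
    """'has evidence type' -> 'hasEvidenceType'."""
    first, *rest = slot_name.split()
    return first + "".join(_upper_first(word) for word in rest)


def render_graphql_type(model_name, fields):
    """Render a GraphQL-style type; `fields` maps slot names to type strings."""
    lines = [f"type {class_name(model_name)} {{"]
    for slot, gql_type in fields.items():
        lines.append(f"  {field_name(slot)}: {gql_type}")
    lines.append("}")
    return "\n".join(lines)


schema = render_graphql_type(
    "gene to expression site association",
    {
        "subject": "GeneOrGeneProduct!",
        "object": "AnatomicalEntity!",
        "has evidence type": "EvidenceType",
    },
)
print(schema)
```

Only three fields are rendered here to keep the example short; the field types are passed in as ready-made strings rather than derived from the model.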

Snippet of generated JSON Schema:

    "GeneToExpressionSiteAssociation": {
        "description": "An association between a gene and an expression site, possibly qualified by stage/timing info. TBD: introduce subclasses for distinction between wild-type and experimental conditions?",
        "properties": {
            "association_type": {
                "description": "connects an association to the type of association (e.g. gene to phenotype)",
                "type": "string"
            },
            "has_evidence": {
                "description": "connects an association to an instance of supporting evidence",
                "type": "string"
            },
            "has_evidence_graph": {
                "description": "connects an association to a graph object including a path from subject to object",
                "type": "string"
            },
            "has_evidence_type": {
                "description": "connects an association to the class of evidence used",
                "type": "string"
            },
            "id": {
                "type": "string"
            },
            "label": {
                "description": "A human-readable name for a thing",
                "type": "string"
            },
            "negated": {
                "description": "if set to true, then the association is negated i.e. is not true",
                "type": "string"
            },
            "object": {
                "description": "connects an association to the object of the association. For example, in a gene-to-phenotype association, the gene is subject and phenotype is object.",
                "type": "string"
            },
            "object_extensions": {
                "description": "Additional relationships that are true of the object in the context of the association. For example, if the object is an anatomical term in an expression association, the object extensions may include part-of links",
                "items": {
                    "type": "string"
                },
                "type": "array"
            },
            "provided_by": {
                "description": "connects an association to the agent (person, organization or group) that provided it",
                "type": "string"
            },
            "publications": {
                "description": "connects an association to publications supporting the association",
                "items": {
                    "type": "string"
                },
                "type": "array"
            },
            "qualifiers": {
                "description": "connects an association to qualifiers that modify or qualify the meaning of that association",
                "items": {
                    "type": "string"
                },
                "type": "array"
            },
            "quantifier_qualifier": {
                "description": "A measurable quantity for the object of the association",
                "type": "string"
            },
            "relation": {
                "description": "the relationship type by which a subject is connected to an object in an association",
                "type": "string"
            },
            "stage_qualifier": {
                "description": "stage at which expression takes place",
                "type": "string"
            },
            "subject": {
                "description": "connects an association to the subject of the association. For example, in a gene-to-phenotype association, the gene is subject and phenotype is object.",
                "type": "string"
            },
            "subject_extensions": {
                "description": "Additional relationships that are true of the subject in the context of the association. For example, if the subject is a gene product in a functional association, the subject extensions may represent  an isoform or a specific post-translational state",
                "items": {
                    "type": "string"
                },
                "type": "array"
            }
        },
        "required": [],
        "title": "GeneToExpressionSiteAssociation",
        "type": "object"
    },
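
The fragment above uses only two property shapes, a plain string and an array of strings, and its "required" list is empty. The Python sketch below checks an instance against that small subset using only the standard library; a real project would more likely use a full JSON Schema validator. The function `check_instance` and the identifier `EXAMPLE:gene1` are hypothetical, and the schema is trimmed to three properties for brevity.

```python
import json

# A trimmed copy of the schema fragment above (only three properties kept).
SCHEMA = json.loads("""
{
  "properties": {
    "subject": {"type": "string"},
    "object": {"type": "string"},
    "publications": {"items": {"type": "string"}, "type": "array"}
  },
  "required": [],
  "title": "GeneToExpressionSiteAssociation",
  "type": "object"
}
""")


def check_instance(instance, schema):
    """Return a list of problems; handles only strings and arrays of strings."""
    problems = []
    for name in schema.get("required", []):
        if name not in instance:
            problems.append(f"missing required property: {name}")
    for name, value in instance.items():
        spec = schema["properties"].get(name)
        if spec is None:
            continue  # JSON Schema allows additional properties by default
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{name}: expected a string")
        elif spec["type"] == "array" and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            problems.append(f"{name}: expected an array of strings")
    return problems


good = {"subject": "EXAMPLE:gene1", "object": "UBERON:0002037", "publications": []}
bad = {"subject": "EXAMPLE:gene1", "object": ["UBERON:0002037"]}
print(check_instance(good, SCHEMA))  # no problems reported
print(check_instance(bad, SCHEMA))   # object has the wrong shape
```

Because "required" is empty in the fragment, an empty object also passes this check.
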
FAQ
Why not use X as the modeling framework?

Why invent our own YAML and not use JSON Schema, SQL, UML, ProtoBuf, OWL, …?

Each of these is tied to a particular formalism: for example, JSON Schema to trees and OWL to open-world logic. There are various impedance mismatches in converting between them. The goal was to develop something simple and more general that is not tied to any one serialization format or set of assumptions.

There are other projects with similar goals, e.g. https://github.com/common-workflow-language/schema_salad

It may be possible to align with these.

Why not use X as the datamodel?

Here X may be bioschemas, some upper ontology (BioTop), UMLS metathesaurus, bio*, various other attempts to model all of biology in an object model.

As far as we know, there is currently no existing reference datamodel that is flexible enough to be used here.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.