wikipathways/bioentities

Name: bioentities

Owner: WikiPathways

Description: Namespace encoding hierarchical relationships between proteins, protein families, and protein complexes.

Forked from: johnbachman/bioentities

Created: 2018-01-11 21:05:38.0

Updated: 2018-04-13 01:51:05.0

Pushed: 2018-04-13 01:51:03.0

Homepage: null

Size: 560

Language: Python

README

Bioentities

Bioentities is a collection of resources for grounding biological entities from text and describing their hierarchical relationships. Resources were developed by manual curation for use by natural language processing and biological modeling teams in the DARPA Big Mechanism and Communicating with Computers programs. The repository contains the following files:

Entities, Relations and Equivalences

Bioentities contains resources for defining the relationships between genes/proteins and their membership in families and named complexes. Entities defined within the Bioentities namespace are listed in the `entities.csv` file. Cross-referencing the entries among the various files maintains consistency and prevents errors.

Relationships are defined in `relations.csv` as triples using two relationships, `isa` and `partof`.

These two relationships can be combined to capture complex hierarchical relationships, including sub-families (families within families) and complexes consisting of families of related subunits (e.g., PI3K, NF-kB).

The `relations.csv` file consists of five columns: (1) the namespace for the subject (e.g., `HGNC` for gene names, `UP` for UniProt, or `BE` for the Bioentities namespace), (2) the identifier for the subject, (3) the relationship (`isa` or `partof`), (4) the namespace for the object, and (5) the identifier for the object.
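As a minimal sketch of how the five-column layout might be consumed, the snippet below reads rows into a child-to-parents map and walks `isa`/`partof` links upward. The sample rows, `load_relations`, and `ancestors` are illustrative assumptions, not part of the repository; it also assumes the file has no header row.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the five-column layout described above;
# they are illustrative and not copied from relations.csv.
SAMPLE = (
    "HGNC,AKT1,isa,BE,AKT\r\n"
    "HGNC,AKT2,isa,BE,AKT\r\n"
    "BE,PI3K_p110,partof,BE,PI3K\r\n"
    "HGNC,PIK3CA,isa,BE,PI3K_p110\r\n"
)

def load_relations(f):
    """Return a dict mapping each (namespace, id) subject to its (relation, object) pairs."""
    parents = defaultdict(list)
    for subj_ns, subj_id, rel, obj_ns, obj_id in csv.reader(f):
        parents[(subj_ns, subj_id)].append((rel, (obj_ns, obj_id)))
    return parents

def ancestors(entity, parents):
    """Collect every entity reachable by following isa/partof links upward."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

parents = load_relations(io.StringIO(SAMPLE, newline=""))
print(sorted(ancestors(("HGNC", "PIK3CA"), parents)))
# → [('BE', 'PI3K'), ('BE', 'PI3K_p110')]
```

Keeping the relation label alongside each parent lets a caller distinguish family membership from complex membership if that matters downstream.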

The `equivalences.csv` file consists of three columns: (1) the namespace of an outside entity (e.g., `BEL`, `PFAM`), (2) the identifier of the outside entity in the namespace given in the first column, and (3) the equivalent entity in the `BE` namespace.
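The three-column layout lends itself to a simple lookup table. The following is a sketch under the assumption that the file has no header row; the sample identifiers and the helper names `load_equivalences` and `invert` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the three-column layout described above;
# the identifiers are placeholders, not real entries.
SAMPLE = (
    "BEL,Example Family,EXAMPLE\r\n"
    "PFAM,PF00000,EXAMPLE\r\n"
)

def load_equivalences(f):
    """Map (outside namespace, outside id) to the equivalent BE identifier."""
    to_be = {}
    for namespace, outside_id, be_id in csv.reader(f):
        to_be[(namespace, outside_id)] = be_id
    return to_be

def invert(to_be):
    """Map each BE identifier to the list of outside entries equivalent to it."""
    from_be = defaultdict(list)
    for outside, be_id in to_be.items():
        from_be[be_id].append(outside)
    return dict(from_be)

to_be = load_equivalences(io.StringIO(SAMPLE, newline=""))
from_be = invert(to_be)
print(to_be[("PFAM", "PF00000")])
print(from_be["EXAMPLE"])
```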

Grounding Map

Using mechanisms extracted from text mining to explain biological datasets requires that the entities in text are correctly grounded to the canonical names and IDs of genes, proteins, and chemicals. The problem is that simple lookups based on string matching often fail, particularly for protein families and named complexes, which appear frequently in text but lack corresponding entries in databases.

The grounding map addresses this by providing explicit grounding for frequently encountered entities in the biological literature. The text strings were drawn from a corpus of roughly 32,000 papers focused on growth factor signaling in cancer.

Entities are grounded to the following databases:

Note: Some text strings in the map have no grounding. This was originally used to identify entities that represent parsing errors and that should not be included in downstream output. For example, “MAP” is a degenerate extraction that could signify many entities, including MAP kinase, MAP kinase inhibitor, MAP kinase kinase, etc. However, these empty entries could be used differently depending on the downstream application.
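One way a downstream application might treat these empty entries is to distinguish three cases: grounded, deliberately ungrounded, and absent from the map. The sketch below uses a hypothetical in-memory dictionary rather than the actual file layout, which is not described here; the entries and the `ground` function are assumptions for illustration.

```python
# Hypothetical in-memory grounding map: text string -> {namespace: identifier}.
# An empty dict stands for a string that is listed but has no grounding.
GROUNDING_MAP = {
    "AKT": {"BE": "AKT"},
    "MAP": {},
}

def ground(text, grounding_map):
    """Return (status, refs); status is 'grounded', 'ungrounded', or 'unknown'."""
    if text not in grounding_map:
        return "unknown", None
    refs = grounding_map[text]
    if refs:
        return "grounded", refs
    return "ungrounded", None

for text in ["AKT", "MAP", "XYZ"]:
    print(text, ground(text, GROUNDING_MAP))
```

An application that follows the original intent would drop `ungrounded` strings from its output, while `unknown` strings could fall back to ordinary string matching.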

Gene prefixes

The file `gene_prefixes.csv` enumerates prefixes and suffixes frequently appended to named entities. Some of these represent subtleties of experimental context (for example, that a protein of interest was tagged with a fluorescent protein in an experiment) that can safely be ignored when determining the logic of a sentence. However, others carry essential meaning: for example, a sentence describing the effect of 'AKT shRNA' on a downstream target has the opposite meaning of a sentence involving 'AKT', because 'AKT shRNA' represents inhibition of AKT by genetic silencing.

The patterns included in this file were found by manually reviewing 70,000 named entities extracted by the REACH parser from a corpus of roughly 32,000 papers focused on growth factor signaling.

**Important note:** the prefixes/suffixes may be applied additively.

The file contains three columns:

  1. A case-sensitive pattern, e.g., `mEGFP-{Gene name}`, where `{Gene name}` represents a protein/gene name.
  2. A category, described below.
  3. Notes: spelling out acronyms, etc.
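To illustrate how such patterns could be matched against an extracted entity string, here is a rough sketch that turns each pattern into a case-sensitive regular expression and peels affixes off repeatedly. It assumes each pattern contains exactly one `{Gene name}` placeholder. The second pattern is a hypothetical one modelled on the 'AKT shRNA' example, and `pattern_to_regex` and `strip_affixes` are illustrative names.

```python
import re

PLACEHOLDER = "{Gene name}"

# The first pattern is the example given above; the second is a
# hypothetical pattern modelled on the 'AKT shRNA' example.
PATTERNS = ["mEGFP-{Gene name}", "{Gene name} shRNA"]

def pattern_to_regex(pattern):
    """Compile a case-sensitive regex whose 'gene' group captures the placeholder."""
    before, after = pattern.split(PLACEHOLDER)
    return re.compile(re.escape(before) + r"(?P<gene>.+)" + re.escape(after))

def strip_affixes(text, patterns):
    """Peel off matching affixes repeatedly; return (core name, matched patterns)."""
    regexes = [(p, pattern_to_regex(p)) for p in patterns]
    matched = []
    changed = True
    while changed:
        changed = False
        for pattern, regex in regexes:
            m = regex.fullmatch(text)
            # Only accept matches that actually shorten the text, so the loop ends.
            if m and m.group("gene") != text:
                text = m.group("gene")
                matched.append(pattern)
                changed = True
    return text, matched

print(strip_affixes("mEGFP-AKT shRNA", PATTERNS))
# → ('AKT', ['mEGFP-{Gene name}', '{Gene name} shRNA'])
```

Returning the list of matched patterns, rather than only the core name, lets the caller look up each pattern's category and decide whether the affix can be ignored or must be interpreted.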

The category of the prefix/suffix determines whether it can be stripped off with minimal effect on the meaning, or whether it carries meaning that needs to be incorporated by a parser. The categories are as follows:

Contributing

Contributions are welcome! If making additions or revisions to the CSV files, take care to handle quotations and newlines correctly. This allows diffs to be handled correctly so that changes can be reviewed. Please submit updates via pull requests on GitHub.

The CSV files in the Bioentities repo are set up to be edited natively using Microsoft Excel. The CSV files in the repo have Windows line terminators ('\r\n'), and are not ragged (i.e., missing entries in a row are padded out with empty strings to reach the full width of the longest row).

To preserve correct newlines, take the following steps:

  1. If saving from Excel (Windows or Mac OS X), save to the “Windows Comma Separated (.csv)” format.

  2. If reading (or writing) the files using a Python script, use the following set of csv format parameters::

    csvreader = csv.reader(f, delimiter=',', quotechar='"',
                           quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')
    
  3. If editing the files on Linux, post-process files using `unix2dos` or a similar program.
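For the writing side, a minimal sketch of the conventions above (Windows line terminators, minimal quoting, rows padded to equal width) might look like the following. The `write_rows` helper and the sample rows are hypothetical. Note that Python's `csv.reader` ignores `lineterminator`, so that parameter only takes effect when writing.

```python
import csv
import io

def write_rows(f, rows):
    """Write rows padded to equal width, with minimal quoting and CRLF terminators."""
    width = max((len(row) for row in rows), default=0)
    writer = csv.writer(f, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    for row in rows:
        # Pad short rows with empty strings so the file is not ragged.
        writer.writerow(list(row) + [""] * (width - len(row)))

# In a script, open the target with newline="" so Python does not translate
# the terminators, e.g. open(path, "w", newline="") (path is hypothetical).
buffer = io.StringIO(newline="")
write_rows(buffer, [["a", "b", "c"], ["d, e", "f"]])
print(repr(buffer.getvalue()))
# → 'a,b,c\r\n"d, e",f,\r\n'
```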


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.