CD2H gitForager

FASTGenomics/FASTGenomics_Data_Package_Format

Name: FASTGenomics_Data_Package_Format

Owner: FASTGenomics

Description: Description of the FASTGenomics Data Package Format

Created: 2017-10-19 13:02:32.0

Updated: 2017-10-24 13:36:31.0

Pushed: 2017-12-22 10:09:42.0

Homepage:

Size: 12

Language: null

README

The FASTGenomics Data Package Format

Introduction

Single-cell RNA-seq datasets typically consist of several data tables that are all required to understand an experiment. The FASTGenomics ecosystem for single-cell RNA-seq analyses provides functionality to make these analyses as easy and convenient as possible. To enable data analysis with FASTGenomics, the dataset must be provided in a defined format, which is detailed below. Briefly, a dataset consists of files containing expression data, metadata describing cells and genes, and the experimental conditions. To reduce disk space, all these files are bundled into one ZIP file. In addition to the package description below, the example folder in this repository contains correctly structured, but not zipped, files that illustrate what a FASTGenomics Data Package should look like. Furthermore, an R-based step-by-step tutorial explains how to create a FASTGenomics Data Package.

Components of a FASTGenomics Data Package

The following table gives an overview of files that have to be included in a FASTGenomics dataset package:

File	Description
manifest.yml	Details about the dataset package, including file definitions and dataset description.
expression_data.tsv	Expression data for Entrez ID-coded genes in sparse FASTGenomics format used for data analysis.
cell_metadata.tsv	Tab-separated file with metadata about cells used in the analysis.
gene_metadata.tsv	File containing the gene (=Entrez) IDs used for analysis.
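
The file list above can be checked and bundled with a few lines of Python. The following is only a minimal sketch, not part of the FASTGenomics tooling: the helper names `missing_files` and `bundle_package` are hypothetical, the file names are the ones from the table (a manifest may point to differently named data files), and placing all files at the top level of the archive is an assumption.

```python
import zipfile
from pathlib import Path

# File names as listed in the table; a real manifest may name the
# data files differently (assumption: these defaults are used).
REQUIRED_FILES = (
    "manifest.yml",
    "expression_data.tsv",
    "cell_metadata.tsv",
    "gene_metadata.tsv",
)


def missing_files(package_dir):
    """Return the required file names that are absent from package_dir."""
    folder = Path(package_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def bundle_package(package_dir, zip_path):
    """Zip the required files after checking that all of them exist.

    Files are stored flat at the top level of the archive (assumption).
    """
    missing = missing_files(package_dir)
    if missing:
        raise FileNotFoundError("missing package files: " + ", ".join(missing))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED_FILES:
            archive.write(Path(package_dir) / name, arcname=name)
    return zip_path
```

Calling `bundle_package("example", "dataset.zip")` on a folder that lacks one of the four files raises an error naming the missing files instead of producing an incomplete archive.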

Data Package Description

The data package description is supplied via the manifest.yml file, which has the following structure:

schema_version: 2.1
data:
  cell_metadata:
    - file:
    - organism:
    - [optional] batch_column:
  gene_metadata:
    - file:
  expression_data:
    - file:
  [optional] supplemental:
    - unconsidered_genes:
    - expression_data:
      - file:
    - gene_metadata:
      - file:
metadata:
  - title:
  - technology: <single-cell RNA-seq technology used to generate the dataset, e.g. MARS-seq>
  - version: <version of the dataset, starting from 1>
  - contact: <contact person and/or institution, e.g. Comma Soft AG>
  - description: <description of the dataset, including experimental setting, link to public repository and ID of dataset, as well as link to publication, where applicable>
  - short_description: <short description of dataset, e.g. the experimental setting>
  - preprocessing:
    - notes: <description of how the dataset has been prepared, including cell and gene exclusion criteria, etc.>
    - tools:
      - <tools used for dataset preparation, e.g. FGpackageR>
      - <another tool used for dataset preparation>
    - [optional] image: <optional file name for image shown in the FASTGenomics data store, e.g. image.png>

File Format Specifications

Sparse Expression File Format

Single-cell RNA-seq data is typically zero-inflated, i.e. for many genes no expression values are available. In a dense gene expression matrix, each row represents a gene, each column represents a cell, and each entry holds the expression value of that gene in that cell; unexpressed and uncaptured genes appear as zeros. Depending on the technology used to generate the single-cell expression dataset, more than 90% of the entries in a dense matrix may be zeros. To save disk space, FASTGenomics therefore stores expression data in a sparse matrix format. A FASTGenomics sparse expression matrix file is a simple text file storing data in three tab-separated columns with a header line.
This case-sensitive header line stores the mandatory column names (cellId, entrezId and expressionValue), each followed by its data type (Integer, Number or String) and separated from it by an asterisk (*). The first column contains zero-based integer values identifying cells. The second column contains the identifier representing a gene (the Entrez ID for genes analyzed in FASTGenomics, https://fastgenomics.org). The third column holds the expression value of this gene in that cell (see example below).

cellId*Integer	entrezId*Integer	expressionValue*Number
0	12544	4.0
0	67608	1.0
0	12390	1.0
1	12544	5.0
1	67608	1.2
1	12390	3.3
2	12544	4.5
2	67608	10.0
2	12390	1.2

Gene expression values of cells examined in FASTGenomics are stored in a file defined in the manifest describing the data package, e.g. expression_data.tsv. Expression data excluded from analysis (see below) can be included in the dataset to increase transparency.

Gene IDs

The FASTGenomics pipeline works with Entrez IDs for genes. Many single-cell expression datasets stored in public databases (e.g. Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo) encode genes with other ID types (e.g. ENSEMBL IDs or gene symbols). In this case, IDs must be mapped to Entrez IDs. Genes for which no Entrez ID or several Entrez IDs are available should be excluded from data analysis.

Gene Metadata

Gene metadata is supplied in tab-separated text files with a header line. Gene metadata must be defined for all genes used in the analysis; the first column of the respective file must be named entrezId*Integer, while further columns, e.g. mappings to gene symbols, are optional.
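
To illustrate the sparse expression format and the matching gene metadata header described above, here is a minimal Python sketch. It is not taken from the FASTGenomics tooling: the function names are hypothetical, the input is assumed to be a small dense matrix held as nested lists (rows are genes, columns are cells), and zero entries are assumed to be simply omitted, which the text implies but does not state explicitly.

```python
import csv
import io


def dense_to_sparse_rows(dense, entrez_ids):
    """Yield (cellId, entrezId, expressionValue) for every non-zero entry.

    dense[g][c] is the expression of gene g in cell c;
    entrez_ids[g] is the Entrez ID belonging to row g.
    """
    n_cells = len(dense[0]) if dense else 0
    for cell_id in range(n_cells):  # zero-based cell IDs
        for gene_row, entrez_id in zip(dense, entrez_ids):
            value = gene_row[cell_id]
            if value != 0:  # zeros are left out (assumption)
                yield cell_id, entrez_id, float(value)


def write_expression_tsv(rows, handle):
    """Write the three-column sparse file with its typed header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["cellId*Integer", "entrezId*Integer", "expressionValue*Number"])
    writer.writerows(rows)


def write_gene_metadata_tsv(entrez_ids, handle):
    """Write a gene metadata file containing only the mandatory ID column."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["entrezId*Integer"])
    writer.writerows([gene] for gene in entrez_ids)


if __name__ == "__main__":
    dense = [[4.0, 0, 4.5], [0, 1.2, 0]]  # 2 genes x 3 cells, made-up values
    entrez_ids = [12544, 67608]
    buffer = io.StringIO()
    write_expression_tsv(dense_to_sparse_rows(dense, entrez_ids), buffer)
    print(buffer.getvalue(), end="")
```

The gene metadata writer lists every gene of the dense matrix, including genes without any non-zero value, because gene metadata must be defined for all genes used in the analysis.
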
The optional file containing metadata on excluded genes lists their IDs in the first column, with an appropriate type definition (e.g. gene*String), along with optional further information (e.g. reason for exclusion).

Cell Metadata

Cell metadata is provided in a tab-separated text file that contains a header line. The first column should be cellId*Integer; further columns can be added with information about cell type, treatment, batch effects, published cluster assignments, etc.

Image

If an image is supplied in the manifest, use a common image format such as png or jpg (e.g. dropseq.png).

Package Bundling

All files to be included in the FASTGenomics data package have to be packed into one ZIP file.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.