Name: FASTGenomics_Data_Package_Format
Owner: FASTGenomics
Description: Description of the FASTGenomics Data Package Format
Created: 2017-10-19 13:02:32.0
Updated: 2017-10-24 13:36:31.0
Pushed: 2017-12-22 10:09:42.0
Size: 12
Language: null
Single-cell RNA-seq datasets typically consist of several data tables that are
all required for the understanding of an experiment. The FASTGenomics
ecosystem for single-cell RNA-seq analyses provides functionality to make
these analyses as easy and convenient as possible. To enable the data analysis
with FASTGenomics, the dataset must be provided in
a defined format which is detailed below. Briefly, a dataset consists of files
containing expression data, metadata describing cells and genes as well as the
experimental conditions. To reduce disk space, all these files are bundled
into one ZIP file.
Apart from the package description below, the example
folder in this repository
contains correctly structured, though unzipped, files that illustrate what a FASTGenomics
Data Package should look like. Furthermore, an R-based step-by-step tutorial explains
how to create a FASTGenomics Data Package.
The following table gives an overview of the files that have to be included in a FASTGenomics dataset package:
File | Description |
---|---|
manifest.yml | Details about the dataset package, including file definitions and dataset description. |
expression_data.tsv | Expression data for Entrez ID-coded genes in sparse FASTGenomics format used for data analysis. |
cell_metadata.tsv | Tab-separated file with metadata about cells used in the analysis. |
gene_metadata.tsv | File containing the gene (=Entrez) IDs used for analysis. |

The data package description is supplied via the manifest.yml file, which has the following structure:
Single-cell RNA-seq data is typically zero-inflated, i.e. for many genes no
gene expression values are available. In a dense gene expression matrix, each
column represents a cell, each row represents a gene, and each entry holds the
expression value of a particular gene in a particular cell. Expression values
of unexpressed and uncaptured genes are
represented as zeros. Depending on the technology used to generate the
single-cell expression dataset, the proportion of zeros in a dense matrix may
be higher than 90%. To save disk space, FASTGenomics therefore uses a sparse
matrix data format to store expression data. A FASTGenomics sparse expression
matrix file is a simple text file storing data in three tab-separated columns
with a header line. This case-sensitive header line stores the mandatory column
names (cellId, entrezId and expressionValue), each followed by its data type
(Integer, Number or String) and separated from it by an asterisk (*).
The first column contains zero-based integer values identifying cells.
The second column contains the identifier representing a gene (the Entrez ID
for genes analyzed in FASTGenomics). The third column holds the expression
value of this gene in a particular cell (see example below).
cellId*Integer	entrezId*Integer	expressionValue*Number
0	12544	4.0
0	67608	1.0
0	12390	1.0
1	12544	5.0
1	67608	1.2
1	12390	3.3
2	12544	4.5
2	67608	10.0
2	12390	1.2
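To illustrate the sparse format described above, the following minimal Python sketch converts a small dense matrix (genes in rows, cells in columns) into the three-column layout. The function names `dense_to_sparse_rows` and `write_sparse` and the toy values are illustrative assumptions and are not part of FASTGenomics; the cell-by-cell row order is also just a choice made here.

```python
import csv
import io


def dense_to_sparse_rows(dense, entrez_ids):
    """Yield (cellId, entrezId, expressionValue) for every non-zero entry.

    dense: list of gene rows, each holding one value per cell.
    entrez_ids: the Entrez ID belonging to each row of `dense`.
    """
    n_cells = len(dense[0]) if dense else 0
    for cell_id in range(n_cells):  # zero-based cell identifiers
        for gene_row, entrez_id in zip(dense, entrez_ids):
            value = gene_row[cell_id]
            if value != 0:  # zeros are omitted in the sparse format
                yield cell_id, entrez_id, float(value)


def write_sparse(rows, handle):
    """Write the typed header line followed by tab-separated rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["cellId*Integer", "entrezId*Integer", "expressionValue*Number"]
    )
    writer.writerows(rows)


# Toy data (made up for illustration): 3 genes x 3 cells.
dense = [
    [4.0, 5.0, 0.0],   # gene 12544
    [1.0, 0.0, 10.0],  # gene 67608
    [0.0, 3.3, 1.2],   # gene 12390
]
entrez_ids = [12544, 67608, 12390]

buffer = io.StringIO()
write_sparse(dense_to_sparse_rows(dense, entrez_ids), buffer)
sparse_text = buffer.getvalue()
```

Writing to a real file instead of `io.StringIO` only requires passing a handle opened with `open(path, "w", newline="")`.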
Gene expression values of cells examined in FASTGenomics are stored in
a file defined in the manifest describing the data package, e.g.
expression_data.tsv. Expression data excluded from analysis (see below) can
be included in the dataset to increase transparency.
The FASTGenomics pipeline works with Entrez IDs for genes. Many single-cell expression datasets stored in public databases (e.g. Gene Expression Omnibus) encode genes with other ID types (e.g. ENSEMBL IDs or gene symbols). In this case, IDs must be mapped to Entrez IDs. Genes for which no Entrez ID or several Entrez IDs are available should be excluded from data analysis.
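The exclusion rule above can be sketched in Python as follows. This is only an illustration: the function `map_to_entrez`, the lookup table and the placeholder gene names are assumptions, and in practice the lookup would come from an annotation resource of your choice.

```python
def map_to_entrez(gene_ids, id_to_entrez):
    """Split source gene IDs into uniquely mapped and excluded genes.

    id_to_entrez: hypothetical lookup from a source ID (e.g. a gene
    symbol) to a collection of candidate Entrez IDs.
    Returns (mapped, excluded); `excluded` records the reason.
    """
    mapped = {}
    excluded = {}
    for gene_id in gene_ids:
        candidates = sorted(set(id_to_entrez.get(gene_id, ())))
        if len(candidates) == 1:
            mapped[gene_id] = candidates[0]
        elif not candidates:
            excluded[gene_id] = "no Entrez ID"
        else:
            excluded[gene_id] = "multiple Entrez IDs"
    return mapped, excluded


# Placeholder IDs, made up for illustration only.
lookup = {"GENE_A": [101], "GENE_B": [102, 103]}
mapped, excluded = map_to_entrez(["GENE_A", "GENE_B", "GENE_C"], lookup)
```

The `excluded` dictionary can later be written to the optional file on excluded genes, with the reason for exclusion as an additional column.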
Gene metadata is supplied in tab-separated text files with a header line. Gene
metadata must be defined for all genes used in the analysis; the first column
of the respective file must be named entrezId*Integer. Further columns,
e.g. containing mappings to gene symbols, are optional. The optional file
containing metadata on excluded genes lists their IDs in the first column
with an appropriate type definition (e.g. gene*String), along with optional
further information (e.g. the reason for exclusion).
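A header line of this kind can be checked before packaging. The sketch below is a hypothetical helper, not part of FASTGenomics; it assumes that every column, not only the first, follows the name*Type convention with one of the three types mentioned earlier. The `symbol` column is an invented example of an optional column.

```python
ALLOWED_TYPES = {"Integer", "Number", "String"}


def parse_header(header_line):
    """Split a tab-separated header into (name, type) pairs."""
    columns = []
    for field in header_line.rstrip("\n").split("\t"):
        name, sep, type_name = field.partition("*")
        if not sep or not name or type_name not in ALLOWED_TYPES:
            raise ValueError(f"invalid column definition: {field!r}")
        columns.append((name, type_name))
    return columns


def check_gene_metadata_header(header_line):
    """Require entrezId*Integer as the first column of gene metadata."""
    columns = parse_header(header_line)
    if columns[0] != ("entrezId", "Integer"):
        raise ValueError("first column must be entrezId*Integer")
    return columns


columns = check_gene_metadata_header("entrezId*Integer\tsymbol*String")
```

Because the header is case-sensitive, a header such as `EntrezId*Integer` is rejected by this check.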
Cell metadata is provided in a tab-separated text file that contains a header line.
The first column should be cellId*Integer. Further columns can be added with information
about cell type, treatment, batch effects, published cluster assignments, etc.
If an image is supplied in the manifest, use a commonly used image format such as
png or jpg (e.g. dropseq.png).
All files to be included in the FASTGenomics data package have to be packed in one ZIP file.
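The packaging step can be done with any ZIP tool. A minimal sketch using Python's standard `zipfile` module is shown below; the function `build_package` is a hypothetical helper, and placing all files at the top level of the archive is an assumption made here rather than a documented requirement.

```python
import zipfile
from pathlib import Path


def build_package(source_dir, zip_path):
    """Bundle all regular files in source_dir into one ZIP file.

    zip_path should lie outside source_dir, otherwise the archive
    being written would be picked up as an input file.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as archive:
        for path in sorted(source_dir.iterdir()):
            if path.is_file():
                # Assumption: store files flat, without folder prefix.
                archive.write(path, arcname=path.name)
    return zip_path
```

For example, `build_package("example", "my_dataset.zip")` would bundle the files of a folder named `example` into `my_dataset.zip` in the current working directory.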