CBIIT/HPC_DME_APIs

Name: HPC_DME_APIs

Owner: NCI CBIIT Development Sandbox

Description: NCI High Performance Computing Data Management Services Common APIs

Created: 2016-11-09 19:01:55.0

Updated: 2017-11-03 21:26:59.0

Pushed: 2018-01-14 21:50:32.0

Size: 436483

Language: JavaScript

README

HPC_DME_APIs

NCI High Performance Computing Data Management Services Common APIs

One of the most significant challenges for an effective high performance computing (HPC) support effort is effective data management, i.e., effective tracking, annotation, and staging of digital datasets, accompanied by a data life cycle plan/policy for these datasets. While frequently not considered an HPC challenge or opportunity, an effective solution is needed to contain costs for stored data while increasing the scientific usefulness of data created in the era of "big data", where analysis of datasets can take days and the total cost of storing and maintaining large datasets continues to tax personnel and financial resources. Without a reliable managed dataset solution, large datasets are frequently maintained in multiple isolated copies across physical storage, leading to unnecessary expense as additional storage is required for analysis and storage of new data. A managed, secured, and high-availability solution will minimize the need for maintaining unnecessarily redundant copies of large datasets. Even with projected declines in the cost of physical storage, investment in managing stored data without associated annotation will provide only minimal (at best) long-term scientific usefulness or support to advance the mission of the NCI.

Annotation and registration of large datasets is essential for managed datasets to effectively deliver broader scientific impact and advance the mission of the NCI. Consistent with efforts already underway at the NIH within the Big Data to Knowledge (BD2K) program, annotation and registration of datasets will make managed datasets useful to the extended community of current and future cancer investigators. Creating and delivering metadata and tracking the utilization of datasets will provide key insight into the scientific impact of each maintained dataset.

Without an effective data management solution, the HPC effort will struggle with staging data for analysis, recovering generated datasets, and the inefficiencies created by insufficient physical storage and by recomputing results that were already completed. Therefore, we believe that:

- NCI is in critical need of advancing its core scientific and technological means of data management and services for large, diverse, distributed, and heterogeneous datasets.
- Large datasets are currently maintained in multiple isolated copies across physical storage, leading to unnecessary expense.
- Annotation and registration of datasets is essential for managed datasets to effectively deliver broader scientific impact and enable the full power of personalized medicine.
- Strategically, the absence of an effective data management solution presents a barrier to supporting emerging efforts to leverage the breadth of generated datasets for developing computationally intensive and data-intensive predictive models, as well as efforts to utilize cloud resources for collaboration and analysis.

The NCI HPC DME Data Management initiative aims to overcome these challenges and pave the way for big-data-based personalized medicine and for innovative exploration and discovery in cancer treatment and prevention.

Putting development deliverables into GitHub involves two steps:

A. First, we will expose development documentation, the user guide, and training material.
B. After our codebase is stabilized and the HPC DME is in operation or DevOps mode, the codebase will be ported as well.

The current support model is that the public has read access to all documentation and code posted here on GitHub. Only the NCI/Leidos Development/Support team will have write access to these areas.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.