LLNL/spark-hdf5

Name: spark-hdf5

Owner: Lawrence Livermore National Laboratory

Description: A plugin to enable Apache Spark to read HDF5 files

Created: 2016-08-03 18:00:01

Updated: 2018-03-10 16:36:23

Pushed: 2016-11-17 20:47:27

Homepage:

Size: 9290

Language: Scala


README

Spark-HDF5

Progress

The plugin can read single-dimensional arrays from HDF5 files.

The following types are supported:

Setup

If you are using the sbt-spark-package plugin, the easiest way to use the package is to require it from the Spark Packages website:

```
spDependencies += "LLNL/spark-hdf5:0.0.4"
```

Otherwise, download the latest release jar and include it on your classpath.

Usage
```scala
import gov.llnl.spark.hdf._

val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")
df.show
```
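Once loaded, the dataset behaves like any other Spark DataFrame. A brief sketch of follow-on queries (the column name `value` and the file/dataset paths are assumptions for illustration; check the actual schema with `printSchema` first):

```scala
import gov.llnl.spark.hdf._

// Load a one-dimensional dataset (example path and dataset name)
val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")

// Inspect the inferred schema before relying on column names
df.printSchema()

// Standard DataFrame operations apply; "value" is an assumed column name
val positives = df.filter(df("value") > 0).count()

// The DataFrame can also be registered for Spark SQL queries
df.registerTempTable("hdf5_data")
sqlContext.sql("SELECT COUNT(*) FROM hdf5_data").show()
```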

You can start a Spark REPL with the `console` target:

```
sbt console
```

This will fetch all of the dependencies, set up a local Spark instance, and start a Spark REPL with the plugin loaded.

Options

The following options can be set:

Key        | Default | Description
-----------|---------|------------
extension  | h5      | The file extension of data
chunk size | 10000   | The maximum number of elements to be read in a single scan
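These options can be passed through the standard Spark `DataFrameReader.option` call. A minimal sketch, assuming the option keys match the table above exactly (including the space in `chunk size`) and using example paths:

```scala
import gov.llnl.spark.hdf._

// Read files with a .he5 extension, scanning 50000 elements at a time
// (keys "extension" and "chunk size" are taken from the options table)
val df = sqlContext.read
  .option("extension", "he5")
  .option("chunk size", "50000")
  .hdf5("path/to/file.he5", "/dataset")
```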

Testing

The plugin includes a test suite, which can be run through SBT:

```
sbt test
```

Roadmap
Release

This code was developed at Lawrence Livermore National Laboratory (LLNL) and is available under the Apache 2.0 license (LLNL-CODE-699384).


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.