twitter/hdfs-du

Name: hdfs-du

Owner: Twitter, Inc.

Description: Visualize your HDFS cluster usage

Created: 2012-08-07 17:52:22.0

Updated: 2017-12-29 06:35:49.0

Pushed: 2017-04-27 09:58:44.0

Homepage: https://twitter.com/hdfsdu

Size: 1303

Language: JavaScript

README

HDFS-DU

hdfsdu UI screenshot

HDFS-DU is an interactive visualization of the Hadoop distributed file system. The project aims to monitor different snapshots of the entire HDFS system interactively: it shows the size of each folder and the rate at which that size increases or decreases, and it highlights inefficient file storage.

HDFS-DU provides the following in a web UI:

HDFS-DU is built using the following front-end technologies:

Follow @hdfsdu on Twitter to stay in touch!

Examples

Below is a screenshot of the HDFS-DU UI. The UI is made up of two linked visualizations. The left visualization is a tree-map which shows parent-child relationships through containment. The right visualization is a file-tree, which displays two levels of depth from the current selected node in the file system. The file-tree visualization displays extra information for each node on hover.
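
The two-level file-tree view can be pictured as a depth-limited walk from the selected node. The sketch below is only an illustration of that idea, not HDFS-DU's front-end code: the nested-dict node shape (`name`, `children`) and the function name `view_from` are assumptions made for this example.

```python
def view_from(node, depth=2):
    """Copy `node` and at most `depth` levels of descendants below it.

    The node shape is hypothetical: a dict with a "name" key and an
    optional "children" list of nodes of the same shape.
    """
    view = {"name": node["name"]}
    children = node.get("children", [])
    if depth > 0 and children:
        view["children"] = [view_from(child, depth - 1) for child in children]
    return view


# Hypothetical sample tree: / -> user -> alice -> data
tree = {
    "name": "/",
    "children": [
        {"name": "user", "children": [
            {"name": "alice", "children": [{"name": "data"}]},
        ]},
    ],
}

view = view_from(tree)
# "alice" is two levels below "/", so it is kept; "data" is three levels
# below and is left out until the user drills down.
print(view)
```

Drilling down then amounts to calling `view_from` again with the clicked node as the new starting point.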

hdfsdu UI screenshot

You can drill down by clicking on nodes in either the tree-map or the file-tree.

There are two possible layouts for the tree-map. In the first layout the area of a node is proportional to the total file size of its descendants. In the second layout the area of a node is proportional to the count of its descendants.
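
As a rough sketch of how the two layouts differ, the example below rolls up bytes and file counts over a small directory tree and derives each child's share of its parent's area under either weighting. It is not taken from HDFS-DU's source: the node fields (`bytes`, `files`, `children`), the layout names `"size"` and `"count"`, and the reading of "descendants" as files stored beneath a directory are all assumptions for illustration.

```python
def annotate(node):
    """Add total_bytes and total_files (own files plus all descendants)."""
    total_bytes = node.get("bytes", 0)
    total_files = node.get("files", 0)
    for child in node.get("children", []):
        annotate(child)
        total_bytes += child["total_bytes"]
        total_files += child["total_files"]
    node["total_bytes"] = total_bytes
    node["total_files"] = total_files
    return node


def area_shares(node, layout):
    """Fraction of the parent's area given to each child.

    layout="size"  weights children by total bytes beneath them;
    layout="count" weights children by total files beneath them.
    """
    key = "total_bytes" if layout == "size" else "total_files"
    weights = {c["name"]: c[key] for c in node.get("children", [])}
    total = sum(weights.values())
    return {name: (w / total if total else 0.0) for name, w in weights.items()}


# Hypothetical data: "logs" holds many small files, "warehouse" a few large ones.
root = annotate({
    "name": "/",
    "children": [
        {"name": "logs", "bytes": 100, "files": 90},
        {"name": "warehouse", "bytes": 900, "files": 10},
    ],
})

print(area_shares(root, "size"))   # {'logs': 0.1, 'warehouse': 0.9}
print(area_shares(root, "count"))  # {'logs': 0.9, 'warehouse': 0.1}
```

The same directory can dominate one layout and nearly vanish in the other, which is why switching between them is useful.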

To compute the color of a node, its size, including all its descendants, is divided by the number of those descendants, which gives an average size per descendant. The color is assigned from this value so that a lighter color means more files for a given size. This helps to highlight inefficient nodes that contain too many small files.
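
The color rule can be sketched as an average followed by a normalization. The example below is a minimal illustration under stated assumptions: the function names, the `(total_bytes, total_files)` input shape, and the linear scaling to a 0..1 lightness value are all hypothetical, and the actual color scale used by HDFS-DU may differ.

```python
def bytes_per_file(total_bytes, total_files):
    """Average size per descendant; 0.0 when there are no descendants."""
    return total_bytes / total_files if total_files else 0.0


def lightness(nodes):
    """Map node name -> lightness in [0, 1]; 1.0 is the lightest color.

    `nodes` maps a name to (total_bytes, total_files). The node with the
    smallest average file size gets the lightest color. Linear scaling
    between the smallest and largest average is an assumption.
    """
    avgs = {name: bytes_per_file(b, f) for name, (b, f) in nodes.items()}
    if not avgs:
        return {}
    low, high = min(avgs.values()), max(avgs.values())
    span = high - low
    return {
        name: (1.0 - (avg - low) / span if span else 0.5)
        for name, avg in avgs.items()
    }


# Same total size, very different file counts (hypothetical numbers).
shades = lightness({"logs": (1000, 500), "warehouse": (1000, 2)})
print(shades)  # {'logs': 1.0, 'warehouse': 0.0}
```

Here `logs` stores the same number of bytes in far more files, so it comes out lightest and stands out as a candidate for small-file cleanup.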

hdfsdu UI screenshot

Quickstart

To get started with hdfs-du, first clone the hdfs-du GitHub repository:

git clone https://github.com/twitter/hdfs-du.git
cd hdfs-du

Next, you can try running the hdfs-du demo on your local machine. The demo starts a local web server which serves the front-end client resources and sample data. Start the demo with the following command and then browse to http://localhost:20000/index.html:

./demo.sh

Running HDFS-DU with your own data

To visualize your own cluster, you need to generate an HDFS-DU data set. Currently this is a multi-step process:

Simplifying this process is certainly possible (hey, it was hack week :)

First, SSH to your secondary name node, dump the fsimage in delimited format, and copy to HDFS.

hadoop oiv -i /path/to/current/fsimage -o fsimage-delimited.tsv -p Delimited
hadoop fs -copyFromLocal fsimage-delimited.tsv .

Now let's process the fsimage export. Uncomment the register statement in pig/src/test/resources/hdfsdu.pig, build the UDF, and run Pig to process the export.

cd /path/to/hdfs-du/pig
mvn package
pig -param INPUT=fsimage-delimited.tsv -param OUTPUT=hdfsdu.out pig/src/test/resources/hdfsdu.pig
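
To give a feel for the kind of roll-up a directory-usage data set needs, the sketch below credits each file's size to every ancestor directory. It is not the project's Pig script or UDF, whose logic is not shown here; the `(path, size)` input shape and the function name `rollup` are assumptions, and reading those pairs out of the delimited fsimage export is left out.

```python
import posixpath
from collections import defaultdict


def rollup(files):
    """Aggregate (path, size) pairs into per-directory totals.

    Returns {directory: (total_bytes, file_count)}, where each file is
    counted in every directory above it, up to and including "/".
    """
    totals = defaultdict(lambda: [0, 0])
    for path, size in files:
        parent = posixpath.dirname(path)
        while True:
            totals[parent][0] += size
            totals[parent][1] += 1
            if parent in ("/", ""):  # reached the root (or a relative top)
                break
            parent = posixpath.dirname(parent)
    return {directory: tuple(v) for directory, v in totals.items()}


# Hypothetical input: three files with their sizes in bytes.
usage = rollup([
    ("/a/b/f1", 10),
    ("/a/f2", 5),
    ("/c/f3", 1),
])
for directory in sorted(usage):
    print(directory, usage[directory])
```

On a real cluster this aggregation runs over many millions of paths, which is why it is done as a Pig job rather than on a single machine.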

Lastly, we need to copy the dataset to the local machine and perform a quick post-processing step.

hadoop fs -getmerge hdfsdu.out .
python src/main/python/leaf.py hdfsdu.out/hdfsdu.out > hdfsdu.data

Now we're ready to start HDFS-DU!

./start.sh /path/to/hdfsdu.data

Point your web browser to http://localhost:20000 and see what your cluster looks like.

How to contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on GitHub. You can submit issues on GitHub as well.

Here are some high-level goals we'd love to see contributions for:

Authors

License

Copyright 2012 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.