sara-nl/warcutils

Name: warcutils

Owner: SURFsara

Description: Library with utility classes for working with the 2014 Common Crawl warc, wet and wat files.

Forked from: norvigaward/warcutils

Created: 2016-02-04 15:30:01.0

Updated: 2017-11-28 17:11:20.0

Pushed: 2017-01-28 12:28:29.0

Homepage: null

Size: 296

Language: Java

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

warcutils

Library with utility classes for working with the March 2014 Common Crawl warc, wet and wat files on the SURFsara hadoop cluster environment.

Introduction

The March 2014 Common Crawl data is provided in the form of warc, wet and wat files. For the file format specification see: http://bibnum.bnf.fr/WARC/.

Multiple parser exist for this format, we have chosen to use and provide utilities which use the Java Web Archive Toolkit libraries: https://sbforge.org/display/JWAT/JWAT. Both readers and inputformats for regular warc.gz, warc.wat.gz and warc.wet.gz files as well as readers and inputformats for warcfiles which have been converted to sequencefiles (for more information on this see: this description of the data) are provided. In addition two implementations of Pig load functions have been provided (one for regular files and one for sequence files).

Building

Both Ant&Ivy as wel as Maven build files are provided with the project. The latest builds can also be obtained from our maven repository: http://beehub.nl/surfsara-repo/releases/SURFsara/warcutils

Usage

Please see the warcexamples project: https://github.com/norvigaward/warcexamples. For some examples that use this library.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.