newsdev/tabula

Name: tabula

Owner: NYT Newsroom Developers

Description: Tabula is a tool for liberating data tables trapped inside PDF files

Created: 2015-05-27 21:20:23.0

Updated: 2017-03-15 19:58:49.0

Pushed: 2017-03-15 19:58:53.0

Homepage: http://tabula.technology

Size: 52163

Language: CSS

README

Repo Note: The master branch is an in-development version of Tabula. It may differ substantially from the latest releases of Tabula.

As of August 2015, the master branch (and Tabula 1.1.X+) uses tabula-java instead of tabula-extractor under the hood. Previous versions of Tabula use tabula-extractor.


Tabula

Tabula helps you liberate data tables trapped inside PDF files.

© 2012-2016 Manuel Aristarán. Available under the MIT License. See AUTHORS.md and LICENSE.md.

Why Tabula?

If you've ever tried to do anything with data provided to you in PDFs, you know how painful this is – you can't easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface (check out this short screencast).

Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
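The click-and-drag test above can be approximated from a terminal. The following is a minimal sketch, assuming the `pdftotext` tool from poppler-utils is installed (it is not part of Tabula); the function names are hypothetical. It treats a PDF as text-based if any non-whitespace text can be extracted from it, which is only a rough heuristic.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and pdftotext (poppler-utils)
# is an assumed external tool, not something Tabula ships.

# Succeeds if standard input contains at least one non-whitespace character.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Succeeds if pdftotext can extract any visible text from the given PDF.
# "-" tells pdftotext to write the extracted text to standard output.
is_text_based_pdf() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}
```

For example, `is_text_based_pdf report.pdf && echo "text-based"` prints the message only when some text could be extracted. A scanned document normally yields no text, so the check fails for it.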

Security Concerns?: Tabula is designed with security in mind. Your PDF and the extracted data never touch the net – when you use Tabula, as long as your browser's URL bar says “localhost” or “127.0.0.1”, all processing takes place on your local machine. Tabula does download a list of Tabula versions from our server to alert you if Tabula has been updated (and we use hits to that list to count how often Tabula is being used); it also downloads a few badges and assets from the web.

Using Tabula

First, make sure you have a recent copy of Java installed. You can download Java here. Tabula requires a Java Runtime Environment compatible with Java 7 (i.e. Java 7, 8 or higher). If you have a problem, check Known Issues first, then report an issue.

If the program fails to run, double-check that you have Java installed and then try again.
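One way to double-check the Java requirement is to read the version that `java -version` reports. The sketch below is illustrative only, and its helper names are hypothetical. It handles both the legacy version scheme ("1.7.0_80" means Java 7) and the newer one ("11.0.2" means Java 11), then compares the result with the Java 7 minimum stated above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Tabula.

# Reduce a Java version string to its major version number.
#   legacy scheme: "1.7.0_80" -> 7, "1.8.0_292" -> 8
#   newer scheme:  "9.0.4" -> 9, "11.0.2" -> 11, "17" -> 17
java_major_version() {
    version="$1"
    case "$version" in
        1.*) version="${version#1.}" ;;   # drop the legacy "1." prefix
    esac
    echo "${version%%[!0-9]*}"            # keep only the leading digits
}

# Print the quoted version string from the first line of `java -version`,
# which is written to standard error (e.g. java version "1.8.0_292").
installed_java_version() {
    java -version 2>&1 | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Report whether the Java on PATH meets Tabula's stated minimum (Java 7).
check_java() {
    if ! command -v java >/dev/null 2>&1; then
        echo "Java not found on PATH"
        return 1
    fi
    major=$(java_major_version "$(installed_java_version)")
    if [ "${major:-0}" -ge 7 ]; then
        echo "Java $major found: new enough for Tabula"
    else
        echo "Java ${major:-unknown} found: Tabula needs Java 7 or higher"
        return 1
    fi
}
```

Running `check_java` after sourcing this file prints a one-line verdict and returns a non-zero status when Java is missing or too old.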

Known issues

There are some bugs that we're aware of that we haven't managed to fix yet. If there's not a solution here or you need more help, please go ahead and report an issue.

Incorporating Tabula into your own project

Tabula is open-source, so we'd love for you to incorporate pieces of Tabula into your own projects. The “guts” of Tabula – that is, the logic and heuristics that reconstruct tables from PDFs – is contained in the tabula-java repo. There's a JAR file that you can easily incorporate into JVM languages like Java, Scala or Clojure and it includes a command-line tool for you to automate your extraction tasks. Visit that repo for more information on how to use tabula-java on the CLI and on how Tabula exports tabula-java scripts.
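As a rough illustration of automating an extraction with the command-line tool, the sketch below wraps a single tabula-java invocation in a shell function. The jar path `tabula-java.jar`, the `TABULA_JAR` and `DRY_RUN` variables, and the function name are assumptions made for this example; the actual jar file name depends on the release you download. The `--pages`, `--format` and `--outfile` options are believed to match tabula-java's command-line help, but check `--help` for the version you are using.

```shell
#!/bin/sh
# Sketch only: TABULA_JAR, DRY_RUN and extract_tables are hypothetical names.

# Path to the tabula-java jar (assumed; use the file name of your download).
TABULA_JAR="${TABULA_JAR:-tabula-java.jar}"

# Extract the tables on every page of a PDF into a CSV file next to it,
# e.g. report.pdf -> report.csv.
# When DRY_RUN is set to a non-empty value, the command is printed, not run.
extract_tables() {
    pdf="$1"
    csv="${pdf%.pdf}.csv"
    ${DRY_RUN:+echo} java -jar "$TABULA_JAR" \
        --pages all --format CSV --outfile "$csv" "$pdf"
}
```

Calling `DRY_RUN=1 extract_tables report.pdf` shows the command without running it, which is a cheap way to verify paths before processing a whole directory of PDFs.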

Bindings:

Tabula has bindings for JRuby and R. If you end up writing bindings for another language (Python, in particular!), let us know and we'll add a link here.

Running Tabula from source (for developers)
  1. Download JRuby. You can install it from its website, or using tools like rvm or rbenv. Note that as of Tabula 1.1.0 (7875582becb2799b65586d5680782cafd399bb33), Tabula uses the JRuby 9000 series (i.e. JRuby 9.1.5.0).

  2. Download Tabula and install the Ruby dependencies. (Note: if using rvm or rbenv, ensure that JRuby is being used.)

    git clone git://github.com/tabulapdf/tabula.git
    cd tabula
    
    gem install bundler
    gem install tabula-extractor
    bundle install
    

Then, start the development server:

bundle exec rackup

(If you get encoding errors, set the JAVA_OPTS environment variable to -Dfile.encoding=utf-8)

The site instance should now be viewable at http://127.0.0.1:9292/ .

You can set a couple of options when running the server in this manner:

TABULA_DATA_DIR="/tmp/tabula" \
TABULA_DEBUG=1 \
bundle exec rackup

Alternatively, running the server as a JAR file

Testing in this manner will be closer to testing the “packaged application” version of the app.

bundle exec rake war
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar build/tabula.jar

Building a packaged application version

After performing the above steps (“Running Tabula from source”), you can compile Tabula into a standalone application:

Mac OS X

If you wish to share Tabula with other machines, you will need a codesigning certificate. Our distribution of Tabula uses a self-signed certificate. See this section of build.xml for details. If you will only be running Tabula on the machine you are building it on, you may remove this entire block (lines 44-53).

To compile the app:

rake macosx

This will result in a portable “tabula_mac.zip” archive (inside the build directory) for Mac OS X users.

Note that the Mac version bundles Java with the Tabula app. This results in a 98MB zip file, versus the 30MB zip file for other platforms, but allows users to run Tabula without having to worry about Java version incompatibilities.

Windows

You can build .exe files for the Windows target on any platform.

Download a 3.1.X (beta) copy of Launch4J.

Unzip it into the Tabula repo so that “launch4j” (with subdirectories “bin”, etc.) is in the repository root.

(If you're building on 64-bit Linux, you may need to install 32-bit libraries. On Ubuntu, for example: sudo apt-get install lib32z1 lib32ncurses5)

Then:

rake windows

This will result in a portable “tabula_win.zip” archive (inside the build directory) for Windows users.


If you have issues, you can try building manually. (These commands are for OS X/Linux and may need to be adjusted for Windows users.)

# (from the root directory of the repo)
rake war
cd launch4j
ant -f ../build.xml windows

A “tabula.exe” file will be generated in “build/windows”. To run, the exe file needs “tabula.jar” (contained in “build”) in the same directory. You can create a .zip archive by doing:

# (from the root directory of the repo)
cd build/windows
mkdir tabula
cp tabula.exe ./tabula/
cp ../tabula.jar ./tabula/
zip -r9 tabula_win.zip tabula
rm -fr tabula

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Backers

You can also support our continued work on Tabula with a one-time or monthly donation on OpenCollective. Organizations that use Tabula can also sponsor the project for acknowledgement on our official site and in this README.

Special thanks to the following users and organizations for generously supporting Tabula with donations and grants:

The John S. and James L. Knight Foundation

More acknowledgments can be found in AUTHORS.md.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.