Name: dragnet
Owner: Bombora
Description: Just the facts -- web page content extraction
Created: 2017-08-04 22:56:31.0
Updated: 2017-08-09 15:30:33.0
Pushed: 2018-04-24 04:51:17.0
Homepage: null
Size: 347042
Language: Python
Dragnet isn't interested in the shiny chrome or boilerplate dressing of a web page. It's interested in… 'just the facts.' The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.
For more information on our approach check out:
This project was originally inspired by Kohlschütter et al., Boilerplate Detection using Shallow Text Features, and Weninger et al., CETR – Content Extraction with Tag Ratios, and more recently by Readability.
Depending on your use case, we provide two separate functions: one extracts just the main article content, the other extracts the content and any user-generated comments. Each function takes an HTML string and returns the content string.
import requests
from dragnet import extract_content, extract_content_and_comments
# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)
# get main article without comments
content = extract_content(r.content)
# get article and comments
content_comments = extract_content_and_comments(r.content)
We also provide a sklearn-style extractor class (complete with fit and predict methods). You can either train an extractor yourself or load a pre-trained one:
from dragnet.util import load_pickled_model
content_extractor = load_pickled_model(
    'kohlschuetter_readability_weninger_content_model.pkl.gz')
content_comments_extractor = load_pickled_model(
    'kohlschuetter_readability_weninger_comments_content_model.pkl.gz')
content = content_extractor.extract(r.content)
content_comments = content_comments_extractor.extract(r.content)
If you know the encoding of the document (e.g. from HTTP headers), you can pass it down to the parser:
content = content_extractor.extract(html_string, encoding='utf-8')
Otherwise, we try to guess the encoding from a meta tag or a specified <?xml encoding=".."?> tag. If that fails, we assume "UTF-8".
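The fallback order described above can be illustrated with a small standalone sketch. This is not Dragnet's internal code: the helper names (charset_from_header, guess_encoding), the regular expressions, the 2048-byte search window, and the choice to check the XML declaration before the meta tag are all assumptions made for illustration.

```python
import re
from email.message import Message

# Hypothetical patterns; real-world parsers handle many more edge cases.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
XML_ENCODING = re.compile(rb'<\?xml[^>]*encoding\s*=\s*["\']([\w-]+)["\']', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def guess_encoding(html_bytes, content_type=None, default='utf-8'):
    """Pick an encoding: HTTP header, then XML declaration or meta tag, then default."""
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    head = html_bytes[:2048]  # assumption: declarations sit near the top
    for pattern in (XML_ENCODING, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default
```

With requests, the header value would typically come from r.headers.get('Content-Type'), and the result could then be passed as the encoding argument shown earlier.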
Dragnet is written in Python (developed with 2.7, with support recently added for 3) and built on the numpy/scipy/Cython numerical computing environment. In addition we use lxml (libxml2) for HTML parsing.
We recommend installing from the master branch to ensure you have the latest version.
Installing via Vagrant is the easiest method: it builds a Vagrant virtual machine with Dragnet and its dependencies.
git clone git@github.com:seomoz/dragnet.git
cd dragnet
vagrant up
vagrant ssh
# these should now pass
make test
libxml2). We use provision.sh to provision the Vagrant VM, so you can use it as a template and modify it as appropriate for your operating system.
git clone git@github.com:seomoz/dragnet.git
sudo pip install -r dragnet/requirements.txt
cd dragnet
sudo make install
# these should now pass
make test
We love contributions! Open an issue, or fork/create a pull request.
The Extractor class encapsulates a blockifier, some feature extractors, and a machine learning model.
A blockifier implements blockify, which takes an HTML string and returns a list of block objects. A feature extractor is a callable that takes a list of blocks and returns a numpy array of features with shape (len(blocks), nfeatures). There is some additional optional functionality to "train" the feature (e.g. estimate parameters needed for centering), specified in features.py. The machine learning model implements the scikit-learn interface (predict and fit) and is used to compute the content/no-content prediction for each block.
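To make the three roles concrete, here is a deliberately tiny sketch of a blockifier, a feature callable, and a fit/predict model wired together. Every name in it (ToyBlockifier, toy_features, ThresholdModel) is hypothetical and far simpler than Dragnet's real classes. It uses plain lists where Dragnet returns a numpy array, and a hand-rolled threshold where Dragnet uses a scikit-learn model.

```python
import re


class ToyBlockifier:
    """Splits HTML into text blocks at a few block-level tags."""
    SPLIT = re.compile(r'</?(?:p|div|li|h[1-6])\b[^>]*>', re.IGNORECASE)
    TAG = re.compile(r'<[^>]+>')

    @classmethod
    def blockify(cls, html):
        blocks = []
        for chunk in cls.SPLIT.split(html):
            n_links = len(re.findall(r'<a\b', chunk, re.IGNORECASE))
            text = ' '.join(cls.TAG.sub(' ', chunk).split())
            if text:
                blocks.append({'text': text, 'n_links': n_links})
        return blocks


def toy_features(blocks):
    """Map a list of blocks to rows of (word count, links per word)."""
    rows = []
    for block in blocks:
        n_words = len(block['text'].split())
        rows.append([n_words, block['n_links'] / n_words])
    return rows  # shape (len(blocks), 2)


class ThresholdModel:
    """Minimal fit/predict model: long, link-poor blocks count as content."""

    def fit(self, X, y):
        # Word-count cut-off: midpoint between the two class means.
        pos = [row[0] for row, label in zip(X, y) if label == 1]
        neg = [row[0] for row, label in zip(X, y) if label == 0]
        self.min_words_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [int(row[0] >= self.min_words_ and row[1] < 0.5) for row in X]


html = ('<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>Dragnet extracts the main article content from a web page.</p>')
blocks = ToyBlockifier.blockify(html)
X = toy_features(blocks)
model = ThresholdModel().fit(X, [0, 1])  # labels: navigation=0, article=1
content = [b['text'] for b, keep in zip(blocks, model.predict(X)) if keep]
```

The point of the separation is that any of the three parts can be swapped independently, which is what the training section below relies on when it picks a feature list and a classifier.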
The training and test data is available at dragnet_data.
Download the training data (see above). In what follows, ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
Create the block corrected files needed to do supervised learning on the block level.
First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:
from dragnet.data_processing import extract_all_gold_standard_data
rootdir = '/path/to/dragnet_data/'
extract_all_gold_standard_data(rootdir)
This solves the longest common subsequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse an HTML document. In this case, remove the offending document and restart the process.
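As a rough illustration of the idea, the sketch below aligns the tokens of a page's blocks against the gold-standard text with a longest common subsequence and then labels each block by the share of its tokens that were matched. The function names, the whitespace tokenization, and the 0.5 threshold are assumptions for this example, not Dragnet's actual implementation.

```python
def lcs_mask(block_tokens, gold_tokens):
    """Return a 0/1 flag per block token marking membership in one LCS."""
    n, m = len(block_tokens), len(gold_tokens)
    # table[i][j] = LCS length of block_tokens[i:] and gold_tokens[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if block_tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table forward to recover which tokens were matched.
    mask = [0] * n
    i = j = 0
    while i < n and j < m:
        if block_tokens[i] == gold_tokens[j]:
            mask[i] = 1
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask


def label_blocks(blocks, gold_text, threshold=0.5):
    """Label a block 1 if at least `threshold` of its tokens are in the LCS."""
    tokens = [tok for block in blocks for tok in block.split()]
    mask = lcs_mask(tokens, gold_text.split())
    labels, start = [], 0
    for block in blocks:
        size = len(block.split())
        hits = sum(mask[start:start + size])
        labels.append(int(size > 0 and hits / size >= threshold))
        start += size
    return labels
```

For example, label_blocks(['Home About', 'Dragnet extracts content', 'Share'], 'Dragnet extracts content') marks only the middle block as content. The table costs O(n·m) memory, so very long documents would need a more economical variant.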
Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
For example, to train the randomized decision tree classifier from sklearn using the shallow text features from Kohlschuetter et al., the CETR features from Weninger et al., and the readability features:
from dragnet.blocks import TagCountNoCSSReadabilityBlockifier
from dragnet.extractor import Extractor
from dragnet.model_training import train_model
from sklearn.ensemble import ExtraTreesClassifier
rootdir = '/path/to/dragnet_data/'
features = ['kohlschuetter', 'weninger', 'readability']
to_extract = 'both'  # or 'content'
model = ExtraTreesClassifier(
    n_estimators=10,
    max_features=None,
    min_samples_leaf=75
)
base_extractor = Extractor(TagCountNoCSSReadabilityBlockifier,
    features=features,
    to_extract=to_extract,
    model=model
)
extractor = train_model(base_extractor, rootdir)
This trains the model and, if a value is passed to output_dir, writes a pickled version of it along with some block-level classification errors to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
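The k-fold cross validation mentioned earlier for model selection can be sketched without any library support. The helpers below (k_fold_indices, cross_validate) are hypothetical names; the train_and_score callback stands in for whatever training and scoring routine you use, and how it plugs into Dragnet's own training functions is left as an assumption.

```python
import random


def k_fold_indices(n_items, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; each item is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        valid = folds[held_out]
        train = [idx for f, fold in enumerate(folds)
                 if f != held_out for idx in fold]
        yield train, valid


def cross_validate(items, train_and_score, k=5):
    """Average a user-supplied train_and_score(train, valid) over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(items), k):
        train = [items[i] for i in train_idx]
        valid = [items[i] for i in valid_idx]
        scores.append(train_and_score(train, valid))
    return sum(scores) / len(scores)
```

Running cross_validate once per candidate configuration (feature list, classifier hyperparameters) and keeping the best average score is one simple way to do the selection before retraining on all of the training data.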
Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_models.
As a last step, test the performance of the model on the test set (see below).
Use evaluate_model_predictions in model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model run:
from dragnet.compat import train_test_split
from dragnet.data_processing import prepare_all_data
from dragnet.model_training import evaluate_model_predictions
rootdir = '/path/to/dragnet_data/'
data = prepare_all_data(rootdir)
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
test_blocks, test_labels, test_weights = extractor.concatenate_data(test_data)
train_blocks, train_labels, train_weights = extractor.concatenate_data(training_data)
extractor.fit(train_blocks, train_labels, weights=train_weights)
predictions = extractor.predict(test_blocks)
scores = evaluate_model_predictions(test_labels, predictions)
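For readers who want to see what these four numbers mean, here is a minimal standalone sketch of the metrics for binary block labels. The function name binary_scores is hypothetical, and weighting each block by its token count to approximate token-level scores is an assumption of this example rather than a description of evaluate_model_predictions.

```python
def binary_scores(y_true, y_pred, weights=None):
    """Weighted accuracy, precision, recall and F1 for 0/1 labels."""
    if weights is None:
        weights = [1] * len(y_true)
    tp = fp = fn = tn = 0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if truth and pred:
            tp += w  # content block correctly kept
        elif pred:
            fp += w  # boilerplate wrongly kept
        elif truth:
            fn += w  # content wrongly dropped
        else:
            tn += w  # boilerplate correctly dropped
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': (tp + tn) / total if total else 0.0,
            'precision': precision, 'recall': recall, 'f1': f1}


# Four blocks with assumed token counts 10, 5, 3 and 2.
example = binary_scores([1, 1, 0, 0], [1, 0, 0, 1], weights=[10, 5, 3, 2])
```

Weighting matters here because a misclassified long article block costs far more extracted text than a misclassified two-word navigation link.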