particle-iot/node-simplecrawler

Name: node-simplecrawler

Owner: Particle

Description: Flexible event driven crawler for node.

Forked from: simplecrawler/simplecrawler

Created: 2016-02-16 18:04:02

Updated: 2016-02-16 18:04:03

Pushed: 2018-01-03 18:35:52

Homepage:

Size: 435

Language: JavaScript

GitHub Committers

User | Most Recent Commit | # Commits

Other Committers

User | Email | Most Recent Commit | # Commits

README

Simple web-crawler for Node.js


Simplecrawler is designed to provide the most basic possible API for crawling websites, while being as flexible and robust as possible. I wrote simplecrawler to archive, analyse, and search some very large websites. It has happily chewed through 50,000 pages and written tens of gigabytes to disk without issue.

Example (simple mode)
var Crawler = require("simplecrawler");

Crawler.crawl("http://example.com/")
    .on("fetchcomplete", function(queueItem) {
        console.log("Completed fetching resource:", queueItem.url);
    });
What does simplecrawler do?
Installation
npm install simplecrawler
Getting Started

There are two ways of instantiating a new crawler - a simple but less flexible method inspired by anemone, and the traditional method which provides a little more room to configure crawl parameters.

Regardless of whether you use the simple or traditional methods of instantiation, you'll need to require simplecrawler:

var Crawler = require("simplecrawler");
Simple Mode

Simple mode generates a new crawler for you, preconfigures it based on a URL you provide, and returns it so you can attach event handlers and adjust the configuration further.

Simply call Crawler.crawl, with a URL as the first parameter and two optional functions that will be added as event listeners for fetchcomplete and fetcherror respectively.

Crawler.crawl("http://example.com/", function(queueItem) {
    console.log("Completed fetching resource:", queueItem.url);
});

Alternatively, if you decide to omit these functions, you can use the returned crawler object to add the event listeners yourself and tweak configuration options:

var crawler = Crawler.crawl("http://example.com/");

crawler.interval = 500;

crawler.on("fetchcomplete", function(queueItem) {
    console.log("Completed fetching resource:", queueItem.url);
});

Advanced Mode

The alternative method of creating a crawler is to call the simplecrawler constructor yourself, and to initiate the crawl manually.

var myCrawler = new Crawler("www.example.com");

Nonstandard port? HTTPS? Want to start archiving a specific path? No problem:

myCrawler.initialPath = "/archive";
myCrawler.initialPort = 8080;
myCrawler.initialProtocol = "https";

// Or:
var myCrawler = new Crawler("www.example.com", "/archive", 8080);

And of course, you probably want to make sure you don't take down your web server. Decrease the concurrency from the default of five simultaneous requests, and increase the request interval from the default 250 ms, like this:

myCrawler.interval = 10000; // Ten seconds
myCrawler.maxConcurrency = 1;

You can also define a max depth for links to fetch:

myCrawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
myCrawler.maxDepth = 2; // First page and discovered links from it are fetched
// Or:
myCrawler.maxDepth = 3; // Etc.

For brevity, you may also specify the initial path and request interval when creating the crawler:

var myCrawler = new Crawler("www.example.com", "/", 8080, 300);
Running the crawler

First, you'll need to set up an event listener to get the fetched data:

myCrawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
    console.log("It was a resource of type %s", response.headers['content-type']);

    // Do something with the data in responseBuffer
});

Then, when you're satisfied you're ready to go, start the crawler! It'll run through its queue finding linked resources on the domain to download, until it can't find any more.

myCrawler.start();

Of course, once you've got that down pat, there's a fair bit more you can listen for…

Events
A note about HTTP error conditions

By default, simplecrawler does not download the response body when it encounters an HTTP error status in the response. If you need this information, you can listen to simplecrawler's error events, and through node's native data event (response.on("data",function(chunk) {...})) you can save the information yourself.
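For example, here's a minimal sketch of retaining an error response body yourself. It assumes the fetcherror event is emitted with the queue item and node's response object, and that the crawler is called myCrawler:

myCrawler.on("fetcherror", function(queueItem, response) {
    var chunks = [];

    // Collect the body of the error response as it arrives
    response.on("data", function(chunk) {
        chunks.push(chunk);
    });

    response.on("end", function() {
        var errorBody = Buffer.concat(chunks);
        console.log("Got %d bytes of error page for %s", errorBody.length, queueItem.url);
    });
});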

If this is annoying, and you'd really like to retain error pages by default, let me know. I didn't include it because I didn't need it - but if it's important to people I might put it back in. :)

Waiting for Asynchronous Event Listeners

Sometimes, you might want simplecrawler to wait for you while you perform some asynchronous tasks in an event listener, instead of having it race off and fire the complete event, halting your crawl. For example, you might be doing your own link discovery using an asynchronous library method.

Simplecrawler provides a wait method you can call at any time. It is available via this from inside listeners, and on the crawler object itself. It returns a callback function.

Once you've called this method, simplecrawler will not fire the complete event until either you execute the callback it returns, or a timeout is reached (configured in crawler.listenerTTL, by default 10000 ms.)

Example Asynchronous Event Listener
crawler.on("fetchcomplete", function(queueItem, data, res) {
    var resume = this.wait(); // "continue" is a reserved word in JavaScript

    doSomeDiscovery(data, function(foundURLs) {
        foundURLs.forEach(crawler.queueURL.bind(crawler));
        resume();
    });
});

Configuring the crawler

Here's a complete list of what you can tweak at this stage:
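As a quick illustration, here's a sketch that sets a handful of the options covered elsewhere in this README (the values are arbitrary):

var myCrawler = new Crawler("www.example.com");

myCrawler.interval = 1000;        // Wait one second between requests
myCrawler.maxConcurrency = 2;     // At most two requests in flight at once
myCrawler.maxDepth = 3;           // Don't follow links more than three hops deep
myCrawler.filterByDomain = true;  // Stay on www.example.com
myCrawler.acceptCookies = true;   // Keep and resend cookies (the default)
myCrawler.listenerTTL = 10000;    // How long to wait for async listeners (ms)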

Excluding certain resources from downloading

Simplecrawler has a mechanism you can use to prevent certain resources from being fetched, based on the URL, called Fetch Conditions. A fetch condition is just a function, which, when given a parsed URL object, will return a boolean that indicates whether a given resource should be downloaded.

You may add as many fetch conditions as you like, and remove them at runtime. Simplecrawler will evaluate every single condition against every queued URL, and should just one of them return a falsy value (this includes null and undefined, so remember to always return a value!) then the resource in question will not be fetched.

Adding a fetch condition

This example fetch condition prevents URLs ending in .pdf from being downloaded. Adding a fetch condition assigns it an ID, which the addFetchCondition function returns. You can use this ID to remove the condition later.

var conditionID = myCrawler.addFetchCondition(function(parsedURL, queueItem) {
    return !parsedURL.path.match(/\.pdf$/i);
});

Fetch conditions are called with two arguments: parsedURL and queueItem. parsedURL is the resource to be fetched (or not) and has the following structure:


{
    protocol: "http",
    host: "example.com",
    port: 80,
    path: "/search?q=hello",
    uriPath: "/search",
    depth: 2
}

queueItem is a representation of the page where this resource was found. It looks like this:


{
    url: "http://example.com/index.php",
    protocol: "http",
    host: "example.com",
    port: 80,
    path: "/index.php",
    depth: 1,
    fetched: true,
    status: "downloaded",
    stateData: {...}
}

This information enables you to write sophisticated logic about which pages to fetch and which to avoid. You could, for example, implement a link checker that checks links to external sites as well as your own, but doesn't continue crawling those external sites: set filterByDomain to false and only allow an external URL to be fetched when queueItem.host (the page it was found on) matches crawler.host, as sketched below.
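A minimal sketch of that idea, assuming the crawler is called myCrawler and that you only want to fetch (not crawl) the external pages:

myCrawler.filterByDomain = false;

myCrawler.addFetchCondition(function(parsedURL, queueItem) {
    // Always fetch resources on the site we're actually checking
    if (parsedURL.host === myCrawler.host) {
        return true;
    }

    // Fetch external URLs only when they were discovered on one of our own
    // pages, so outbound links get checked without crawling external sites
    return queueItem.host === myCrawler.host;
});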

Removing a fetch condition

If you stored the ID of the fetch condition you added earlier, you can remove it from the crawler:

myCrawler.removeFetchCondition(conditionID);
Excluding resources based on robots.txt

Simplecrawler purposely doesn't come with any built in support for parsing robots.txt rules. Adding support manually is very straightforward using fetch conditions however, and in examples/robots-txt-example.js you'll find an example that makes use of the robots-parser module to do just that.
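As a rough illustration (not the bundled example), a fetch condition could consult robots-parser like this. It assumes you've already downloaded the robots.txt contents into robotsTxtBody before starting the crawl, and that the crawler is called myCrawler:

var robotsParser = require("robots-parser"); // npm install robots-parser

// Build a parser from the robots.txt you fetched beforehand
var robots = robotsParser("http://example.com/robots.txt", robotsTxtBody);

myCrawler.addFetchCondition(function(parsedURL) {
    var url = parsedURL.protocol + "://" + parsedURL.host + parsedURL.path;

    // Treat anything not explicitly disallowed as fetchable
    return robots.isAllowed(url, "my-crawler/1.0") !== false;
});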

The Simplecrawler Queue

Simplecrawler has a queue like any other web crawler. It can be directly accessed at crawler.queue (assuming you called your Crawler() object crawler.) It provides array access, so you can get to queue items just with array notation and an index.

crawler.queue[5];

For compatibility with different backing stores, it now provides an alternate interface which the crawler core makes use of:

crawler.queue.get(5);

It's not just an array though.

Adding to the queue

The simplest way to add to the queue is to use the crawler's own method, crawler.queueURL. This method takes a complete URL, validates and deconstructs it, and adds it to the queue.

If you instead want to add a resource by its components, you may call the queue.add method directly:

crawler.queue.add(protocol, hostname, port, path);

That's it! It's basically just a URL, but comma separated (that's how you can remember the order.)
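For example, a quick sketch of both approaches (the URL is arbitrary):

// Let the crawler validate and deconstruct a complete URL for you
crawler.queueURL("http://example.com/archive/page-2.html");

// Or add the same resource by its components
crawler.queue.add("http", "example.com", 80, "/archive/page-2.html");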

Queue items

Because you'll constantly be handed queue items when working with simplecrawler, it helps to know what's inside them. These are the properties every queue item is expected to have (see the example queue item structure above):

You can address these properties like you would any other object:

crawler.queue[52].url;
queueItem.stateData.contentLength;
queueItem.status === "queued";

As you can see, you can get a lot of meta-information about each request. The upside is that the queue has some convenient functions for getting simple aggregate data about it…

Queue Statistics and Reporting

First of all, the queue can provide some basic statistics about the network performance of your crawl (so far.) This is done live, so don't check it thirty times a second. You can test the following properties:

And you can get the maximum, minimum, and average values for each with the crawler.queue.max, crawler.queue.min, and crawler.queue.avg functions respectively. Like so:

ole.log("The maximum request latency was %dms.", crawler.queue.max("requestLatency"));
ole.log("The minimum download time was %dms.", crawler.queue.min("downloadTime"));
ole.log("The average resource size received is %d bytes.", crawler.queue.avg("actualDataSize"));

You'll probably often need to determine how many items in the queue have a given status at any one time, and/or retrieve them. That's easy with crawler.queue.countWithStatus and crawler.queue.getWithStatus.

crawler.queue.countWithStatus returns the number of queued items with a given status, while crawler.queue.getWithStatus returns an array of the queue items themselves.

var redirectCount = crawler.queue.countWithStatus("redirected");

crawler.queue.getWithStatus("failed").forEach(function(queueItem) {
    console.log("Whoah, the request for %s failed!", queueItem.url);

    // do something...
});

Then there are some even simpler convenience functions:

Saving and reloading the queue (freeze/defrost)

You'll probably want to be able to save your progress and reload it later, if your application fails or you need to abort the crawl for some reason. (Perhaps you just want to finish off for the night and pick it up tomorrow!) The crawler.queue.freeze and crawler.queue.defrost functions perform this task.

A word of warning though: they are not CPU friendly, as they rely on JSON.parse and JSON.stringify. Use them only when you need to save the queue; don't call them on every request or your application's performance will be incredibly poor, because they block like crazy. That said, using them when your crawler commences and stops is perfectly reasonable.

Note that the methods themselves are asynchronous, so if you are going to exit the process after you do the freezing, make sure you wait for callback - otherwise you'll get an empty file.

// Freeze queue
crawler.queue.freeze("mysavedqueue.json", function() {
    process.exit();
});

// Defrost queue
crawler.queue.defrost("mysavedqueue.json");
Cookies

Simplecrawler now has an internal cookie jar, which collects and resends cookies automatically, and by default.

If you want to turn this off, set the crawler.acceptCookies option to false.

The cookie jar is accessible via crawler.cookies, and is an event emitter itself:
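For example, a minimal sketch:

// Disable automatic cookie handling entirely
crawler.acceptCookies = false;

// Or inspect the cookie jar directly
console.log(crawler.cookies);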

Cookie Events
Contributors

I'd like to extend sincere thanks to:

And everybody else who has helped out in some way! :)

Licence

Copyright (c) 2013, Christopher Giffard.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

