ContinuumIO/scrapy_scrapers

Name: scrapy_scrapers

Owner: Continuum Analytics, Inc.

Description: Scraper built with Scrapy.

Created: 2015-09-14 15:53:59.0

Updated: 2017-09-17 21:13:24.0

Pushed: 2016-01-11 22:29:40.0

Homepage: null

Size: 204

Language: HTML


README

Scraper

This project is designed to handle the broad issues involved in scraping the web for content. It uses scrapy, and may in the future integrate other scrapers like BeautifulSoup and grab.

To begin scraping, create an instance of one of the scraper classes. There are currently three: BodyScraper, LinkScraper, and CustomScraper. Only CustomScraper requires you to supply the parser_string and parser_dict arguments.

Here is some example code, to be run from the scrapy_scrapers/src directory.

import scrapy

from spiders.scrapers import CustomScraper


scraper = CustomScraper(
    index="reddit",
    start_urls=[
        "http://www.reddit.com/"
    ],
    parser_string="//div[contains(@class,'entry unvoted')]/p[contains(@class,'title')]",
    parser_dict={
        "title": "a/text()",
        "link": "a/@href",
    },
)

scraper.start()

This scraper will crawl Reddit and collect all of the post titles and links to content. Currently the custom scraper requires a parser_string as a starting point and a parser_dict naming the items to be scraped.
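The likely mechanism is that parser_string selects a set of container nodes, and each value in parser_dict is a relative path evaluated against every container. Here is a minimal, standard-library-only sketch of that idea; the real CustomScraper uses Scrapy's XPath engine, and the markup and field names below are purely illustrative (ElementTree only supports a limited XPath subset, so text()/@href expressions are replaced with .text and .get()):

```python
# Illustrative sketch: parser_string picks containers,
# parser_dict values are resolved relative to each container.
import xml.etree.ElementTree as ET

html = """
<root>
  <entry><title><a href="/a">First</a></title></entry>
  <entry><title><a href="/b">Second</a></title></entry>
</root>
"""

parser_string = ".//entry/title"   # container selector
parser_dict = {"link": "a"}        # relative selector per field

root = ET.fromstring(html)
items = []
for node in root.findall(parser_string):
    a = node.find(parser_dict["link"])
    items.append({"title": a.text, "link": a.get("href")})

print(items)
# → [{'title': 'First', 'link': '/a'}, {'title': 'Second', 'link': '/b'}]
```

Each container yields one scraped item, which is why the Reddit example above produces one title/link pair per post.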

The scrapers currently only support parsers built on Scrapy's XPath tools, though there are plans to include other parsers in the future. For now, using this tool effectively requires familiarity with XPath.

Requirements:


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.