Scrapers package

Submodules

Scrapers.NYTimes module

class Scrapers.NYTimes.NYTimes

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
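
A rough usage sketch, assuming the static method can be called directly on the class and that a standard datetime object satisfies the DateTime parameter (the date below is purely illustrative):

    from datetime import datetime

    from Scrapers.NYTimes import NYTimes

    # Enumerate article URLs for one day; date=None is the documented default.
    urls = NYTimes.get_article_list(date=datetime(2017, 1, 18))
    for url in urls:
        print(url)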

Scrapers.TheGuardian module

class Scrapers.TheGuardian.TheGuardian

Bases: Scrapers.Scrapers

sections = ["uk-news", "world/europe-news", "world/americas", "world/asia", "world/middleeast", "world/africa", "australian-news", "cities", "global-development", "us-news", "us-news/us-politics", "politics", "uk/commentisfree", "us/commentisfree", "lifeandstyle/food-and-drink", "lifeandstyle/health-and-wellbeing", "lifeandstyle/love-and-sex", "lifeandstyle/family", "lifeandstyle/women", "lifeandstyle/home-and-garden", "fashion", "environment/climate-change", "environment/wildlife", "environment/energy", "environment/pollution", "uk/technology", "us/technology", "travel/uk", "travel/europe", "travel/us", "travel/skiing", "money/property", "money/savings", "money/pensions", "money/debt", "money/work-and-careers", "science"]

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
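
The sections attribute lists the Guardian section paths this scraper enumerates. A hedged sketch of turning those paths into section front-page URLs; the URL template is an assumption for illustration, not necessarily how the scraper builds its requests:

    from Scrapers.TheGuardian import TheGuardian

    # Assumed template for a section front page; the real enumeration logic
    # may instead use date-based index pages or an API.
    section_urls = ["https://www.theguardian.com/" + s for s in TheGuardian.sections]

    print(len(section_urls))  # 37 sections
    print(section_urls[0])    # https://www.theguardian.com/uk-news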

Scrapers.TheIndependent module

class Scrapers.TheIndependent.TheIndependent

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list

Scrapers.WashingtonPost module

class Scrapers.WashingtonPost.WashingtonPost

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
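
All four subclasses expose the same static get_article_list interface, so they can be driven uniformly. A minimal sketch (the aggregation loop is illustrative and not part of the package; the date is arbitrary):

    from datetime import datetime

    from Scrapers.NYTimes import NYTimes
    from Scrapers.TheGuardian import TheGuardian
    from Scrapers.TheIndependent import TheIndependent
    from Scrapers.WashingtonPost import WashingtonPost

    date = datetime(2017, 1, 18)
    all_urls = []
    for scraper in (NYTimes, TheGuardian, TheIndependent, WashingtonPost):
        # Each subclass knows how to enumerate its own site's articles.
        all_urls.extend(scraper.get_article_list(date=date))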

Module contents

The Scrapers module is used to enumerate the articles to download, download each article, send it to the parser, and save the parsed data to a database.

Last Modified: 2017-01-18
Author: Michael Dombrowski
class Scrapers.Scrapers

Bases: object

get_article_data(webpage=None)
  • Uses the class url attribute to download the webpage
  • Parses the webpage using my_parser
  • Saves the parsed data into current_article (see the sketch below)
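
A hedged sketch of the calling convention those steps imply, assuming the url attribute is set before the call and that current_article then holds the parsed result; save_to_database is a hypothetical stand-in for the module's real persistence code:

    from Scrapers.NYTimes import NYTimes


    def save_to_database(article):
        """Hypothetical placeholder for the package's database-saving step."""
        print(article)


    scraper = NYTimes()
    for article_url in NYTimes.get_article_list():
        # Point the scraper at one article, then download and parse it.
        scraper.url = scraper.normalize_url(article_url)
        scraper.get_article_data()
        save_to_database(scraper.current_article)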
static get_article_list(date)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
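
New sources are added by subclassing Scrapers and overriding get_article_list. A minimal sketch, assuming the class is importable from the package top level; the class name and returned URLs are made up for illustration:

    from Scrapers import Scrapers


    class ExampleSource(Scrapers):
        """Hypothetical scraper showing the interface a subclass must provide."""

        @staticmethod
        def get_article_list(date=None):
            # A real subclass would download and parse its site's index pages;
            # this stub just returns a fixed list.
            return ["https://news.example.com/story-one",
                    "https://news.example.com/story-two"]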
headers = {'DNT': '1', 'Connection': 'keep-alive', 'Referer': 'https://www.google.com/', 'Accept-Encoding': 'gzip, deflate', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Accept-Language': 'en-US,en;q = 0.8', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
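
The headers dictionary imitates a desktop browser so that requests are less likely to be refused by the news sites. A sketch of sending them with the requests library (used here only for illustration; the documentation does not state which HTTP client the package uses internally):

    import requests

    from Scrapers import Scrapers

    # Send the browser-like headers with an ordinary GET request.
    response = requests.get("https://www.nytimes.com/", headers=Scrapers.headers)
    print(response.status_code)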
static normalize_url(url)

Parses a given URL and normalizes it: lowercases it, forces the http scheme, and strips the leading "www.".

This is used to prevent duplication of articles caused by non-unique URLs.

Parameters: url – unnormalized URL
Returns: normalized URL
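
For example, two spellings of the same address collapse to one canonical form (the URLs are illustrative):

    from Scrapers import Scrapers

    # Mixed case, https, and a "www." prefix all normalize away, so the
    # same article is only recorded once.
    a = Scrapers.normalize_url("https://WWW.Example.com/News/Story-1")
    b = Scrapers.normalize_url("http://example.com/News/Story-1")
    assert a == b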
parse_sources(sources)
parse_text_sources(text)
url = ''