Scrapers package

Submodules

Scrapers.NYTimes module

class Scrapers.NYTimes.NYTimes

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
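
A rough usage sketch, assuming the static method can be called directly on the class and that a standard datetime object satisfies the DateTime parameter (the date below is purely illustrative):

    from datetime import datetime

    from Scrapers.NYTimes import NYTimes

    # Enumerate article URLs for one day; date=None is the documented default.
    urls = NYTimes.get_article_list(date=datetime(2017, 1, 18))
    for url in urls:
        print(url)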

Scrapers.TheGuardian module

class Scrapers.TheGuardian.TheGuardian

Bases: Scrapers.Scrapers

sections = ["uk-news", "world/europe-news", "world/americas", "world/asia", "world/middleeast", "world/africa", "australian-news", "cities", "global-development", "us-news", "us-news/us-politics", "politics", "uk/commentisfree", "us/commentisfree", "lifeandstyle/food-and-drink", "lifeandstyle/health-and-wellbeing", "lifeandstyle/love-and-sex", "lifeandstyle/family", "lifeandstyle/women", "lifeandstyle/home-and-garden", "fashion", "environment/climate-change", "environment/wildlife", "environment/energy", "environment/pollution", "uk/technology", "us/technology", "travel/uk", "travel/europe", "travel/us", "travel/skiing", "money/property", "money/savings", "money/pensions", "money/debt", "money/work-and-careers", "science"]

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
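
The sections attribute lists the Guardian section paths this scraper enumerates. A hedged sketch of turning those paths into section front-page URLs; the URL template is an assumption for illustration, not necessarily how the scraper builds its requests:

    from Scrapers.TheGuardian import TheGuardian

    # Assumed template for a section front page; the real enumeration logic
    # may instead use date-based index pages or an API.
    section_urls = ["https://www.theguardian.com/" + s for s in TheGuardian.sections]

    print(len(section_urls))  # 37 sections
    print(section_urls[0])    # https://www.theguardian.com/uk-news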

Scrapers.TheIndependent module

class Scrapers.TheIndependent.TheIndependent

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list

Scrapers.WashingtonPost module

class Scrapers.WashingtonPost.WashingtonPost

Bases: Scrapers.Scrapers

static get_article_list(date=None)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
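
All four subclasses expose the same static get_article_list interface, so they can be driven uniformly. A minimal sketch (the aggregation loop is illustrative and not part of the package; the date is arbitrary):

    from datetime import datetime

    from Scrapers.NYTimes import NYTimes
    from Scrapers.TheGuardian import TheGuardian
    from Scrapers.TheIndependent import TheIndependent
    from Scrapers.WashingtonPost import WashingtonPost

    date = datetime(2017, 1, 18)
    all_urls = []
    for scraper in (NYTimes, TheGuardian, TheIndependent, WashingtonPost):
        # Each subclass knows how to enumerate its own site's articles.
        all_urls.extend(scraper.get_article_list(date=date))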

Module contents

The Scrapers module is used to enumerate the articles to download, download each article, send it to the parser, and save the parsed data to a database.

Last Modified: 2017-01-18
Author: Michael Dombrowski
class Scrapers.Scrapers

Bases: object

get_article_data(webpage=None)
  • Uses the class url attribute to download the webpage
  • Parses the webpage using my_parser
  • Saves the parsed data into current_article (see the sketch below)
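
A hedged sketch of the calling convention those steps imply, assuming the url attribute is set before the call and that current_article then holds the parsed result; save_to_database is a hypothetical stand-in for the module's real persistence code:

    from Scrapers.NYTimes import NYTimes


    def save_to_database(article):
        """Hypothetical placeholder for the package's database-saving step."""
        print(article)


    scraper = NYTimes()
    for article_url in NYTimes.get_article_list():
        # Point the scraper at one article, then download and parse it.
        scraper.url = scraper.normalize_url(article_url)
        scraper.get_article_data()
        save_to_database(scraper.current_article)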
static get_article_list(date)

Returns a list of article URLs from the given scraper subclass.

Parameters: date (DateTime) – date of the articles to be enumerated, or None
Returns: list of URLs of articles to be analyzed
Return type: list
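
New sources are added by subclassing Scrapers and overriding get_article_list. A minimal sketch, assuming the class is importable from the package top level; the class name and returned URLs are made up for illustration:

    from Scrapers import Scrapers


    class ExampleSource(Scrapers):
        """Hypothetical scraper showing the interface a subclass must provide."""

        @staticmethod
        def get_article_list(date=None):
            # A real subclass would download and parse its site's index pages;
            # this stub just returns a fixed list.
            return ["https://news.example.com/story-one",
                    "https://news.example.com/story-two"]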
headers = {'DNT': '1', 'Connection': 'keep-alive', 'Referer': 'https://www.google.com/', 'Accept-Encoding': 'gzip, deflate', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Accept-Language': 'en-US,en;q = 0.8', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
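
The headers dictionary imitates a desktop browser so that requests are less likely to be refused by the news sites. A sketch of sending them with the requests library (used here only for illustration; the documentation does not state which HTTP client the package uses internally):

    import requests

    from Scrapers import Scrapers

    # Send the browser-like headers with an ordinary GET request.
    response = requests.get("https://www.nytimes.com/", headers=Scrapers.headers)
    print(response.status_code)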
static normalize_url(url)

Parses a given URL and normalizes it: lowercases it, forces the http scheme, and strips the leading "www.".

This is used to prevent duplication of articles caused by non-unique URLs.

Parameters: url – unnormalized URL
Returns: normalized URL
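
For example, two spellings of the same address collapse to one canonical form (the URLs are illustrative):

    from Scrapers import Scrapers

    # Mixed case, https, and a "www." prefix all normalize away, so the
    # same article is only recorded once.
    a = Scrapers.normalize_url("https://WWW.Example.com/News/Story-1")
    b = Scrapers.normalize_url("http://example.com/News/Story-1")
    assert a == b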
parse_sources(sources)
parse_text_sources(text)
url = ''