Scrapers package
Submodules
Scrapers.NYTimes module
class Scrapers.NYTimes.NYTimes
Bases: Scrapers.Scrapers

static get_article_list(date=None)
    Returns a list of article URLs from the given scraper subclass.
    Parameters: date (DateTime) – date of articles to be enumerated, or None
    Returns: list of URLs of articles to be analyzed
    Return type: list
Scrapers.TheGuardian module
class Scrapers.TheGuardian.TheGuardian
Bases: Scrapers.Scrapers

sections = ["uk-news", "world/europe-news", "world/americas", "world/asia", "world/middleeast", "world/africa", "australian-news", "cities", "global-development", "us-news", "us-news/us-politics", "politics", "uk/commentisfree", "us/commentisfree", "lifeandstyle/food-and-drink", "lifeandstyle/health-and-wellbeing", "lifeandstyle/love-and-sex", "lifeandstyle/family", "lifeandstyle/women", "lifeandstyle/home-and-garden", "fashion", "environment/climate-change", "environment/wildlife", "environment/energy", "environment/pollution", "uk/technology", "us/technology", "travel/uk", "travel/europe", "travel/us", "travel/skiing", "money/property", "money/savings", "money/pensions", "money/debt", "money/work-and-careers", "science"]
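The entries in sections are relative paths. As a sketch of how they could be expanded into full section URLs (the base URL here is an assumption for illustration, not taken from this documentation):

```python
# Sketch: expand Guardian section slugs into full section URLs.
# BASE is an assumed value; the scraper itself may build these differently.
BASE = "https://www.theguardian.com/"

sections = ["uk-news", "world/europe-news", "science"]  # abbreviated

# Slugs contain no leading slash, so simple concatenation suffices.
section_urls = [BASE + slug for slug in sections]
```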
static get_article_list(date=None)
    Returns a list of article URLs from the given scraper subclass.
    Parameters: date (DateTime) – date of articles to be enumerated, or None
    Returns: list of URLs of articles to be analyzed
    Return type: list
Scrapers.TheIndependent module
class Scrapers.TheIndependent.TheIndependent
Bases: Scrapers.Scrapers

static get_article_list(date=None)
    Returns a list of article URLs from the given scraper subclass.
    Parameters: date (DateTime) – date of articles to be enumerated, or None
    Returns: list of URLs of articles to be analyzed
    Return type: list
Scrapers.WashingtonPost module
class Scrapers.WashingtonPost.WashingtonPost
Bases: Scrapers.Scrapers

static get_article_list(date=None)
    Returns a list of article URLs from the given scraper subclass.
    Parameters: date (DateTime) – date of articles to be enumerated, or None
    Returns: list of URLs of articles to be analyzed
    Return type: list
Module contents
The Scrapers module retrieves the list of articles to download, downloads each article, sends it to the parser, and saves the parsed data to a database.
Last Modified: 2017-01-18
Author: Michael Dombrowski
class Scrapers.Scrapers
Bases: object
get_article_data(webpage=None)
    - Uses the class url variable to download the webpage
    - Parses the webpage using my_parser
    - Saves the parsed data into current_article
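A minimal sketch of those three steps, with a stand-in parser and no real network access (the bodies mirror the documented attribute names url, my_parser, and current_article, but are illustrative stand-ins, not the project's code):

```python
class Scraper:
    """Sketch of the documented get_article_data flow; not the real class."""
    url = ''

    def my_parser(self, webpage):
        # Stand-in parser: the real implementation extracts article data.
        return {'html': webpage}

    def get_article_data(self, webpage=None):
        # 1. Use the class url variable to download the webpage
        #    (download is skipped here when `webpage` is supplied).
        if webpage is None:
            webpage = self._download(self.url)
        # 2. Parse the webpage using my_parser
        parsed = self.my_parser(webpage)
        # 3. Save the parsed data into current_article
        self.current_article = parsed
        return self.current_article

    def _download(self, url):
        raise NotImplementedError  # network access elided in this sketch
```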
static get_article_list(date)
    Returns a list of article URLs from the given scraper subclass.
    Parameters: date (DateTime) – date of articles to be enumerated, or None
    Returns: list of URLs of articles to be analyzed
    Return type: list
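The contract each subclass fulfils can be sketched as follows (the class name and returned URLs are placeholders, not part of the package):

```python
class ExampleScraper:
    """Illustrative subclass shape: get_article_list is static and
    returns a list of URL strings for the given date (or None)."""

    @staticmethod
    def get_article_list(date=None):
        # A real subclass would enumerate the site's index pages for
        # `date`; these URLs are placeholders.
        return ['http://example.com/2017/01/18/story-one',
                'http://example.com/2017/01/18/story-two']
```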
headers = {'DNT': '1', 'Connection': 'keep-alive', 'Referer': 'https://www.google.com/', 'Accept-Encoding': 'gzip, deflate', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Accept-Language': 'en-US,en;q = 0.8', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
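These headers mimic a desktop Chrome browser so the download looks like ordinary traffic. A sketch of attaching them to a request with the standard library (the use of urllib here is an assumption; the project may use a different HTTP client):

```python
import urllib.request

# Abbreviated copy of the headers attribute above.
headers = {
    'DNT': '1',
    'Accept-Language': 'en-US,en;q = 0.8',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/55.0.2883.87 Safari/537.36'),
}

# No network traffic occurs until urlopen(req) is called.
req = urllib.request.Request('http://example.com/', headers=headers)
```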
static normalize_url(url)
    Parses a given URL and normalizes it to be all lowercase, use http, and not have "www.".
    This is used to prevent duplication of articles due to non-unique URLs.
    Parameters: url – unnormalized URL
    Returns: normalized URL
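A sketch of that normalization using urllib.parse, following the documented rules (lowercase, http scheme, no "www."); the project's actual implementation may differ:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lowercase the URL, force the http scheme, and strip a leading
    'www.' so near-duplicate URLs compare equal (sketch only)."""
    parts = urlsplit(url.lower())
    host = parts.netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    return urlunsplit(('http', host, parts.path, parts.query, parts.fragment))
```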
parse_sources(sources)

parse_text_sources(text)
url = ''