Parsers package

Submodules

Parsers.NYTimes module

class Parsers.NYTimes.NYTimes

Bases: Parsers.Parsers

static get_article_publish_date(webpage)

Parses webpage to return the date the article was published

Parameters: webpage
Returns: Article publish date
Return type: DateTime object

static get_article_publisher(webpage, url)

Parses webpage and/or url to return the publisher of an article

Parameters:
  • webpage
  • url
Returns: Article publisher, e.g. “The New York Times”
Return type: str
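As a rough illustration of the url half of this lookup, the sketch below maps a URL's hostname to a publisher name. The `KNOWN_PUBLISHERS` table and the `publisher_from_url` helper are hypothetical and are not part of the Parsers package; the documented method may equally rely on the webpage contents.

```python
from urllib.parse import urlparse

# Hypothetical lookup table; not taken from the Parsers package.
KNOWN_PUBLISHERS = {
    "nytimes.com": "The New York Times",
    "washingtonpost.com": "The Washington Post",
}


def publisher_from_url(url):
    """Return a publisher name for a URL, or None if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    # Drop a leading "www." so "www.nytimes.com" matches "nytimes.com".
    if host.startswith("www."):
        host = host[len("www."):]
    return KNOWN_PUBLISHERS.get(host)
```

Returning None for an unknown host leaves the caller free to fall back to parsing the webpage itself.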

static get_article_section(webpage, url)

Parses webpage and/or url to return the list of sections and subsections that contain the article

Parameters:
  • webpage
  • url
Returns: List of section names, ordered from the narrowest section to the broadest
Return type: list
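The ordering of the returned list can be illustrated with a small sketch that derives sections from the URL path alone. The `sections_from_url` helper and its rules (the last path segment is the article slug, numeric segments are date parts) are assumptions made for this example, not the behaviour of the Parsers package.

```python
from urllib.parse import urlparse


def sections_from_url(url):
    """Return path segments that look like section names, narrowest first."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption: the last segment is the article slug, and purely numeric
    # segments are date parts (year, month, day) rather than sections.
    candidates = [s for s in segments[:-1] if not s.isdigit()]
    # The URL lists the broadest section first, so reverse the order.
    return list(reversed(candidates))
```

For a path such as `/2020/01/15/world/europe/some-article.html`, this sketch yields `["europe", "world"]`: the subsection comes before the section that contains it.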

static get_article_sources(webpage)

Parses webpage to extract all sources from an article

Parameters: webpage
Returns: List of sources, typically their URLs
Return type: list

static get_article_text(webpage)

Parses webpage to return the full plaintext of the article

Parameters: webpage
Returns: Plain text of the article
Return type: str

static get_article_title(webpage)

Parses webpage to return the title/headline of an article

Parameters: webpage
Returns: Article headline
Return type: str

Parsers.TheGuardian module

class Parsers.TheGuardian.TheGuardian

Bases: Parsers.Parsers

static get_article_publish_date(webpage)

Parses webpage to return the date the article was published

Parameters: webpage
Returns: Article publish date
Return type: DateTime object

static get_article_section(webpage, url)

Parses webpage and/or url to return the list of sections and subsections that contain the article

Parameters:
  • webpage
  • url
Returns: List of section names, ordered from the narrowest section to the broadest
Return type: list

static get_article_sources(webpage)

Parses webpage to extract all sources from an article

Parameters: webpage
Returns: List of sources, typically their URLs
Return type: list

static get_article_text(webpage)

Parses webpage to return the full plaintext of the article

Parameters: webpage
Returns: Plain text of the article
Return type: str
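To make "full plaintext" concrete, here is a minimal sketch that collects the text of `<p>` elements with the standard library's `html.parser`. It assumes webpage is an HTML string and that the article body lives in paragraph tags; `ParagraphTextExtractor` and `article_text` are hypothetical names, and the real parsers may use a different library and site-specific selectors.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collects the text found inside <p> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._current = []    # text pieces of the paragraph being read
        self.paragraphs = []  # finished paragraphs, in document order

    def flush_paragraph(self):
        """Store the paragraph read so far, if it contains any text."""
        text = "".join(self._current).strip()
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # tolerate a <p> that was never closed
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._current.append(data)


def article_text(webpage):
    """Return paragraph text from an HTML string, one paragraph per line."""
    extractor = ParagraphTextExtractor()
    extractor.feed(webpage)
    extractor.close()
    extractor.flush_paragraph()  # keep a trailing, unclosed paragraph
    return "\n".join(extractor.paragraphs)
```

Text outside paragraphs, such as headings and script contents, is ignored by this sketch, which is why a real parser usually needs per-site rules.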

Parsers.TheIndependent module

class Parsers.TheIndependent.TheIndependent

Bases: Parsers.Parsers

static get_article_author(webpage)

Parses webpage to return the author of the article

Parameters: webpage
Returns: Author of the article
Return type: str

static get_article_section(webpage, url)

Parses webpage and/or url to return the list of sections and subsections that contain the article

Parameters:
  • webpage
  • url
Returns: List of section names, ordered from the narrowest section to the broadest
Return type: list

static get_article_sources(webpage)

Parses webpage to extract all sources from an article

Parameters: webpage
Returns: List of sources, typically their URLs
Return type: list

static get_article_subtitle(webpage)

Parses webpage to return the subtitle of an article

Parameters: webpage
Returns: Article subtitle
Return type: str

static get_article_text(webpage)

Parses webpage to return the full plaintext of the article

Parameters: webpage
Returns: Plain text of the article
Return type: str

Parsers.WashingtonPost module

class Parsers.WashingtonPost.WashingtonPost

Bases: Parsers.Parsers

static get_article_author(webpage)

Parses webpage to return the author of the article

Parameters: webpage
Returns: Author of the article
Return type: str

static get_article_publish_date(webpage)

Parses webpage to return the date the article was published

Parameters: webpage
Returns: Article publish date
Return type: DateTime object

static get_article_publisher(webpage, url)

Parses webpage and/or url to return the publisher of an article

Parameters:
  • webpage
  • url
Returns: Article publisher, e.g. “The New York Times”
Return type: str

static get_article_section(webpage, url)

Parses webpage and/or url to return the list of sections and subsections that contain the article

Parameters:
  • webpage
  • url
Returns: List of section names, ordered from the narrowest section to the broadest
Return type: list

static get_article_sources(webpage)

Parses webpage to extract all sources from an article

Parameters: webpage
Returns: List of sources, typically their URLs
Return type: list

static get_article_text(webpage)

Parses webpage to return the full plaintext of the article

Parameters: webpage
Returns: Plain text of the article
Return type: str

static get_article_title(webpage)

Parses webpage to return the title/headline of an article

Parameters: webpage
Returns: Article headline
Return type: str

Module contents

class Parsers.Parsers

Bases: object

static get_article_author(webpage)

Parses webpage to return the author of the article

Parameters: webpage
Returns: Author of the article
Return type: str

static get_article_publish_date(webpage)

Parses webpage to return the date the article was published

Parameters: webpage
Returns: Article publish date
Return type: DateTime object

static get_article_publisher(webpage, url)

Parses webpage and/or url to return the publisher of an article

Parameters:
  • webpage
  • url
Returns: Article publisher, e.g. “The New York Times”
Return type: str

static get_article_section(webpage, url)

Parses webpage and/or url to return the list of sections and subsections that contain the article

Parameters:
  • webpage
  • url
Returns: List of section names, ordered from the narrowest section to the broadest
Return type: list

static get_article_sources(webpage)

Parses webpage to extract all sources from an article

Parameters: webpage
Returns: List of sources, typically their URLs
Return type: list

static get_article_subtitle(webpage)

Parses webpage to return the subtitle of an article

Parameters: webpage
Returns: Article subtitle
Return type: str

static get_article_text(webpage)

Parses webpage to return the full plaintext of the article

Parameters: webpage
Returns: Plain text of the article
Return type: str

static get_article_title(webpage)

Parses webpage to return the title/headline of an article

Parameters: webpage
Returns: Article headline
Return type: str

url_recognized(url)

Checks whether this parser can parse a given URL

Parameters: url – URL to check
Returns: True if this parser can parse the given URL
Return type: bool
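One way a caller might use url_recognized is to pick, from several parsers, the first one that accepts a URL. The sketch below uses stand-in classes because the real matching logic is not documented here: `SketchParser`, its `domain` attribute, the two subclasses, and `find_parser` are all hypothetical.

```python
from urllib.parse import urlparse


class SketchParser:
    """Hypothetical stand-in for Parsers.Parsers; not the real class."""

    domain = ""  # hypothetical attribute naming the site this parser handles

    def url_recognized(self, url):
        # Assumption: a parser recognizes a URL by its hostname.
        host = (urlparse(url).hostname or "").lower()
        return host == self.domain or host.endswith("." + self.domain)


class SketchNYTimes(SketchParser):
    domain = "nytimes.com"


class SketchGuardian(SketchParser):
    domain = "theguardian.com"


def find_parser(url, parsers):
    """Return the first parser that recognizes the URL, or None."""
    for parser in parsers:
        if parser.url_recognized(url):
            return parser
    return None
```

With `[SketchNYTimes(), SketchGuardian()]` as the candidate list, a theguardian.com URL selects the second parser, and a URL from an unknown site yields None.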