Parsers package

Submodules

Parsers.NYTimes module

class Parsers.NYTimes.NYTimes

Bases: Parsers.Parsers

static get_article_publish_date(webpage)

Parses the webpage and returns the date the article was published.

Parameters:	webpage – the article webpage to parse
Returns:	Article publish date
Return type:	DateTime object
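
As a rough illustration of what a publish-date parser might do, the sketch below reads an `article:published_time` meta tag from raw HTML using only the standard library. It assumes that `webpage` is an HTML string and that the page exposes this tag in ISO 8601 form; neither is confirmed by this documentation. `find_publish_date` is a hypothetical helper, not part of the Parsers package.

```python
from datetime import datetime
from html.parser import HTMLParser


class _MetaDateFinder(HTMLParser):
    """Remembers the content of the first article:published_time meta tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta"
                and attrs.get("property") == "article:published_time"
                and self.published is None):
            self.published = attrs.get("content")


def find_publish_date(webpage):
    """Hypothetical helper: return a datetime, or None if no tag is found."""
    finder = _MetaDateFinder()
    finder.feed(webpage)
    if finder.published is None:
        return None
    # fromisoformat accepts strings such as "2017-03-01T12:30:00+00:00".
    return datetime.fromisoformat(finder.published)


html = (
    '<html><head><meta property="article:published_time" '
    'content="2017-03-01T12:30:00+00:00"></head></html>'
)
print(find_publish_date(html))
```

A real parser would likely need site-specific fallbacks, since not every publisher emits this tag.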

static get_article_publisher(webpage, url)

Parses the webpage and/or URL and returns the publisher of the article.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	Article publisher, e.g. “The New York Times”
Return type:	str

static get_article_section(webpage, url)

Parses the webpage and/or URL and returns the list of sections and subsections the article belongs to.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	List of section names, ordered from narrowest to broadest
Return type:	list
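
The narrowest-to-broadest ordering can be illustrated with a URL-only sketch. It assumes a path layout like `/2017/03/01/world/europe/some-article.html`, where numeric segments are date parts and the last segment is the article slug; real sites may differ, and `sections_from_url` is a hypothetical helper rather than the package's implementation.

```python
from urllib.parse import urlparse


def sections_from_url(url):
    """Hypothetical helper: guess an article's sections from its URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Drop the trailing slug, then any purely numeric (date) segments.
    candidates = [s for s in segments[:-1] if not s.isdigit()]
    # The path runs broadest -> narrowest, so reverse it.
    return list(reversed(candidates))


print(sections_from_url(
    "https://www.example.com/2017/03/01/world/europe/some-article.html"
))
# ['europe', 'world']
```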

static get_article_sources(webpage)

Parses the webpage and extracts all sources from the article.

Parameters:	webpage – the article webpage to parse
Returns:	List of sources, typically the sources' URLs
Return type:	list
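
One plausible reading of "sources, typically URLs" is the set of outbound hyperlinks in the page. The sketch below collects absolute `http(s)` links from anchor tags, assuming `webpage` is an HTML string; `collect_source_links` is a hypothetical name, and the real parsers may restrict the search to the article body.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Gathers unique absolute links from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs, in-page anchors, and relative links.
        if not href or not href.startswith(("http://", "https://")):
            return
        if href not in self.links:
            self.links.append(href)


def collect_source_links(webpage):
    """Hypothetical helper: return the list of outbound links in the page."""
    collector = _LinkCollector()
    collector.feed(webpage)
    return collector.links
```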

static get_article_text(webpage)

Parses the webpage and returns the full plaintext of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Plaintext of the article
Return type:	str
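
A minimal sketch of plaintext extraction, assuming `webpage` is an HTML string and that the article body lives in `<p>` elements. `extract_plaintext` is a hypothetical helper; a production parser would normally target site-specific containers to avoid picking up navigation or footer paragraphs.

```python
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends when the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside paragraphs (headings, scripts) is ignored.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_plaintext(webpage):
    """Hypothetical helper: join paragraph texts with blank lines."""
    parser = _ParagraphText()
    parser.feed(webpage)
    parser.close()   # process any text still buffered by the parser
    parser.flush()   # keep a final paragraph that was never closed
    return "\n\n".join(parser.paragraphs)
```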

static get_article_title(webpage)

Parses the webpage and returns the title (headline) of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Article headline
Return type:	str

Parsers.TheGuardian module

class Parsers.TheGuardian.TheGuardian

Bases: Parsers.Parsers

static get_article_publish_date(webpage)

Parses the webpage and returns the date the article was published.

Parameters:	webpage – the article webpage to parse
Returns:	Article publish date
Return type:	DateTime object

static get_article_section(webpage, url)

Parses the webpage and/or URL and returns the list of sections and subsections the article belongs to.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	List of section names, ordered from narrowest to broadest
Return type:	list

static get_article_sources(webpage)

Parses the webpage and extracts all sources from the article.

Parameters:	webpage – the article webpage to parse
Returns:	List of sources, typically the sources' URLs
Return type:	list

static get_article_text(webpage)

Parses the webpage and returns the full plaintext of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Plaintext of the article
Return type:	str

Parsers.TheIndependent module

class Parsers.TheIndependent.TheIndependent

Bases: Parsers.Parsers

static get_article_author(webpage)

Parses the webpage and returns the author of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Author of the article
Return type:	str

static get_article_section(webpage, url)

Parses the webpage and/or URL and returns the list of sections and subsections the article belongs to.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	List of section names, ordered from narrowest to broadest
Return type:	list

static get_article_sources(webpage)

Parses the webpage and extracts all sources from the article.

Parameters:	webpage – the article webpage to parse
Returns:	List of sources, typically the sources' URLs
Return type:	list

static get_article_subtitle(webpage)

Parses the webpage and returns the subtitle of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Article subtitle
Return type:	str

static get_article_text(webpage)

Parses the webpage and returns the full plaintext of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Plaintext of the article
Return type:	str

Parsers.WashingtonPost module

class Parsers.WashingtonPost.WashingtonPost

Bases: Parsers.Parsers

static get_article_author(webpage)

Parses the webpage and returns the author of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Author of the article
Return type:	str

static get_article_publish_date(webpage)

Parses the webpage and returns the date the article was published.

Parameters:	webpage – the article webpage to parse
Returns:	Article publish date
Return type:	DateTime object

static get_article_publisher(webpage, url)

Parses the webpage and/or URL and returns the publisher of the article.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	Article publisher, e.g. “The New York Times”
Return type:	str

static get_article_section(webpage, url)

Parses the webpage and/or URL and returns the list of sections and subsections the article belongs to.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	List of section names, ordered from narrowest to broadest
Return type:	list

static get_article_sources(webpage)

Parses the webpage and extracts all sources from the article.

Parameters:	webpage – the article webpage to parse
Returns:	List of sources, typically the sources' URLs
Return type:	list

static get_article_text(webpage)

Parses the webpage and returns the full plaintext of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Plaintext of the article
Return type:	str

static get_article_title(webpage)

Parses the webpage and returns the title (headline) of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Article headline
Return type:	str

Module contents

class Parsers.Parsers

Bases: object

static get_article_author(webpage)

Parses the webpage and returns the author of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Author of the article
Return type:	str

static get_article_publish_date(webpage)

Parses the webpage and returns the date the article was published.

Parameters:	webpage – the article webpage to parse
Returns:	Article publish date
Return type:	DateTime object

static get_article_publisher(webpage, url)

Parses the webpage and/or URL and returns the publisher of the article.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	Article publisher, e.g. “The New York Times”
Return type:	str

static get_article_section(webpage, url)

Parses the webpage and/or URL and returns the list of sections and subsections the article belongs to.

Parameters:	webpage – the article webpage to parse; url – the article's URL
Returns:	List of section names, ordered from narrowest to broadest
Return type:	list

static get_article_sources(webpage)

Parses the webpage and extracts all sources from the article.

Parameters:	webpage – the article webpage to parse
Returns:	List of sources, typically the sources' URLs
Return type:	list

static get_article_subtitle(webpage)

Parses the webpage and returns the subtitle of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Article subtitle
Return type:	str

static get_article_text(webpage)

Parses the webpage and returns the full plaintext of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Plaintext of the article
Return type:	str

static get_article_title(webpage)

Parses the webpage and returns the title (headline) of the article.

Parameters:	webpage – the article webpage to parse
Returns:	Article headline
Return type:	str

url_recognized(url)

Checks whether this parser can parse a given URL.

Parameters:	url – the URL to check
Returns:	True if this parser can parse the given URL
Return type:	bool
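
To show how a URL check of this kind could drive parser selection, here is a minimal sketch. `ExampleParser`, `OtherParser`, the `domains` attribute, and `pick_parser` are all hypothetical stand-ins; this documentation does not say how the real classes match URLs or whether `url_recognized` is a class or instance method.

```python
from urllib.parse import urlparse


class ExampleParser:
    """Hypothetical stand-in for a Parsers subclass."""

    # Assumed attribute; the real package may store domains differently.
    domains = ("example.com",)

    @classmethod
    def url_recognized(cls, url):
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself or any subdomain such as www.example.com.
        return any(host == d or host.endswith("." + d) for d in cls.domains)


class OtherParser(ExampleParser):
    domains = ("example.org",)


def pick_parser(url, parsers=(ExampleParser, OtherParser)):
    """Return the first parser class that recognizes the URL, else None."""
    for parser in parsers:
        if parser.url_recognized(url):
            return parser
    return None


parser = pick_parser("https://www.example.org/news/story")
print(parser.__name__ if parser else "no parser found")
```

Matching on the parsed hostname rather than a substring of the whole URL avoids false positives such as `notexample.com` or a domain name appearing in a query string.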
