scrapd.core package

scrapd.core.apd module

Define the module containing the functions used to scrape data from the APD website.

async scrapd.core.apd.async_retrieve(pages=-1, from_=None, to=None, attempts=1, backoff=1, dump=False)[source]

Retrieve fatality data.

Parameters
  • pages (int) – number of pages to retrieve or -1 for all

  • from_ (str) – the start date

  • to (str) – the end date

  • attempts (int) – number of attempts per report

  • backoff (int) – initial backoff time in seconds

  • dump (bool) – dump reports with parsing issues

Returns

the list of fatalities and the number of pages that were read.

Return type

tuple
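
The attempts and backoff parameters suggest a retry loop around each report fetch. The following is a minimal sketch of that idea, not the scrapd implementation: the helper name retry_with_backoff, its sleep parameter, and the doubling of the delay after each failure are all assumptions made for illustration.

```python
import asyncio


async def retry_with_backoff(make_request, attempts=1, backoff=1, sleep=asyncio.sleep):
    """Await make_request() up to `attempts` times, sleeping between failures."""
    delay = backoff
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            await sleep(delay)
            delay *= 2  # assumed growth factor, not confirmed by the docs
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without real waiting.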

Extract the fatality detail page links from the news page.

Parameters

news_page (str) – HTML content of the news page

Returns

a list of links.

Return type

list or None
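
As an illustration of the link extraction described above, here is a minimal sketch using a regular expression. The helper name extract_detail_links and the link pattern are hypothetical; the real markup of the news page, and the exact condition under which None is returned, are not stated in this reference.

```python
import re

# Assumed link shape; the real news page markup may differ.
DETAIL_LINK = re.compile(r'href="(/news/traffic-fatality-[^"]+)"')


def extract_detail_links(news_page):
    """Return the partial detail page links found in the HTML, or None."""
    links = DETAIL_LINK.findall(news_page)
    # Assumption: an empty result is reported as None rather than [].
    return links or None
```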

scrapd.core.apd.fetch_and_parse(session, url, dump=False)[source]

Parse a fatality page from a URL.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict

async scrapd.core.apd.fetch_detail_page(session, url)[source]

Fetch the content of a detail page.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

Returns

the page content.

Return type

str

async scrapd.core.apd.fetch_news_page(session, page=1)[source]

Fetch the content of a specific news page from the APD website.

The page number starts at 1.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • page (int) – page number to fetch, defaults to 1

Returns

the page content.

Return type

str

scrapd.core.apd.fetch_text(session, url, params=None)[source]

Fetch the data from a URL as text.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

  • params (dict) – request parameters, defaults to None

Returns

the data from a URL as text.

Return type

str

scrapd.core.apd.generate_detail_page_urls(titles)[source]

Generate the full URLs of the fatality detail pages.

Parameters

titles (list) – a list of partial links

Returns

a list of full links to the fatality detail pages.

Return type

list
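
Turning partial links into full URLs can be sketched with urllib.parse.urljoin. The helper name build_detail_urls and the base URL below are placeholders chosen for illustration; the host that scrapd actually uses is not given in this reference.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.org"  # placeholder, not the real APD host


def build_detail_urls(partial_links, base_url=BASE_URL):
    """Resolve each partial link against the base URL."""
    return [urljoin(base_url, link) for link in partial_links]
```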

scrapd.core.apd.has_next(news_page)[source]

Return True if there is another news page available.

Parameters

news_page (str) – the news page to parse

Returns

True if there is another news page available, False otherwise.

Return type

bool

scrapd.core.apd.parse_page(page, url, dump=False)[source]

Parse the page using all parsing methods available.

Parameters
  • page (str) – the content of the fatality page

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.constant module

Define the scrapd constants.

class scrapd.core.constant.Fields[source]

Bases: object

Define the resource constants.

AGE = 'age'
CASE = 'case'
CRASH = 'crash'
DATE = 'date'
DECEASED = 'deceased'
DOB = 'dob'
ETHNICITY = 'ethnicity'
FATALITIES = 'fatalities'
FIRST_NAME = 'first'
GENDER = 'gender'
GENERATION = 'generation'
LAST_NAME = 'last'
LATITUDE = 'latitude'
LOCATION = 'location'
LONGITUDE = 'longitude'
MIDDLE_NAME = 'middle'
NOTES = 'notes'
TIME = 'time'
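
Because the constants are plain strings, they can serve as dictionary keys for a fatality record. The sketch below uses a reduced stand-in for Fields (values copied from the listing above) and a placeholder record, so it runs without scrapd installed.

```python
class Fields:
    """Reduced stand-in for scrapd.core.constant.Fields."""

    AGE = 'age'
    FIRST_NAME = 'first'
    LAST_NAME = 'last'


# Placeholder record, not real data.
record = {Fields.FIRST_NAME: 'Jane', Fields.LAST_NAME: 'Doe', Fields.AGE: 40}

# Using the constants avoids scattering string literals through the code.
full_name = f"{record[Fields.FIRST_NAME]} {record[Fields.LAST_NAME]}"
```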

scrapd.core.date_utils module

Define a module to manipulate dates.

scrapd.core.date_utils.check_dob(dob)[source]

If the year of a date contains only 2 digits, determine the century.

Parameters

dob (datetime.date) – DOB

Returns

DOB with 19xx or 20xx as appropriate

Return type

datetime.date
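
Two-digit years are ambiguous: Python's %y directive, for example, maps 00 to 68 onto 2000 to 2068. One plausible correction, sketched below, is to move any date of birth that lies in the future back by a century. The helper name fix_century and its today parameter are assumptions; check_dob may apply a different rule.

```python
import datetime


def fix_century(dob, today=None):
    """Move a date of birth that lies in the future back by 100 years."""
    today = today or datetime.date.today()
    if dob > today:
        return dob.replace(year=dob.year - 100)
    return dob
```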

scrapd.core.date_utils.compute_age(date, dob)[source]

Compute a victim’s age.

Parameters
  • date (datetime.date) – crash date

  • dob (datetime.date) – date of birth

Returns

the victim’s age.

Return type

int
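
A common way to compute an age in full years is to subtract the birth year from the crash year and take off one year if the birthday has not yet occurred. This is a sketch of that calendar-based approach under a hypothetical name (age_on); compute_age itself may use different arithmetic.

```python
import datetime


def age_on(date, dob):
    """Return the number of full years between dob and date."""
    # Subtract one year if the birthday has not yet occurred in the crash year.
    before_birthday = (date.month, date.day) < (dob.month, dob.day)
    return date.year - dob.year - int(before_birthday)
```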

scrapd.core.date_utils.from_date(date)[source]

Parse the date from a human readable format, with options for the from date.

  • If the date cannot be parsed, datetime.date.min is returned.

  • If the day of the month is not specified, the first day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date

scrapd.core.date_utils.is_before(d1, d2)[source]

Return True if d1 is strictly before d2.

Parameters
  • d1 (datetime.date) – date 1

  • d2 (datetime.date) – date 2

Returns

True if d1 is strictly before d2.

Return type

bool

scrapd.core.date_utils.is_between(date, from_=None, to=None)[source]

Check whether a date falls between two others.

Parameters
  • date (datetime.date) – date to check

  • from_ (datetime.date) – start date, defaults to None

  • to (datetime.date) – end date, defaults to None

Returns

True if the date is between from_ and to.

Return type

bool
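
Since both bounds default to None, a natural reading is that a missing bound means "unbounded" on that side. The sketch below follows that reading and treats both bounds as inclusive; the inclusivity and the helper name within are assumptions, not documented behavior.

```python
import datetime


def within(date, from_=None, to=None):
    """Return True if date lies in [from_, to]; None means unbounded."""
    lower = from_ or datetime.date.min
    upper = to or datetime.date.max
    return lower <= date <= upper
```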

scrapd.core.date_utils.parse_date(date, default=None, settings=None)[source]

Parse the date from a human readable format.

If no default value is specified and there is an error, an exception is raised. Otherwise the default value is returned.

Parameters
Returns

a date object representing the date.

Return type

datetime.date
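
The "raise unless a default is given" behavior can be sketched with the standard library alone. The helper name parse_or_default and the fixed fmt parameter are illustrative assumptions; the real function accepts free-form human-readable dates and a settings argument that this sketch does not model.

```python
import datetime


def parse_or_default(text, default=None, fmt="%m/%d/%Y"):
    """Parse text as a date; on failure return default, or raise if it is None."""
    try:
        return datetime.datetime.strptime(text, fmt).date()
    except ValueError:
        if default is not None:
            return default
        raise
```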

scrapd.core.date_utils.parse_time(time)[source]

Parse the time from a human readable format.

Parameters

time (str) – time

Returns

a time object representing the time.

Return type

datetime.time

scrapd.core.date_utils.to_date(date)[source]

Parse the date from a human readable format, with options for the to date.

  • If the date cannot be parsed, datetime.date.max is returned.

  • If the day of the month is not specified, the last day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date
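
The rules for from_date and to_date mirror each other: an unparseable value widens the range to datetime.date.min or datetime.date.max, and a missing day snaps to the first or last day of the month. The sketch below illustrates both with a single hypothetical helper (parse_bound) that only understands a few numeric formats; the real functions accept human-readable input.

```python
import calendar
import datetime


def parse_bound(text, upper):
    """Parse a range bound; `upper` selects the to-date rules."""
    # Full dates are returned unchanged.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    # Month-only dates snap to the first or last day of the month.
    for fmt in ("%Y-%m", "%m/%Y"):
        try:
            parsed = datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue
        last_day = calendar.monthrange(parsed.year, parsed.month)[1]
        return datetime.date(parsed.year, parsed.month, last_day if upper else 1)
    # Unparseable input leaves that side of the range open.
    return datetime.date.max if upper else datetime.date.min
```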

scrapd.core.formatter module

Define the formatter module.

This module contains all the classes able to print the results. The destination depends on the formatter used and can be stdout, stderr, a file, or even remote storage if the formatter allows it.

class scrapd.core.formatter.CSVFormatter(format_='json', output=None)[source]

Bases: scrapd.core.formatter.Formatter

Define the CSV formatter.

Displays the results as a CSV.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
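
Writing a list of dictionaries as CSV can be sketched with csv.DictWriter. The helper name write_csv is hypothetical, and the sketch assumes every row has the same keys as the first one; how CSVFormatter picks and orders its columns is not stated here.

```python
import csv
import io


def write_csv(results, output):
    """Write a list of dicts as CSV, using the keys of the first row as header."""
    if not results:
        return
    writer = csv.DictWriter(output, fieldnames=list(results[0]))
    writer.writeheader()
    writer.writerows(results)


buffer = io.StringIO()
write_csv([{'first': 'Jane', 'last': 'Doe'}], buffer)  # placeholder record
```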

class scrapd.core.formatter.CountFormatter(format_='json', output=None)[source]

Bases: scrapd.core.formatter.Formatter

Define the Count formatter.

Simply displays the number of results matching the search criteria.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

class scrapd.core.formatter.Formatter(format_='json', output=None)[source]

Bases: object

Define the Formatter base class.

The default printer method simply uses the print() function.

formatters = {'count': <class 'scrapd.core.formatter.CountFormatter'>, 'csv': <class 'scrapd.core.formatter.CSVFormatter'>, 'json': <class 'scrapd.core.formatter.JSONFormatter'>, 'python': <class 'scrapd.core.formatter.PythonFormatter'>}
print(results, **kwargs)[source]

Print the results with the appropriate formatter.

Parameters

results (list(dict)) – the results to display.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
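
The formatters mapping and the print/printer pair suggest a registry pattern: subclasses register under a name, and print() delegates to the printer() of the class selected by format_. The sketch below shows one way to build such a registry with __init_subclass__. The class names BaseFormatter and CountOnly and the name attribute are hypothetical, and the actual registration mechanism in scrapd may differ.

```python
import io
import sys


class BaseFormatter:
    """Hypothetical stand-in for the Formatter base class."""

    formatters = {}
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every named subclass in the shared registry.
        if cls.name:
            BaseFormatter.formatters[cls.name] = cls

    def __init__(self, format_='json', output=None):
        self.format_ = format_
        self.output = sys.stdout if output is None else output

    def print(self, results, **kwargs):
        # Look up the concrete class and delegate to its printer().
        formatter = self.formatters[self.format_](self.format_, self.output)
        formatter.printer(results, **kwargs)

    def printer(self, results, **kwargs):
        print(results, file=self.output)


class CountOnly(BaseFormatter):
    name = 'count'

    def printer(self, results, **kwargs):
        print(len(results), file=self.output)


buffer = io.StringIO()
BaseFormatter('count', buffer).print([{}, {}, {}])
```

After the last line, buffer holds the count followed by a newline.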

class scrapd.core.formatter.JSONFormatter(format_='json', output=None)[source]

Bases: scrapd.core.formatter.Formatter

Define the JSON formatter.

Displays the results as JSON. The keys are sorted and an indentation of 2 spaces is set.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
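
Sorted keys and a 2-space indentation correspond to the sort_keys and indent arguments of json.dumps. A minimal sketch with a placeholder record:

```python
import json

results = [{'last': 'Doe', 'first': 'Jane'}]  # placeholder record
text = json.dumps(results, sort_keys=True, indent=2)
print(text)
# [
#   {
#     "first": "Jane",
#     "last": "Doe"
#   }
# ]
```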

class scrapd.core.formatter.PythonFormatter(format_='json', output=None)[source]

Bases: scrapd.core.formatter.Formatter

Define the Python formatter.

Displays the results using PrettyPrinter with an indentation of 2 spaces.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
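
The standard library's pprint.PrettyPrinter accepts both an indent and a stream argument, so the description above can be sketched as follows. Writing to an in-memory buffer here is only for illustration; which stream PythonFormatter writes to is governed by its output argument.

```python
import io
import pprint

results = [{'last': 'Doe', 'first': 'Jane'}]  # placeholder record
buffer = io.StringIO()
pprint.PrettyPrinter(indent=2, stream=buffer).pprint(results)
```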

scrapd.core.formatter.json_serializers(obj)[source]

Convert custom objects to string for serialization.

Return type

str
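
Objects such as dates are not JSON serializable by default, so json.dumps needs a fallback passed through its default argument. The sketch below shows the general shape of such a hook. The name serialize_extra and the choice of ISO 8601 output are assumptions; json_serializers may handle other types or use another string format.

```python
import datetime
import json


def serialize_extra(obj):
    """Fallback for json.dumps: turn date and time objects into strings."""
    if isinstance(obj, (datetime.date, datetime.time)):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


text = json.dumps({'date': datetime.date(2019, 1, 16)}, default=serialize_extra)
```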

scrapd.core.formatter.to_json(results)[source]

Convert a dict of parsed fields to a JSON string.

Parameters

results (dict) – results of scraping the APD news site

Return type

str

scrapd.core.version module

Define a set of utility functions for managing versions.

scrapd.core.version.detect_from_metadata(package)[source]

Detect a package version number from the metadata.

If the version number cannot be detected, the function returns 0.

Parameters

package (str) – package name

Returns

the package version number.

Return type

str
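
One way to read a version from package metadata with the standard library (Python 3.8+) is importlib.metadata. The sketch below mirrors the documented fallback of returning 0; the helper name detect_version is hypothetical, and detect_from_metadata may rely on a different mechanism.

```python
from importlib import metadata


def detect_version(package):
    """Return the installed version of package, or 0 if it cannot be found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return 0
```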