scrapd.core package

scrapd.core.apd module

Define the module containing the functions used to scrape data from the APD website.

scrapd.core.apd.async_retrieve(pages=-1, from_=None, to=None, attempts=1, backoff=1)[source]

Retrieve fatality data.

Parameters
  • pages (int) – number of pages to retrieve, or -1 for all

  • from_ (str) – the start date

  • to (str) – the end date

Returns

the list of fatalities and the number of pages that were read.

Return type

tuple
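
A minimal usage sketch (assuming async_retrieve() is a coroutine that can be driven with asyncio.run(), and that from_/to accept human-readable date strings; the dates below are hypothetical):

    import asyncio

    from scrapd.core import apd

    # Retrieve a single news page worth of fatalities for a hypothetical date range.
    fatalities, page_count = asyncio.run(
        apd.async_retrieve(pages=1, from_="Jan 1 2019", to="Jan 31 2019"))
    print(f"{len(fatalities)} fatalities across {page_count} page(s)")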

scrapd.core.apd.common_fatality_parsing(d)[source]

Perform parsing common to Twitter descriptions and page content.

Ensures that the values are all strings and removes the ‘Deceased’ field, which no longer contains relevant information.

Parameters

d (dict) – the fatality to finish parsing

Returns

A dictionary containing the detailed information about the fatality, with sanitized entries.

Return type

dict, list

Search for the DOB in a deceased field.

Parameters

split_deceased_field (list) – a list representing the deceased field

Returns

the DOB index within the split deceased field.

Return type

int

Extract the fatality detail page links from the news page.

Parameters

news_page (str) – HTML content of the news page

Returns

a list of links.

Return type

list or None

scrapd.core.apd.extract_twitter_description_meta(page)[source]

Extract the Twitter description from the metadata fields.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the Twitter description.

Return type

str
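
A hedged sketch, assuming the Twitter description is carried by a <meta name="twitter:description"> tag in the detail page HTML (the page content below is fabricated for illustration):

    from scrapd.core import apd

    # Minimal fabricated detail page containing only the relevant meta tag.
    page = (
        '<html><head>'
        '<meta name="twitter:description" content="Case: 19-0400694 Date: February 9, 2019" />'
        '</head></html>'
    )
    print(apd.extract_twitter_description_meta(page))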

scrapd.core.apd.extract_twitter_tittle_meta(page)[source]

Extract the Twitter title from the metadata fields.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the Twitter title.

Return type

str

scrapd.core.apd.fetch_and_parse(session, url)[source]

Parse a fatality page from a URL.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict
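
A usage sketch combining an aiohttp session with fetch_and_parse(); the detail page URL is hypothetical, as real URLs are produced by generate_detail_page_urls():

    import asyncio

    import aiohttp

    from scrapd.core import apd

    async def main():
        # Hypothetical detail page URL.
        url = 'http://austintexas.gov/news/traffic-fatality-1-3'
        async with aiohttp.ClientSession() as session:
            fatality = await apd.fetch_and_parse(session, url)
        print(fatality)

    asyncio.run(main())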

scrapd.core.apd.fetch_detail_page(session, url)[source]

Fetch the content of a detail page.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

Returns

the page content.

Return type

str

scrapd.core.apd.fetch_news_page(session, page=1)[source]

Fetch the content of a specific news page from the APD website.

The page number starts at 1.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • page (int) – page number to fetch, defaults to 1

Returns

the page content.

Return type

str

scrapd.core.apd.fetch_text(session, url, params=None)[source]

Fetch the data from a URL as text.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

  • params (dict) – request parameters, defaults to None

Returns

the data from a URL as text.

Return type

str

scrapd.core.apd.generate_detail_page_urls(titles)[source]

Generate the full URLs of the fatality detail pages.

Parameters

titles (list) – a list of partial links

Returns

a list of full links to the fatality detail pages.

Return type

list

scrapd.core.apd.has_next(news_page)[source]

Return True if there is another news page available.

Parameters

news_page (str) – the news page to parse

Returns

True if there is another news page available, False otherwise.

Return type

bool

scrapd.core.apd.match_pattern(text, pattern, group_number=0)[source]

Match a pattern.

Parameters
  • text (str) – the text to match the pattern against

  • pattern (compiled regex) – the pattern to look for

  • group_number (int) – the capturing group number

Returns

a string representing the captured group.

Return type

str
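
A sketch assuming match_pattern() runs the compiled pattern against the text and returns the requested capturing group as a string:

    import re

    from scrapd.core import apd

    # Hypothetical pattern extracting a case number from a snippet of text.
    case_pattern = re.compile(r'Case:\s*(\d{2}-\d+)')
    print(apd.match_pattern('Case: 19-0400694', case_pattern, group_number=1))  # expected: 19-0400694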

scrapd.core.apd.notes_from_element(deceased, deceased_field_str)[source]

Get Notes from deceased field’s BeautifulSoup element.

Parameters
  • deceased (bs4.element.Tag) – the first <p> tag of the Deceased field of the APD bulletin

  • deceased_field_str – the string corresponding to the Deceased field

Returns

notes from the Deceased field of the APD bulletin

Return type

str

scrapd.core.apd.parse_age_deceased_field(deceased_field)[source]

Parse deceased field assuming it contains an age.

Parameters

deceased_field (str) – the deceased field

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_case_field(page)[source]

Extract the case number from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the case number.

Return type

str

scrapd.core.apd.parse_comma_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with commas.

Parameters

deceased_field (str) – the deceased field as a string

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_crashes_field(page)[source]

Extract the crash number from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the crash number.

Return type

str

scrapd.core.apd.parse_date_field(page)[source]

Extract the date from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the date.

Return type

str

scrapd.core.apd.parse_deceased_field(soup)[source]

Extract content from deceased field on the fatality page.

Parameters

soup (bs4.BeautifulSoup) – the content of the fatality page

Returns

a tuple containing the tag for the Deceased paragraph and the Deceased field as a string

Return type

tuple

scrapd.core.apd.parse_deceased_field_common(split_deceased_field, fleg)[source]

Parse the deceased field.

Parameters
  • split_deceased_field (list) – a list representing the deceased field

  • fleg (dict) – a dictionary containing First, Last, Ethnicity, Gender fields

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_details_page_notes(details_page_notes)[source]

Clean up a details page notes section.

This function attempts to extract the sentences about the crash with some level of fidelity, but it does not always return perfectly parsed sentences, as the HTML syntax varies widely.

Parameters

details_page_notes (str) – the paragraph after the Deceased information

Returns

A paragraph containing the details of the fatality in sentence form.

Return type

str

scrapd.core.apd.parse_fleg(fleg)[source]

Parse FLEG. FLEG stands for First, Last, Ethnicity, Gender.

Parameters

fleg (list) – values representing the fleg.

Returns

a dictionary containing First, Last, Ethnicity, Gender fields

Return type

dict

scrapd.core.apd.parse_location_field(page)[source]

Extract the location information from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

scrapd.core.apd.parse_name(name)[source]

Parse the victim’s name.

Parameters

name (list) – a list representing the deceased person’s full name split on space characters

Returns

a dictionary representing just the victim’s first and last name

Return type

dict
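
A sketch with hypothetical name tokens, assuming the first and last tokens map to the ‘First Name’ and ‘Last Name’ fields:

    from scrapd.core import apd

    # The full name is passed as a list of tokens split on spaces.
    print(apd.parse_name(['Garrett', 'Evan', 'Davis']))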

scrapd.core.apd.parse_page(page, url)[source]

Parse the page using all parsing methods available.

Parameters
  • page (str) – the content of the fatality page

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_page_content(detail_page, notes_parsed=False)[source]

Parse the detail page to extract fatality information.

Parameters

detail_page (str) – the content of the fatality page

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_pipe_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with pipes.

Parameters

deceased_field (str) – the deceased field as a string.

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_space_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with spaces.

Parameters

deceased_field (str) – the deceased field as a string.

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_time_field(page)[source]

Extract the time from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the time.

Return type

str

scrapd.core.apd.parse_twitter_description(twitter_description)[source]

Parse the Twitter description metadata.

The Twitter description contains all the information that we need, and even though it is still unstructured data, it is easier to parse than the data from the detail page.

Parameters

twitter_description (str) – Twitter description embedded in the fatality details page

Returns

A dictionary containing the detailed information about the fatality.

Return type

dict

scrapd.core.apd.parse_twitter_fields(page)[source]

Parse the Twitter fields on a detail page.

Parameters

page (str) – the content of the fatality page

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_twitter_title(twitter_title)[source]

Parse the Twitter title metadata.

Parameters

twitter_title (str) – Twitter title embedded in the fatality details page

Returns

A dictionary containing the ‘Fatal crashes this year’ field.

Return type

dict

scrapd.core.apd.process_deceased_field(deceased_field)[source]

Parse the deceased field.

At this point the deceased field, if it exists, is garbage as it contains First Name, Last Name, Ethnicity, Gender, D.O.B. and Notes. We need to explode this data into the appropriate fields.

Parameters

deceased_field (str) – the deceased field from the fatality report

Returns

a dictionary representing a deceased field.

Return type

dict
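
A hedged sketch with a fabricated deceased field; APD bulletins use several layouts (comma, pipe, or space delimited), and process_deceased_field() is assumed to dispatch to the matching parser:

    from scrapd.core import apd

    # Fabricated comma-delimited deceased field.
    deceased = 'Garrett Davis, White male, DOB 10/30/1987'
    entry = apd.process_deceased_field(deceased)
    print(entry)  # the keys are the constants from scrapd.core.constant.Fields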

scrapd.core.apd.sanitize_fatality_entity(d)[source]

Clean up a fatality entity.

Returns

A dictionary containing the detailed information about the fatality, with sanitized entries.

Return type

dict

scrapd.core.constant module

Define the scrapd constants.

class scrapd.core.constant.Fields[source]

Bases: object

Define the resource constants.

AGE = 'Age'
CASE = 'Case'
CRASHES = 'Fatal crashes this year'
DATE = 'Date'
DECEASED = 'Deceased'
DOB = 'DOB'
ETHNICITY = 'Ethnicity'
FIRST_NAME = 'First Name'
GENDER = 'Gender'
LAST_NAME = 'Last Name'
LOCATION = 'Location'
NOTES = 'Notes'
TIME = 'Time'
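
The constants are plain strings and serve as the keys of a parsed fatality entry, for example (values are fabricated):

    from scrapd.core.constant import Fields

    # Build a partial entry keyed by the resource constants.
    entry = {Fields.CASE: '19-0400694', Fields.DATE: '02/09/2019'}
    print(entry[Fields.CASE])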

scrapd.core.date_utils module

Define a module to manipulate dates.

scrapd.core.date_utils.check_dob(dob)[source]

If the year in a date only contains 2 digits, determine the century.

Parameters

dob (datetime.date) – DOB

Returns

DOB with 19xx or 20xx as appropriate

Return type

datetime.date

scrapd.core.date_utils.compute_age(date, dob)[source]

Compute a victim’s age.

Parameters
  • date (datetime.date) – crash date

  • dob (datetime.date) – date of birth

Returns

the victim’s age.

Return type

int
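
A sketch assuming both arguments are datetime.date objects, as documented above (the dates are fabricated):

    import datetime

    from scrapd.core import date_utils

    crash_date = datetime.date(2019, 2, 9)
    dob = datetime.date(1987, 10, 30)
    print(date_utils.compute_age(crash_date, dob))  # expected: 31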

scrapd.core.date_utils.from_date(date)[source]

Parse the date from a human readable format, with options for the from date.

  • If the date cannot be parsed, datetime.date.min is returned.

  • If the day of the month is not specified, the first day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date

scrapd.core.date_utils.is_before(d1, d2)[source]

Return True if d1 is strictly before d2.

Parameters
  • d1 (datetime.date) – date 1

  • d2 (datetime.date) – date 2

Returns

True if d1 is strictly before d2, False otherwise.

Return type

bool

scrapd.core.date_utils.is_between(date, from_=None, to=None)[source]

Check whether a date falls between two other dates.

Parameters
  • date (datetime.date) – date to check

  • from_ (datetime.date) – start date, defaults to None

  • to (datetime.date) – end date, defaults to None

Returns

True if the date is between from_ and to

Return type

bool
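
A sketch checking whether a fabricated crash date falls within a reporting period:

    import datetime

    from scrapd.core import date_utils

    d = datetime.date(2019, 2, 9)
    print(date_utils.is_between(d,
                                from_=datetime.date(2019, 1, 1),
                                to=datetime.date(2019, 12, 31)))  # expected: True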

scrapd.core.date_utils.parse_date(date, default=None, settings=None)[source]

Parse the date from a human readable format.

If no default value is specified and there is an error, an exception is raised. Otherwise the default value is returned.

Parameters
  • date (str) – date

  • default – the value to return if the date cannot be parsed, defaults to None

  • settings – additional settings passed to the underlying date parser, defaults to None

Returns

a date object representing the date.

Return type

datetime.date

scrapd.core.date_utils.to_date(date)[source]

Parse the date from a human readable format, with options for the to date.

  • If the date cannot be parsed, datetime.date.max is returned.

  • If the day of the month is not specified, the last day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date
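
A sketch of the documented behaviors of from_date() and to_date(); the expected values assume a month-only input resolves to the first and last day of that month respectively:

    from scrapd.core import date_utils

    print(date_utils.from_date('Jan 2019'))    # expected: 2019-01-01
    print(date_utils.to_date('Jan 2019'))      # expected: 2019-01-31
    print(date_utils.from_date('not a date'))  # expected: datetime.date.min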

scrapd.core.formatter module

Define the formatter module.

This module contains all the classes with the ability to print the results. Their destination depends on the custom formatter used to print the results and can be stdout, stderr, a file, or even remote storage if the formatter allows it.

class scrapd.core.formatter.CSVFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the CSV formatter.

Displays the results as a CSV.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

class scrapd.core.formatter.CountFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Count formatter.

Simply displays the number of results matching the search criteria.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

class scrapd.core.formatter.Formatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: object

Define the Formatter base class.

The default printer method simply uses the print() function.

date_serialize(obj)[source]

Convert date objects to string for serialization.

Return type

str

formatters = {'count': <class 'scrapd.core.formatter.CountFormatter'>, 'csv': <class 'scrapd.core.formatter.CSVFormatter'>, 'json': <class 'scrapd.core.formatter.JSONFormatter'>, 'python': <class 'scrapd.core.formatter.PythonFormatter'>}
print(results, **kwargs)[source]

Print the results with the appropriate formatter.

Parameters

results (list(dict)) – the results to display.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

to_json_string(results)[source]

Convert dict of parsed fields to JSON string.

Parameters

results (dict) – the results of scraping the APD news site

Return type

str

class scrapd.core.formatter.JSONFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the JSON formatter.

Displays the results as JSON. The keys are sorted and an indentation of 2 spaces is set.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
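
A hedged sketch printing a single fabricated entry to stdout as sorted, 2-space-indented JSON:

    import sys

    from scrapd.core import formatter

    fmt = formatter.JSONFormatter(format_='json', output=sys.stdout)
    fmt.printer([{'Case': '19-0400694', 'Date': '02/09/2019'}])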

class scrapd.core.formatter.PythonFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Python formatter.

Displays the results using PrettyPrinter with an indentation of 2 spaces.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

scrapd.core.version module

Define a set of utility functions for managing versions.

scrapd.core.version.detect_from_metadata(package)[source]

Detect a package version number from the metadata.

If the version number cannot be detected, the function returns 0.

Parameters

package (str) – package name

Returns

the package version number.

Return type

str
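
A minimal sketch, assuming the package is installed and its metadata is available:

    from scrapd.core import version

    # Per the docstring above, 0 is returned if the version cannot be detected.
    print(version.detect_from_metadata('scrapd'))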