scrapd.core package

scrapd.core.apd module

Define the module containing the function used to scrape data from the APD website.

scrapd.core.apd.async_retrieve(pages=-1, from_=None, to=None)[source]

Retrieve fatality data.

Search for the DOB in a deceased field.

Parameters:split_deceased_field (list) – a list representing the deceased field
Returns:the DOB index within the split deceased field.
Return type:int

Extract the fatality detail page links from the news page.

Parameters:news_page (str) – HTML content of the news page
Returns:a list of links.
Return type:list or None
scrapd.core.apd.fetch_and_parse(session, url)[source]

Parse a fatality page from a URL.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session
  • url (str) – detail page URL
Returns:

a dictionary representing a fatality.

Return type:

dict

scrapd.core.apd.fetch_detail_page(session, url)[source]

Fetch the content of a detail page.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session
  • url (str) – request URL
Returns:

the page content.

Return type:

str

scrapd.core.apd.fetch_news_page(session, page=1)[source]

Fetch the content of a specific news page from the APD website.

The page number starts at 1.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session
  • page (int) – page number to fetch, defaults to 1
Returns:

the page content.

Return type:

str

scrapd.core.apd.fetch_text(session, url, params=None)[source]

Fetch the data from a URL as text.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session
  • url (str) – request URL
  • params (dict) – request parameters, defaults to None
Returns:

the data from a URL as text.

Return type:

str

scrapd.core.apd.generate_detail_page_urls(titles)[source]

Generate the full URLs of the fatality detail pages.

Parameters:titles (list) – a list of partial links
Returns:a list of full links to the fatality detail pages.
Return type:list
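
As a rough illustration of turning partial links into full URLs, the sketch below joins each partial link with a base URL. The base URL, the sample links and the function name are hypothetical; this is not the library's implementation.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only for this illustration.
BASE_URL = "https://example.org"


def generate_detail_page_urls_sketch(titles, base_url=BASE_URL):
    """Join each partial link with the base URL to build a full link."""
    return [urljoin(base_url, title) for title in titles]


urls = generate_detail_page_urls_sketch(["/news/traffic-fatality-1", "/news/traffic-fatality-2"])
print(urls)
```
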
scrapd.core.apd.has_next(news_page)[source]

Return True if there is another news page available.

Parameters:news_page (str) – the news page to parse
Returns:True if there is another news page available, False otherwise.
Return type:bool
scrapd.core.apd.parse_comma_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with commas.

Parameters:split_deceased_field (list) – a list representing the deceased field
Returns:a dictionary representing the deceased field.
Return type:dict
scrapd.core.apd.parse_deceased_field(deceased_field)[source]

Parse the deceased field.

At this point the deceased field, if it exists, is garbage as it contains First Name, Last Name, Ethnicity, Gender, D.O.B. and Notes. We need to explode this data into the appropriate fields.

Parameters:deceased_field (str) – the deceased field from the fatality report
Returns:a dictionary representing a deceased field.
Return type:dict
scrapd.core.apd.parse_details_page_notes(details_page_notes)[source]

Clean up a details page notes section.

This function attempts to extract the sentences about the crash with some level of fidelity. It does not always return a perfectly parsed sentence because the HTML syntax varies widely.

Parameters:details_page_notes (str) – the paragraph after the Deceased information
Returns:A paragraph containing the details of the fatality in sentence form.
Return type:str
scrapd.core.apd.parse_fleg(fleg)[source]

Parse FLEG.

Parameters:fleg (list) – [description]
Returns:[description]
Return type:dict
scrapd.core.apd.parse_name(name)[source]

Parse the victim’s name.

Parameters:name (list) – a list representing the deceased person’s full name split on space characters
Returns:a dictionary representing just the victim’s first and last name
Return type:dict
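
A minimal sketch of how a split name could be reduced to a first and last name, assuming the first token is the first name and the last token is the last name. The function name and the sample name are made up; the dictionary keys reuse the 'First Name' and 'Last Name' values listed in scrapd.core.constant.Fields.

```python
def parse_name_sketch(name):
    """Keep only the first and last tokens of a name split on spaces."""
    parsed = {}
    if name:
        parsed["First Name"] = name[0]
    if len(name) > 1:
        # Middle names and initials are ignored in this sketch.
        parsed["Last Name"] = name[-1]
    return parsed


print(parse_name_sketch("Jane Q. Doe".split()))
```
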
scrapd.core.apd.parse_page(page)[source]

Parse the page using all parsing methods available.

Parameters:page (str) – the content of the fatality page
scrapd.core.apd.parse_page_content(detail_page, notes_parsed=False)[source]

Parse the detail page to extract fatality information.

Parameters:detail_page (str) – the content of the fatality page
Returns:a dictionary representing a fatality.
Return type:dict
scrapd.core.apd.parse_pipe_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with pipes.

Parameters:deceased_field (str) – the deceased field as a string.
Returns:a dictionary representing the deceased field.
Return type:dict
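
To make the idea of a pipe-delimited deceased field concrete, here is a sketch under a strong assumption: the field has exactly three parts, "name | ethnicity gender | DOB". That layout, the sample value and the function name are hypothetical; the real parser may handle more variants.

```python
def parse_pipe_delimited_sketch(deceased_field):
    """Split a 'name | ethnicity gender | DOB' string into named fields."""
    parts = [part.strip() for part in deceased_field.split("|")]
    if len(parts) != 3:
        raise ValueError("expected exactly 3 pipe-delimited parts")
    full_name, ethnicity_gender, dob = parts
    name_tokens = full_name.split()
    if not name_tokens:
        raise ValueError("the name part is empty")
    # The last word is taken as the gender, everything before it as the ethnicity.
    ethnicity, _, gender = ethnicity_gender.rpartition(" ")
    return {
        "First Name": name_tokens[0],
        "Last Name": name_tokens[-1],
        "Ethnicity": ethnicity,
        "Gender": gender,
        "DOB": dob,
    }


print(parse_pipe_delimited_sketch("Jane Q. Doe | White female | 01/02/1990"))
```
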
scrapd.core.apd.parse_space_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with spaces.

Parameters:deceased_field (str) – the deceased field as a string.
Returns:a dictionary representing the deceased field.
Return type:dict
scrapd.core.apd.parse_twitter_description(twitter_description)[source]

Parse the Twitter description metadata.

The Twitter description contains all the information that we need, and even though it is still unstructured data, it is easier to parse than the data from the detail page.

Parameters:twitter_description (str) – Twitter description embedded in the fatality details page
Returns:A dictionary containing the detailed information about the fatality.
Return type:dict
scrapd.core.apd.parse_twitter_fields(page)[source]

Parse the Twitter fields on a detail page.

Parameters:page (str) – the content of the fatality page
Returns:a dictionary representing a fatality.
Return type:dict
scrapd.core.apd.parse_twitter_title(twitter_title)[source]

Parse the Twitter title metadata.

Parameters:twitter_title (str) – Twitter title embedded in the fatality details page
Returns:A dictionary containing the ‘Fatal crashes this year’ field.
Return type:dict
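
The sketch below shows one way a crash counter could be pulled out of a title. It assumes the title carries the count after a "#" character, which is a guess about the input format; the function name and sample title are illustrative only.

```python
import re


def parse_twitter_title_sketch(twitter_title):
    """Extract the number following '#' into the 'Fatal crashes this year' field."""
    match = re.search(r"#(\d+)", twitter_title or "")
    if not match:
        return {}
    return {"Fatal crashes this year": match.group(1)}


print(parse_twitter_title_sketch("Traffic Fatality #12"))
```
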
scrapd.core.apd.sanitize_fatality_entity(d)[source]

Clean up a fatality entity.

Ensures that the values are all strings and removes the ‘Deceased’ field which does not contain relevant information anymore.

Parameters:d (dict) – the fatality to sanitize
Returns:A dictionary containing the detailed information about the fatality with sanitized entries.
Return type:dict
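
The two steps described above (stringify the values, drop the 'Deceased' field) can be sketched as follows. The function name and sample data are illustrative, and the actual function may perform additional cleanup.

```python
def sanitize_fatality_entity_sketch(d):
    """Return a copy with string values and without the 'Deceased' key."""
    return {key: str(value) for key, value in d.items() if key != "Deceased"}


fatality = {"Age": 30, "Case": "19-0000001", "Deceased": "raw unparsed text"}
print(sanitize_fatality_entity_sketch(fatality))
```
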

scrapd.core.constant module

Define the scrapd constants.

class scrapd.core.constant.Constant[source]

Bases: abc.ABC

Define the constant class.

class scrapd.core.constant.Fields[source]

Bases: scrapd.core.constant.Constant

Define the resource constants.

AGE = 'Age'
CASE = 'Case'
CRASHES = 'Fatal crashes this year'
DATE = 'Date'
DECEASED = 'Deceased'
DOB = 'DOB'
ETHNICITY = 'Ethnicity'
FIRST_NAME = 'First Name'
GENDER = 'Gender'
LAST_NAME = 'Last Name'
LOCATION = 'Location'
NOTES = 'Notes'
TIME = 'Time'

scrapd.core.date_utils module

Define a module to manipulate dates.

scrapd.core.date_utils.check_dob(dob)[source]

If the year of a date contains only 2 digits, determine the century.

Parameters:dob (datetime.datetime) – DOB
Returns:DOB with 19xx or 20xx as appropriate
Return type:datetime.datetime
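
One plausible way to pick the century, shown here as an assumption rather than the library's actual rule: a date of birth cannot be in the future, so a two-digit year that was parsed into a future date is moved back 100 years. The `today` parameter is a hypothetical addition that keeps the example deterministic.

```python
from datetime import datetime


def check_dob_sketch(dob, today=None):
    """Move a DOB that lies in the future back by one century."""
    today = today or datetime.now()
    if dob > today:
        return dob.replace(year=dob.year - 100)
    return dob


# strptime maps the two-digit year "60" to 2060, which cannot be a birth year yet.
parsed = datetime.strptime("01/02/60", "%m/%d/%y")
print(check_dob_sketch(parsed, today=datetime(2020, 1, 1)))
```
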
scrapd.core.date_utils.clean_date_string(date, is_dob=False)[source]

Parse the date from an unspecified format to the specified format.

Parameters:
  • date (str) – date
  • is_dob (boolean) – True if date is DOB, otherwise False
Returns:

a date string in the uniform %m/%d/%Y format.

Return type:

str

scrapd.core.date_utils.compute_age(date, dob)[source]

Compute a victim’s age.

Parameters:
  • date (str) – crash date
  • dob (str) – date of birth
Returns:

the victim’s age.

Return type:

int
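
A small sketch of the age computation, assuming both inputs use the %m/%d/%Y format mentioned for clean_date_string. The real function may accept other formats; the name and sample dates are illustrative.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed input format for this sketch


def compute_age_sketch(date, dob):
    """Return the age in whole years on the crash date."""
    crash = datetime.strptime(date, DATE_FORMAT)
    birth = datetime.strptime(dob, DATE_FORMAT)
    # Subtract one year if the birthday has not yet occurred in the crash year.
    had_birthday = (crash.month, crash.day) >= (birth.month, birth.day)
    return crash.year - birth.year - (0 if had_birthday else 1)


print(compute_age_sketch("01/15/2019", "06/24/1988"))
```
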

scrapd.core.date_utils.from_date(date)[source]

Parse the date from a human readable format, with options for the from date.

  • If the date cannot be parsed, datetime.datetime.min is returned.
  • If the day of the month is not specified, the first day is used.
Parameters:date (str) – date
Returns:a date object representing the date.
Return type:datetime.datetime
scrapd.core.date_utils.is_in_range(date, from_=None, to=None)[source]

Check whether a date falls between two others.

Parameters:
  • date (str) – date to check
  • from_ (str) – start date, defaults to None
  • to (str) – end date, defaults to None
Returns:

True if the date is between from_ and to.

Return type:

bool
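
The range check can be sketched as below. It assumes %m/%d/%Y strings, inclusive bounds, and that a missing bound leaves that side of the range open; none of these details are confirmed by the reference above.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed input format for this sketch


def is_in_range_sketch(date, from_=None, to=None):
    """Return True if date lies between from_ and to (inclusive here)."""
    current = datetime.strptime(date, DATE_FORMAT)
    lower = datetime.strptime(from_, DATE_FORMAT) if from_ else datetime.min
    upper = datetime.strptime(to, DATE_FORMAT) if to else datetime.max
    return lower <= current <= upper


print(is_in_range_sketch("01/15/2019", from_="01/01/2019", to="01/31/2019"))
```
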

scrapd.core.date_utils.is_posterior(d1, d2)[source]

Return True if d1 is posterior to d2 (i.e. it happened after).

Parameters:
  • d1 (str) – date 1
  • d2 (str) – date 2
Returns:

True if d1 is posterior to d2.

Return type:

bool

scrapd.core.date_utils.parse_date(date, default=None, settings=None)[source]

Parse the date from a human readable format.

If no default value is specified and there is an error, an exception is raised. Otherwise the default value is returned.

Parameters:
Returns:

a date object representing the date.

Return type:

datetime.datetime
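
The "raise unless a default is given" behavior described above can be illustrated with a sketch. It only understands the %m/%d/%Y format and ignores the settings argument; both simplifications are assumptions made for the example.

```python
from datetime import datetime


def parse_date_sketch(date, default=None):
    """Parse a date, falling back to default when one is provided."""
    try:
        return datetime.strptime(date, "%m/%d/%Y")
    except ValueError:
        if default is not None:
            return default
        # No default value: propagate the parsing error to the caller.
        raise


print(parse_date_sketch("01/15/2019"))
print(parse_date_sketch("not a date", default=datetime.min))
```
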

scrapd.core.date_utils.to_date(date)[source]

Parse the date from a human readable format, with options for the to date.

  • If the date cannot be parsed, datetime.datetime.max is returned.
  • If the day of the month is not specified, the last day is used.
Parameters:date (str) – date
Returns:a date object representing the date.
Return type:datetime.datetime
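
The two rules listed for the to date can be sketched as follows. The accepted formats (%m/%d/%Y and %m/%Y) are assumptions chosen to keep the example short; the function name is illustrative.

```python
import calendar
from datetime import datetime


def to_date_sketch(date):
    """Parse an end date; unparsable input yields datetime.max."""
    for fmt, has_day in (("%m/%d/%Y", True), ("%m/%Y", False)):
        try:
            parsed = datetime.strptime(date, fmt)
        except ValueError:
            continue
        if not has_day:
            # No day given: use the last day of that month.
            last_day = calendar.monthrange(parsed.year, parsed.month)[1]
            parsed = parsed.replace(day=last_day)
        return parsed
    return datetime.max


print(to_date_sketch("02/2019"))
```

A from date would mirror this logic, using the first day of the month and datetime.datetime.min as the fallback.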

scrapd.core.formatter module

Define the formatter module.

This module contains all the classes with the ability to print the results. Their destination depends on the custom formatter used to print the results and can be stdout, stderr, a file or even a remote storage if the formatter allows it.

class scrapd.core.formatter.CSVFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the CSV formatter.

Displays the results as a CSV.

printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
class scrapd.core.formatter.CountFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Count formatter.

Simply displays the number of results matching the search criteria.

printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
class scrapd.core.formatter.Formatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: object

Define the Formatter base class.

The default printer method simply uses the print() function.

formatters = {'count': <class 'scrapd.core.formatter.CountFormatter'>, 'csv': <class 'scrapd.core.formatter.CSVFormatter'>, 'gsheets': <class 'scrapd.core.formatter.GSheetFormatter'>, 'json': <class 'scrapd.core.formatter.JSONFormatter'>, 'python': <class 'scrapd.core.formatter.PythonFormatter'>}
print(results, **kwargs)[source]

Print the results with the appropriate formatter.

Parameters:results (list(dict)) – the results to display.
printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
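
The relationship between print(), printer() and the formatters mapping can be pictured with a simplified stand-in. The classes below are hypothetical and only mimic the documented shape (a name-to-class registry and an overridable printer method); the library may register and dispatch formatters differently.

```python
import sys


class FormatterSketch:
    """Base class: look up a formatter by name and delegate to its printer."""

    formatters = {}

    def __init__(self, format_="json", output=None):
        self.format_ = format_
        self.output = output if output is not None else sys.stdout

    def print(self, results, **kwargs):
        # Unknown format names fall back to the base printer in this sketch.
        formatter_cls = self.formatters.get(self.format_, FormatterSketch)
        formatter_cls(self.format_, self.output).printer(results, **kwargs)

    def printer(self, results, **kwargs):
        print(results, file=self.output)


class CountFormatterSketch(FormatterSketch):
    def printer(self, results, **kwargs):
        print(len(results), file=self.output)


FormatterSketch.formatters["count"] = CountFormatterSketch

FormatterSketch("count").print([{"Case": "19-0000001"}, {"Case": "19-0000002"}])
```
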
class scrapd.core.formatter.GSheetFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the GSheet formatter.

Stores the results into a Google Sheets document.

printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
class scrapd.core.formatter.JSONFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the JSON formatter.

Displays the results as JSON. The keys are sorted and an indentation of 2 spaces is set.

printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
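
For reference, sorted keys and a 2-space indentation correspond to the following standard library call. This standalone function is a sketch, not the class's actual printer method.

```python
import json
import sys


def print_json_sketch(results, output=sys.stdout):
    """Write results as JSON with sorted keys and a 2-space indent."""
    print(json.dumps(results, sort_keys=True, indent=2), file=output)


print_json_sketch([{"Date": "01/15/2019", "Age": "30"}])
```
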
class scrapd.core.formatter.PythonFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Python formatter.

Displays the results using PrettyPrinter with an indentation of 2 spaces.

printer(results, **kwargs)[source]

Define the printer method.

Parameters:results (list(dict)) – the results to display.
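
The equivalent standard library call looks like the sketch below; the standalone function is illustrative and not the class's actual printer method.

```python
import pprint
import sys


def print_python_sketch(results, output=sys.stdout):
    """Pretty-print results as Python literals with a 2-space indent."""
    pprint.PrettyPrinter(indent=2, stream=output).pprint(results)


print_python_sketch([{"Date": "01/15/2019", "Age": "30"}])
```
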

scrapd.core.version module

Define a set of utility functions for managing versions.

scrapd.core.version.detect_from_metadata(package)[source]

Detect a package version number from the metadata.

If the version number cannot be detected, the function returns 0.

Parameters:package (str) – package name
Returns:the package version number.
Return type:str
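
A comparable lookup can be written with importlib.metadata from the standard library (Python 3.8+). Whether scrapd uses this API is not stated here, and returning the string "0" rather than the integer 0 is an assumption based on the documented return type.

```python
from importlib.metadata import PackageNotFoundError, version


def detect_from_metadata_sketch(package):
    """Return the installed version of package, or "0" if it is unknown."""
    try:
        return version(package)
    except PackageNotFoundError:
        # Assumption: the fallback is returned as a string.
        return "0"


print(detect_from_metadata_sketch("surely-not-an-installed-package-xyz"))
```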