scrapd.core package

scrapd.core.apd module

Define the module containing the functions used to scrape data from the APD website.

scrapd.core.apd.async_retrieve(pages=-1, from_=None, to=None, attempts=1, backoff=1)[source]

Retrieve fatality data.

Parameters
  • pages (int) – number of pages to retrieve, or -1 for all

  • from_ (str) – the start date

  • to (str) – the end date

Returns

the list of fatalities and the number of pages that were read.

Return type

tuple
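
A minimal usage sketch (assuming async_retrieve() is a coroutine that can be driven with asyncio.run(), and that from_/to accept human-readable date strings; the dates below are hypothetical):

    import asyncio

    from scrapd.core import apd

    # Retrieve a single news page worth of fatalities for a hypothetical date range.
    fatalities, page_count = asyncio.run(
        apd.async_retrieve(pages=1, from_="Jan 1 2019", to="Jan 31 2019"))
    print(f"{len(fatalities)} fatalities across {page_count} page(s)")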

scrapd.core.apd.common_fatality_parsing(d)[source]

Perform parsing common to Twitter descriptions and page content.

Ensures that the values are all strings and removes the ‘Deceased’ field, which no longer contains relevant information.

Parameters

d (dict) – the fatality to finish parsing

Returns

A dictionary containing the detailed information about the fatality, with sanitized entries.

Return type

dict, list

Search for the DOB in a deceased field.

Parameters

split_deceased_field (list) – a list representing the deceased field

Returns

the DOB index within the split deceased field.

Return type

int

Extract the fatality detail page links from the news page.

Parameters

news_page (str) – HTML content of the news page

Returns

a list of links.

Return type

list or None

scrapd.core.apd.extract_twitter_description_meta(page)[source]

Extract the Twitter description from the metadata fields.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the Twitter description.

Return type

str
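
A hedged sketch, assuming the Twitter description is carried by a <meta name="twitter:description"> tag in the detail page HTML (the page content below is fabricated for illustration):

    from scrapd.core import apd

    # Minimal fabricated detail page containing only the relevant meta tag.
    page = (
        '<html><head>'
        '<meta name="twitter:description" content="Case: 19-0400694 Date: February 9, 2019" />'
        '</head></html>'
    )
    print(apd.extract_twitter_description_meta(page))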

scrapd.core.apd.extract_twitter_tittle_meta(page)[source]

Extract the Twitter title from the metadata fields.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the Twitter title.

Return type

str

scrapd.core.apd.fetch_and_parse(session, url)[source]

Parse a fatality page from a URL.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict
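
A usage sketch combining an aiohttp session with fetch_and_parse(); the detail page URL is hypothetical, as real URLs are produced by generate_detail_page_urls():

    import asyncio

    import aiohttp

    from scrapd.core import apd

    async def main():
        # Hypothetical detail page URL.
        url = 'http://austintexas.gov/news/traffic-fatality-1-3'
        async with aiohttp.ClientSession() as session:
            fatality = await apd.fetch_and_parse(session, url)
        print(fatality)

    asyncio.run(main())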

scrapd.core.apd.fetch_detail_page(session, url)[source]

Fetch the content of a detail page.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

Returns

the page content.

Return type

str

scrapd.core.apd.fetch_news_page(session, page=1)[source]

Fetch the content of a specific news page from the APD website.

The page number starts at 1.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • page (int) – page number to fetch, defaults to 1

Returns

the page content.

Return type

str

scrapd.core.apd.fetch_text(session, url, params=None)[source]

Fetch the data from a URL as text.

Parameters
  • session (aiohttp.ClientSession) – aiohttp session

  • url (str) – request URL

  • params (dict) – request parameters, defaults to None

Returns

the data from a URL as text.

Return type

str

scrapd.core.apd.generate_detail_page_urls(titles)[source]

Generate the full URLs of the fatality detail pages.

Parameters

titles (list) – a list of partial links

Returns

a list of full links to the fatality detail pages.

Return type

list

scrapd.core.apd.has_next(news_page)[source]

Return True if there is another news page available.

Parameters

news_page (str) – the news page to parse

Returns

True if there is another news page available, False otherwise.

Return type

bool

scrapd.core.apd.match_pattern(text, pattern, group_number=0)[source]

Match a pattern.

Parameters
  • text (str) – the text to match the pattern against

  • pattern (compiled regex) – the pattern to look for

  • group_number (int) – the capturing group number

Returns

a string representing the captured group.

Return type

str
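
A sketch assuming match_pattern() runs the compiled pattern against the text and returns the requested capturing group as a string:

    import re

    from scrapd.core import apd

    # Hypothetical pattern extracting a case number from a snippet of text.
    case_pattern = re.compile(r'Case:\s*(\d{2}-\d+)')
    print(apd.match_pattern('Case: 19-0400694', case_pattern, group_number=1))  # expected: 19-0400694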

scrapd.core.apd.notes_from_element(deceased, deceased_field_str)[source]

Get Notes from deceased field’s BeautifulSoup element.

Parameters
  • deceased (bs4.element.Tag) – the first <p> tag of the Deceased field of the APD bulletin

  • deceased_field_str – the string corresponding to the Deceased field

Returns

notes from the Deceased field of the APD bulletin

Return type

str

scrapd.core.apd.parse_age_deceased_field(deceased_field)[source]

Parse deceased field assuming it contains an age.

Parameters

deceased_field (str) – the deceased field

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_case_field(page)[source]

Extract the case number from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the case number.

Return type

str

scrapd.core.apd.parse_comma_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with commas.

Parameters

deceased_field (str) – the deceased field as a string

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_crashes_field(page)[source]

Extract the crash number from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the crash number.

Return type

str

scrapd.core.apd.parse_date_field(page)[source]

Extract the date from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the date.

Return type

str

scrapd.core.apd.parse_deceased_field(soup)[source]

Extract content from deceased field on the fatality page.

Parameters

soup (bs4.BeautifulSoup) – the content of the fatality page

Returns

a tuple containing the tag for the Deceased paragraph and the Deceased field as a string

Return type

tuple

scrapd.core.apd.parse_deceased_field_common(split_deceased_field, fleg)[source]

Parse the deceased field.

Parameters
  • split_deceased_field (list) – a list representing the deceased field

  • fleg (dict) – a dictionary containing First, Last, Ethnicity, Gender fields

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_details_page_notes(details_page_notes)[source]

Clean up a details page notes section.

This function attempts to extract the sentences about the crash with some level of fidelity, but it does not always return perfectly parsed sentences, as the HTML syntax varies widely.

Parameters

details_page_notes (str) – the paragraph after the Deceased information

Returns

A paragraph containing the details of the fatality in sentence form.

Return type

str

scrapd.core.apd.parse_fleg(fleg)[source]

Parse FLEG. FLEG stands for First, Last, Ethnicity, Gender.

Parameters

fleg (list) – values representing the fleg.

Returns

a dictionary containing First, Last, Ethnicity, Gender fields

Return type

dict

scrapd.core.apd.parse_location_field(page)[source]

Extract the location information from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

scrapd.core.apd.parse_name(name)[source]

Parse the victim’s name.

Parameters

name (list) – a list representing the deceased person’s full name split on space characters

Returns

a dictionary representing just the victim’s first and last name

Return type

dict
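
A sketch with hypothetical name tokens, assuming the first and last tokens map to the ‘First Name’ and ‘Last Name’ fields:

    from scrapd.core import apd

    # The full name is passed as a list of tokens split on spaces.
    print(apd.parse_name(['Garrett', 'Evan', 'Davis']))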

scrapd.core.apd.parse_page(page, url)[source]

Parse the page using all parsing methods available.

Parameters
  • page (str) – the content of the fatality page

  • url (str) – detail page URL

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_page_content(detail_page, notes_parsed=False)[source]

Parse the detail page to extract fatality information.

Parameters

detail_page (str) – the content of the fatality page

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_pipe_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with pipes.

Parameters

deceased_field (str) – the deceased field as a string.

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_space_delimited_deceased_field(deceased_field)[source]

Parse deceased fields separated with spaces.

Parameters

deceased_field (str) – the deceased field as a string.

Returns

a dictionary representing the deceased field.

Return type

dict

scrapd.core.apd.parse_time_field(page)[source]

Extract the time from the content of the fatality page.

Parameters

page (str) – the content of the fatality page

Returns

a string representing the time.

Return type

str

scrapd.core.apd.parse_twitter_description(twitter_description)[source]

Parse the Twitter description metadata.

The Twitter description contains all the information that we need, and even though it is still unstructured data, it is easier to parse than the data from the detail page.

Parameters

twitter_description (str) – Twitter description embedded in the fatality details page

Returns

A dictionary containing the detailed information about the fatality.

Return type

dict

scrapd.core.apd.parse_twitter_fields(page)[source]

Parse the Twitter fields on a detail page.

Parameters

page (str) – the content of the fatality page

Returns

a dictionary representing a fatality.

Return type

dict

scrapd.core.apd.parse_twitter_title(twitter_title)[source]

Parse the Twitter title metadata.

Parameters

twitter_title (str) – Twitter title embedded in the fatality details page

Returns

A dictionary containing the ‘Fatal crashes this year’ field.

Return type

dict

scrapd.core.apd.process_deceased_field(deceased_field)[source]

Parse the deceased field.

At this point the deceased field, if it exists, is garbage as it contains First Name, Last Name, Ethnicity, Gender, D.O.B. and Notes. We need to explode this data into the appropriate fields.

Parameters

deceased_field (str) – the deceased field from the fatality report

Returns

a dictionary representing a deceased field.

Return type

dict
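
A hedged sketch with a fabricated deceased field; APD bulletins use several layouts (comma, pipe, or space delimited), and process_deceased_field() is assumed to dispatch to the matching parser:

    from scrapd.core import apd

    # Fabricated comma-delimited deceased field.
    deceased = 'Garrett Davis, White male, DOB 10/30/1987'
    entry = apd.process_deceased_field(deceased)
    print(entry)  # the keys are the constants from scrapd.core.constant.Fields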

scrapd.core.apd.sanitize_fatality_entity(d)[source]

Clean up a fatality entity.

Returns

A dictionary containing the detailed information about the fatality, with sanitized entries.

Return type

dict

scrapd.core.constant module

Define the scrapd constants.

class scrapd.core.constant.Fields[source]

Bases: object

Define the resource constants.

AGE = 'Age'
CASE = 'Case'
CRASHES = 'Fatal crashes this year'
DATE = 'Date'
DECEASED = 'Deceased'
DOB = 'DOB'
ETHNICITY = 'Ethnicity'
FIRST_NAME = 'First Name'
GENDER = 'Gender'
LAST_NAME = 'Last Name'
LOCATION = 'Location'
NOTES = 'Notes'
TIME = 'Time'
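
The constants are plain strings and serve as the keys of a parsed fatality entry, for example (values are fabricated):

    from scrapd.core.constant import Fields

    # Build a partial entry keyed by the resource constants.
    entry = {Fields.CASE: '19-0400694', Fields.DATE: '02/09/2019'}
    print(entry[Fields.CASE])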

scrapd.core.date_utils module

Define a module to manipulate dates.

scrapd.core.date_utils.check_dob(dob)[source]

If the year in a date only contains 2 digits, determine the century.

Parameters

dob (datetime.date) – DOB

Returns

DOB with 19xx or 20xx as appropriate

Return type

datetime.date

scrapd.core.date_utils.compute_age(date, dob)[source]

Compute a victim’s age.

Parameters
  • date (datetime.date) – crash date

  • dob (datetime.date) – date of birth

Returns

the victim’s age.

Return type

int
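
A sketch assuming both arguments are datetime.date objects, as documented above (the dates are fabricated):

    import datetime

    from scrapd.core import date_utils

    crash_date = datetime.date(2019, 2, 9)
    dob = datetime.date(1987, 10, 30)
    print(date_utils.compute_age(crash_date, dob))  # expected: 31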

scrapd.core.date_utils.from_date(date)[source]

Parse the date from a human readable format, with options for the from date.

  • If the date cannot be parsed, datetime.date.min is returned.

  • If the day of the month is not specified, the first day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date

scrapd.core.date_utils.is_before(d1, d2)[source]

Return True if d1 is strictly before d2.

Parameters
  • d1 (datetime.date) – date 1

  • d2 (datetime.date) – date 2

Returns

True if d1 is strictly before d2, False otherwise.

Return type

bool

scrapd.core.date_utils.is_between(date, from_=None, to=None)[source]

Check whether a date falls between two other dates.

Parameters
  • date (datetime.date) – date to check

  • from_ (datetime.date) – start date, defaults to None

  • to (datetime.date) – end date, defaults to None

Returns

True if the date is between from_ and to

Return type

bool
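
A sketch checking whether a fabricated crash date falls within a reporting period:

    import datetime

    from scrapd.core import date_utils

    d = datetime.date(2019, 2, 9)
    print(date_utils.is_between(d,
                                from_=datetime.date(2019, 1, 1),
                                to=datetime.date(2019, 12, 31)))  # expected: True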

scrapd.core.date_utils.parse_date(date, default=None, settings=None)[source]

Parse the date from a human readable format.

If no default value is specified and there is an error, an exception is raised. Otherwise the default value is returned.

Parameters
  • date (str) – date

  • default – the value to return if the date cannot be parsed, defaults to None

  • settings – additional settings passed to the underlying date parser, defaults to None

Returns

a date object representing the date.

Return type

datetime.date

scrapd.core.date_utils.to_date(date)[source]

Parse the date from a human readable format, with options for the to date.

  • If the date cannot be parsed, datetime.date.max is returned.

  • If the day of the month is not specified, the last day is used.

Parameters

date (str) – date

Returns

a date object representing the date.

Return type

datetime.date
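
A sketch of the documented behaviors of from_date() and to_date(); the expected values assume a month-only input resolves to the first and last day of that month respectively:

    from scrapd.core import date_utils

    print(date_utils.from_date('Jan 2019'))    # expected: 2019-01-01
    print(date_utils.to_date('Jan 2019'))      # expected: 2019-01-31
    print(date_utils.from_date('not a date'))  # expected: datetime.date.min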

scrapd.core.formatter module

Define the formatter module.

This module contains all the classes with the ability to print the results. Their destination depends on the custom formatter used to print the results and can be stdout, stderr, a file, or even remote storage if the formatter allows it.

class scrapd.core.formatter.CSVFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the CSV formatter.

Displays the results as a CSV.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

class scrapd.core.formatter.CountFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Count formatter.

Simply displays the number of results matching the search criteria.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

class scrapd.core.formatter.Formatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: object

Define the Formatter base class.

The default printer method simply uses the print() function.

date_serialize(obj)[source]

Convert date objects to string for serialization.

Return type

str

formatters = {'count': <class 'scrapd.core.formatter.CountFormatter'>, 'csv': <class 'scrapd.core.formatter.CSVFormatter'>, 'json': <class 'scrapd.core.formatter.JSONFormatter'>, 'python': <class 'scrapd.core.formatter.PythonFormatter'>}
print(results, **kwargs)[source]

Print the results with the appropriate formatter.

Parameters

results (list(dict)) – the results to display.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

to_json_string(results)[source]

Convert dict of parsed fields to JSON string.

Parameters

results (dict) – the results of scraping the APD news site

Return type

str

class scrapd.core.formatter.JSONFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the JSON formatter.

Displays the results as JSON. The keys are sorted and an indentation of 2 spaces is set.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.
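
A hedged sketch printing a single fabricated entry to stdout as sorted, 2-space-indented JSON:

    import sys

    from scrapd.core import formatter

    fmt = formatter.JSONFormatter(format_='json', output=sys.stdout)
    fmt.printer([{'Case': '19-0400694', 'Date': '02/09/2019'}])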

class scrapd.core.formatter.PythonFormatter(format_='json', output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)[source]

Bases: scrapd.core.formatter.Formatter

Define the Python formatter.

Displays the results using PrettyPrinter with an indentation of 2 spaces.

printer(results, **kwargs)[source]

Define the printer method.

Parameters

results (list(dict)) – the results to display.

scrapd.core.version module

Define a set of utility functions for managing versions.

scrapd.core.version.detect_from_metadata(package)[source]

Detect a package version number from the metadata.

If the version number cannot be detected, the function returns 0.

Parameters

package (str) – package name

Returns

the package version number.

Return type

str
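
A minimal sketch, assuming the package is installed and its metadata is available:

    from scrapd.core import version

    # Per the docstring above, 0 is returned if the version cannot be detected.
    print(version.detect_from_metadata('scrapd'))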