Instructions for contributors

We welcome many kinds of improvements to ScrAPD. If you would like to work on something major, please raise an Issue first to ensure efficient collaboration and avoid duplicate effort.


  • Use the provided templates to file an Issue or a Pull Request.

  • Create a topic branch from where you want to base your work.

  • We follow the OpenStack Coding Guidelines.

  • To format the files properly, please use YAPF. In the root directory of the project, run the following command: nox -s format.

  • Make sure you have added tests to validate your changes.

  • Run all the tests to ensure nothing else was accidentally broken.

  • Commit messages must start with a capitalized and short summary (max. 50 chars) written in the imperative, followed by an optional, more detailed explanatory text which is separated from the summary by an empty line.

  • Commit messages should follow best practices: explain the context of the problem and how it was solved, including any caveats or follow-up changes required. They should tell the story of the change and give readers an understanding of what led to it. Please refer to How to Write a Git Commit Message for more details.

  • If your Pull Request is a work in progress, create it as a Draft Pull Request.

  • Any Pull Request inactive for 28 days will be automatically closed. If you need more time to work on it, ask the maintainers to add the appropriate label to it. Use the @scrapd/scrapper mention in the comments.

  • Unless explicitly asked, Pull Requests that don’t pass all the CI checks will not be reviewed. Use the @scrapd/scrapper mention in the comments to ask the maintainers to help you.

Commit example

Use Docker Hub build environment values

Uses the Docker Hub build environment values in order to ensure the
correct version of ScrAPD is installed into the image.

Fixes scrapd/scrapd#54

The commit above is a good example because:

  1. The first line is a short description and starts with an imperative verb.

  2. The first paragraph describes why this commit may be useful.

  3. The last line points to an existing issue and will automatically close it.
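The mechanical parts of the commit message rules can be checked automatically. The following is a minimal sketch, not a tool shipped with ScrAPD: the function name check_commit_message is hypothetical, and it only covers the summary length, the capitalization, and the empty separator line. Whether the summary is written in the imperative still needs a human reviewer.

```python
from typing import List


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing summary line"]

    problems = []
    summary = lines[0]
    if len(summary) > 50:
        problems.append("summary is longer than 50 characters")
    if not summary[0].isupper():
        problems.append("summary does not start with a capital letter")
    # When a body is present, an empty line must separate it from the summary.
    if len(lines) > 1 and lines[1].strip():
        problems.append("summary and body are not separated by an empty line")
    return problems


example = """Use Docker Hub build environment values

Uses the Docker Hub build environment values in order to ensure the
correct version of ScrAPD is installed into the image.

Fixes scrapd/scrapd#54
"""
print(check_commit_message(example))  # prints []
```

A check like this could run from a local Git hook, but that wiring is left out here because the project does not describe one.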

Formatting your code

There are also many YAPF plugins available for different editors.

Developer setup

You will need Python 3, invoke and nox:

pip3 install nox invoke

Fork the project and clone the repository.

Set up a local dev environment:

source venv/bin/activate

Run the CI tasks locally:

inv nox -s ci

Use inv --list and inv nox to see all the available targets.

The nox tasks can be invoked by running either:

  • inv nox -s {task}, for instance inv nox -s test

  • or directly with nox -s test


How to test the regexes

cd tests/data
curl -sLO
export FIELD="Date:"
grep -h -e "${FIELD}" -C 1 traffic-fatality-* | pbcopy

Paste the result there, choose Python, and work on your regex.
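Once a pattern looks right, it can also be tried locally with Python's re module. The snippet below is only a sketch: the SAMPLE text is a made-up stand-in for the grep output, and DATE_PATTERN is an assumed pattern for illustration, not the regex ScrAPD actually uses.

```python
import re

# Hypothetical sample standing in for the grep output; real reports may differ.
SAMPLE = """\
Case:   18-0000000
Date:   January 16, 2019
Time:   10:14 p.m.
"""

# Assumed pattern for illustration only, not the regex used by ScrAPD.
DATE_PATTERN = re.compile(
    r"^Date:\s*(?P<date>\w+ \d{1,2}, \d{4})\s*$",
    re.MULTILINE,
)


def extract_date(text):
    """Return the raw date string following the "Date:" field, or None."""
    match = DATE_PATTERN.search(text)
    return match.group("date") if match else None


print(extract_date(SAMPLE))  # prints: January 16, 2019
```

Using a named group keeps the extracted value readable when the pattern grows, and re.MULTILINE lets ^ and $ anchor on each line of the pasted text.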


The profiling for the project is mostly done using pyinstrument.

You can use the nox task to run the profiler automatically:

inv profile

You can also generate a flame graph with py-spy. It requires root permissions, so it must be run with sudo and will prompt you for your password:

inv flame-graph


scrapd comes with a --dump option, which saves the HTML content of the reports being parsed if they contain at least one parsing error, either in the Twitter fields or in the article itself. The dumped files are stored in a .dump directory.


Run the dump-json task from the root of the project:

inv dump-json

In addition to the dumps, this will also create 2 files to help you debug:

  • a dump.json containing the parsed reports in JSON (useful to double check the values)

  • a dump.json.log containing the parsing errors and the names of the files triggering them
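To double check the values in dump.json quickly, a short script can count the reports and show how often each field is present. This is a rough sketch under one assumption that the text above does not confirm: that dump.json holds a JSON list of objects, one per parsed report. The function name summarize_dump is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_dump(path):
    """Count the reports in a dump file and how often each field is present."""
    # Assumption: the file holds a JSON list of objects, one per parsed report.
    reports = json.loads(Path(path).read_text(encoding="utf-8"))
    field_counts = Counter(key for report in reports for key in report)
    return len(reports), field_counts


if __name__ == "__main__" and Path("dump.json").exists():
    total, fields = summarize_dump("dump.json")
    print(f"{total} reports")
    for name, count in sorted(fields.items()):
        # A field present in fewer reports than the total hints at a parsing gap.
        print(f"{name}: {count}/{total}")
```

Fields that appear in fewer reports than the total are good candidates to cross-check against dump.json.log.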


You may encounter false positives. For instance, some reports do not contain Twitter fields, which triggers an error but is not something we can act on.

Locate the test named test_dumped_page in the tests/core/ file and update the test parameters with the name of the file you want to debug:

@pytest.mark.parametrize('page_dump', [
  pytest.param('traffic-fatality-1-2', id='dumped'),


You can specify as many files as you want by adding more pytest.param objects. This can be useful if you notice the same parsing error being reported in various files.

Finally, run pytest with the dump marker:

pytest -s -vvv -n0 -x -m dump