Useful tools and resources
This page describes tools and resources that may be useful when applying NLP-based, quantitative, or computational research methods to governmental data.
NLP toolkits
An NLP toolkit provides a collection of NLP functions that you can use to build your NLP-based experiments or software tools.
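As a minimal sketch of what such a toolkit offers, the example below uses spaCy; spaCy is just one example of an NLP toolkit and is not specifically endorsed by this page. It assumes spaCy and its small English model are installed (`pip install spacy` and `python -m spacy download en_core_web_sm`).

```python
# Sketch: tokenization, POS tagging, and named-entity recognition with
# spaCy, one example of an NLP toolkit (assumption: spaCy and the
# en_core_web_sm model are installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The ministry published the policy document on 12 March 2024.")

# The pipeline bundles several NLP functions in a single call.
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities (dates, organizations, etc.) are available on doc.ents.
for ent in doc.ents:
    print(ent.text, ent.label_)
```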
Creating & analyzing datasets
- DuckDB is a useful tool for getting to know your dataset and running complex analytical queries (see the sketch after this list). Also check out awesome-duckdb, a curated list of DuckDB-related resources and tools.
- Aleph is a tool primarily developed for (investigative) journalism, but it may also prove useful for an initial exploration of a large collection of PDF, Word, and other documents.
- OpenRefine can help you refine and clean up your dataset.
- LawNotation is an open-source annotation platform for analyzing the linguistic and legal characteristics of legal documents, built on the open-source Label Studio platform.
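The following is a minimal sketch of exploring a dataset with DuckDB from Python (assumptions: `pip install duckdb`, and a hypothetical file `decisions.csv` with columns `municipality` and `date`; adapt the query to your own data).

```python
# Sketch: exploring a CSV dataset with DuckDB (file name and columns
# are hypothetical examples).
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB can query CSV/Parquet files directly, without a load step.
result = con.execute("""
    SELECT municipality, COUNT(*) AS n_documents
    FROM 'decisions.csv'
    GROUP BY municipality
    ORDER BY n_documents DESC
    LIMIT 10
""").fetchall()

for municipality, n in result:
    print(municipality, n)
```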
(Pre)processing your data
- Tesseract is open source software for performing optical character recognition (OCR), which is useful when you need to extract text from scanned documents.
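A minimal sketch of using Tesseract from Python through the pytesseract wrapper follows (assumptions: the Tesseract binary is installed on your system, plus `pip install pytesseract pillow`; `scan.png` is a hypothetical scanned page).

```python
# Sketch: OCR on a scanned page via pytesseract (assumes the tesseract
# binary and the relevant language data are installed).
from PIL import Image
import pytesseract

image = Image.open("scan.png")  # hypothetical scanned document

# The lang parameter selects the trained language data, e.g. "nld" for
# Dutch, if that language pack is installed.
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```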
Working with APIs
- When working with a new API, it can be useful to experiment interactively with requests and responses before writing any code. API development tools such as Hoppscotch are very valuable for this; a scripted request (sketched below) can then make your exploration reproducible.
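Below is a minimal sketch of turning an interactively explored request into a script, using the `requests` library (assumptions: `pip install requests`; the endpoint and parameters are hypothetical placeholders, not a real government API).

```python
# Sketch: a reproducible API request (URL and parameters are
# hypothetical placeholders).
import requests

response = requests.get(
    "https://api.example.org/v1/documents",
    params={"query": "zoning", "page": 1},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors

# Assumes the (hypothetical) API returns JSON with a "results" list.
for item in response.json().get("results", []):
    print(item)
```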
Creating search engines
Sometimes it can be useful to create a custom search engine for your data. Helpful tools and resources for this include:
- awesome-selfhosted’s list of search engines
- openbesluitvorming.nl, as an example of such a search engine: a search portal by the Open State Foundation for searching through meeting documents of local democratic assemblies.
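For small datasets, a full search platform may be overkill. The sketch below builds a tiny full-text index with SQLite's FTS5 extension, which ships with most modern Python builds (assumptions: your Python's bundled SQLite was compiled with FTS5; the documents are hypothetical examples).

```python
# Sketch: a minimal full-text search index with SQLite FTS5
# (document contents are hypothetical examples).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Council minutes March", "The council discussed the zoning plan."),
        ("Budget memo", "Spending on public housing will increase."),
    ],
)

# MATCH runs a ranked full-text query over the indexed columns.
for title, body in con.execute(
    "SELECT title, body FROM docs WHERE docs MATCH ? ORDER BY rank",
    ("zoning",),
):
    print(title, "->", body)
```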
Last updated: 29-Apr-2025