Exploring your dataset

Before you can figure out what kind of research you can do with your dataset and design your experiments, you need to get an idea of both the content and the structure of your dataset, or of the documents that you want to use to create your machine-readable dataset. The purpose of this exploration is to answer questions such as: what kind of documents are in your collection? What metadata is available? What metadata could still be added? Are any documents missing? This exploration step is often part of an iterative process: after manually exploring or inspecting your dataset, you make some changes, and then go back to verify that the dataset is now in the state you want it to be.

There are many approaches to explore your dataset, and you are the best person to decide what fits your use case. This page aims to illustrate several possible options to help you choose your path.

Manual approach

A manual inspection of items in your dataset is generally a good starting point, and you have possibly already started with this before reading this page. A manual inspection helps you see what is and what is not in your dataset. However, be careful not to overgeneralize your manual findings. For example, the fact that the documents you inspected all have a certain piece of metadata defined does not mean that all documents in your dataset have that metadata available. The metadata recorded for documents often changes over time, so that newer documents have more metadata available than older documents in your dataset.

Assisted by software

There are also various ways in which software can assist you in exploring your dataset. If the data that you are researching is also available through a website, that website can be a perfect starting point for finding out how the dataset is structured. For example, official Dutch government publications can be searched via zoek.officielebekendmakingen.nl, but also downloaded programmatically via an API. For some datasets, specialized tools exist to interact with them, such as the tools described for Dutch parliamentary data in the Dataset catalogus. These tools can help you design and run your experiments, but also explore your dataset.
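As a quick illustration, the sketch below retrieves a small set of search results from such a publications API in Python. It assumes an SRU-style search service; the endpoint URL and the query term used here are assumptions for illustration only, so consult the API documentation of your source for the actual values.

```python
# Minimal sketch: fetching documents from an SRU-style search API.
# The endpoint URL and the query term below are assumptions for
# illustration; consult the official API documentation for real values.
import requests

SRU_ENDPOINT = "https://repository.overheid.nl/sru"  # assumed endpoint

params = {
    "operation": "searchRetrieve",  # standard SRU parameter
    "version": "2.0",
    "query": "kamerstukken",        # illustrative query term
    "maximumRecords": 10,           # keep the first exploration small
}

response = requests.get(SRU_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

# SRU services return XML; print the beginning to inspect the structure.
print(response.text[:2000])
```

Inspecting the raw response like this is often enough to discover which metadata fields the source actually provides, before you invest in writing a full download script.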

More generic tools to help you explore your dataset also exist. One example is Aleph, a browser-based tool that helps journalists and researchers gain insights into, and visualize relations within, large collections of unstructured data such as emails and PDF or Word documents. While Aleph's primary target audience is investigative journalists, it may also prove useful for legal researchers who have a similar collection of documents. Using Aleph requires some prior setup, however, so you might need technical assistance to use this tool.

If your dataset is already structured in a database, CSV file or spreadsheet, the analytical database system DuckDB can be a useful tool to explore it. DuckDB is a database system that you query using SQL. In-browser tools such as PondPilot, QuackDB, Sekuel (which includes a DuckDB SQL tutorial), and Duck-UI provide a visual interface for trying DuckDB out and help you visualize the results of your queries. As an added benefit, these tools require no installation, and some of them work completely locally in your browser, so your data never has to be uploaded anywhere.
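If you prefer a script over an in-browser tool, DuckDB can also be used from Python. The sketch below is a minimal example; the file name documents.csv and the column names (publication_year, document_type, author) are hypothetical and should be replaced with those of your own dataset.

```python
# Minimal sketch: exploring a CSV file with DuckDB's Python API.
# The file name and column names used here are hypothetical.
import duckdb

# DuckDB can query a CSV file directly, without loading it into a
# database first: count documents per year and type.
overview = duckdb.sql("""
    SELECT publication_year,
           document_type,
           COUNT(*) AS n_documents
    FROM 'documents.csv'
    GROUP BY publication_year, document_type
    ORDER BY publication_year
""")
print(overview)

# Check how many rows are missing a metadata field, e.g. 'author'.
missing = duckdb.sql("""
    SELECT COUNT(*) AS missing_author
    FROM 'documents.csv'
    WHERE author IS NULL
""")
print(missing)
```

Queries like the second one are a quick way to test the generalization caveat mentioned above: they tell you exactly how many documents lack a given metadata field, rather than relying on the handful you inspected manually.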

Using more advanced NLP techniques

One step further is to explore your dataset with more advanced NLP techniques, which can give you a bird's-eye view of the contents of your dataset. One possible approach is to use topic modelling to identify the themes and topics in your dataset. In topic modelling, an algorithm divides the documents in the dataset over a number of topics, with the number of topics being chosen by the researcher. It does so by identifying clusters or groups of similar words within the text.
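As a starting point, the sketch below runs Latent Dirichlet Allocation (LDA), one common topic-modelling algorithm, using scikit-learn. The example documents are placeholders; in practice you would load the texts from your own dataset and experiment with the number of topics and the vectorizer settings.

```python
# Minimal topic-modelling sketch with scikit-learn's LDA implementation.
# The documents list is a placeholder for texts from your own dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the court ruled on the appeal in the tax case",
    "parliament debated the new tax legislation",
    "the appeal court overturned the earlier ruling",
    "a debate in parliament on environmental legislation",
]

n_topics = 2  # the number of topics is chosen by the researcher

# Turn the texts into word-count vectors, dropping common stop words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit the LDA model on the word counts.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
lda.fit(counts)

# Print the five most characteristic words for each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```

Note that the algorithm only surfaces clusters of co-occurring words; interpreting what each topic means, and whether the chosen number of topics is sensible, remains the researcher's task.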

Last updated: 16-Jun-2025