Preprocessing your data

Preprocessing is the general step of transforming your data so that it is easier to work with. The preprocessing you need therefore depends on your input data and your specific research context. It is not uncommon to need several different preprocessing steps in one analysis, and at the start of a project you will generally not know exactly which ones. Splitting your preprocessing into separate steps generally makes it more manageable and more reusable.

To illustrate what preprocessing is and why you need it, consider the following example. You have found an interesting collection of documents published under freedom of information laws. All the documents are scans in PDF format. You know that the documents contain references to other documents in the collection, and you want to visualize these relations in a graph. So you start with a collection of scanned PDF documents, and the end result you want is essentially a list of connections, each recording a reference from a source document to a target document.

The first preprocessing step is to create a machine-readable dataset of your document collection. You will want to extract the text from your PDF files into a structured, machine-readable format such as XML or JSON. Note that while tools such as Tesseract make it possible to use optical character recognition (OCR) to extract text from scanned documents, this process is rarely straightforward: the quality of the output depends heavily on the quality of the scans, and the extracted text often needs checking and correction.
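As a minimal sketch of what such a structured format could look like, the Python snippet below bundles the text of one document into a JSON record. It assumes the OCR step has already produced one string per page. The function name build_record, the field names and the sample text are hypothetical and only serve as illustration.

```python
import json


def build_record(doc_id, pages):
    """Bundle the per-page text of one document into a JSON-serializable dict."""
    return {
        "id": doc_id,
        "pages": [
            {"number": number, "text": text.strip()}
            for number, text in enumerate(pages, start=1)
        ],
    }


# Hypothetical OCR output for a two-page scan.
record = build_record("doc-001", ["First page text.\n", "Second page text.\n"])
print(json.dumps(record, indent=2))
```

Keeping the page numbers in the record makes it easier to go back to the original scan when the extracted text looks wrong.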

Once you have transformed your collection of scans into a machine-readable format, you will probably want to enrich your dataset by extracting metadata from the text. For example, you might extract the date, title and author of each document from its text and store them as separate metadata fields. With the dataset complete, you can use NLP tools to find the references between the documents, which gives you the inter-document links you need for your visualization.
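The following Python sketch illustrates both steps on a toy collection. It uses plain regular expressions rather than a full NLP toolkit, and it assumes that documents are identified by reference numbers such as "REF-2021-014" and that dates are written like "12 March 2021". Both conventions, as well as the function names, are hypothetical; a real collection will need its own patterns.

```python
import re
from datetime import datetime

# Hypothetical conventions: reference numbers look like "REF-2021-014"
# and dates are written with English month names, like "12 March 2021".
REF_PATTERN = re.compile(r"\bREF-\d{4}-\d{3}\b")
DATE_PATTERN = re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b")


def extract_date(text):
    """Return the first parseable date in the text as an ISO string, or None."""
    for match in DATE_PATTERN.finditer(text):
        try:
            parsed = datetime.strptime(match.group(), "%d %B %Y")
        except ValueError:
            continue  # looked like a date but was not one
        return parsed.date().isoformat()
    return None


def find_links(documents):
    """Return sorted (source, target) pairs for references within the collection."""
    links = set()
    for source, text in documents.items():
        for target in REF_PATTERN.findall(text):
            # Skip self-references and documents outside the collection.
            if target != source and target in documents:
                links.add((source, target))
    return sorted(links)


documents = {
    "REF-2021-014": "Memo of 12 March 2021. See also REF-2020-003 and REF-1999-001.",
    "REF-2020-003": "Report of 3 November 2020, as announced in REF-2021-014.",
}
dates = {doc_id: extract_date(text) for doc_id, text in documents.items()}
links = find_links(documents)
```

The resulting list of (source, target) pairs is the kind of edge list that most graph visualization tools accept as input.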

Another possible preprocessing step is splitting the text into segments. For example, you may already have a dataset of machine-readable court judgements but want to look at specific segments, in which case you will need to split the judgements accordingly.
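A minimal sketch of such a segmentation step in Python is shown below. It assumes that each segment starts with a heading on its own line and that the headings are known in advance. The headings FACTS, REASONING and DECISION and the sample judgement are hypothetical. If a heading occurs more than once, this sketch keeps only the last occurrence.

```python
import re

# Hypothetical section headings; real judgements will use different ones.
HEADINGS = ("FACTS", "REASONING", "DECISION")
HEADING_PATTERN = re.compile(
    r"^(" + "|".join(HEADINGS) + r")[ \t]*$", flags=re.MULTILINE
)


def split_segments(text):
    """Map each heading to the text that follows it, up to the next heading."""
    # With one capturing group, split() returns
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = HEADING_PATTERN.split(text)
    return {
        heading: body.strip()
        for heading, body in zip(parts[1::2], parts[2::2])
    }


judgement = """Case 123
FACTS
The applicant requested access to documents.
REASONING
The refusal was not sufficiently motivated.
DECISION
The refusal is annulled.
"""
segments = split_segments(judgement)
```

Text before the first heading, here the line "Case 123", is not part of any segment and is dropped.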

Last updated: 29-Apr-2025