
Creating your own machine-readable dataset

The documents that you want to analyze are often not available as a ready-made machine-readable dataset, but only as a collection of human-readable documents (such as PDF, doc/docx, or odt) or web content (HTML or possibly other formats). In that case, you will probably need to first collect the documents and then transform them into a machine-readable dataset before you can effectively use NLP tools for your research.

Collecting documents

There are generally two approaches to collecting the documents that you want to put in your dataset: requesting them or scraping website(s). The first option, requesting the documents from the administrator, is generally the preferred method (for both you and the administrator). If an API exists for the information you need, you can retrieve the documents through that API. If there is no API or data dump available, reach out to the administrator to request the documents you are looking for. Depending on the administrator, there may be legislation that gives you an enforceable right to access this information. Specifically for the Dutch setting, we have general guidelines for finding an appropriate dataset and leveraging relevant freedom of information and open data legislation.
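
For example, if the administrator offers a JSON API, collecting the documents can be a matter of paging through its results. The sketch below is a minimal illustration in Python, assuming a hypothetical endpoint and hypothetical response fields ("results", "id", "title"); a real API will have its own URL, parameters, and possibly authentication.

    # Minimal sketch of collecting documents through an API. The endpoint,
    # query parameters, and response fields are hypothetical placeholders.
    import requests

    BASE_URL = "https://api.example.org/documents"  # placeholder endpoint

    def fetch_documents(query, max_pages=5):
        """Yield document records from the (hypothetical) API, page by page."""
        for page in range(1, max_pages + 1):
            response = requests.get(BASE_URL, params={"q": query, "page": page}, timeout=30)
            response.raise_for_status()
            results = response.json().get("results", [])
            if not results:
                break  # no more pages
            yield from results

    for doc in fetch_documents("environmental permits"):
        print(doc.get("id"), doc.get("title"))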

When it is for whatever reason not possible to request the documents from the administrator and the documents are available on a public website, you can also scrape that website to collect them. This is generally not the preferred method, as you can never be completely sure that the scraped dataset is actually complete and consistent. Furthermore, the administrator of the website in question might not be happy about a large number of automated requests to their website. There can also be technical constraints (it may take a long time to scrape all required documents) as well as legal constraints. Scraping your data generally requires tailor-made software for your specific case, but software libraries exist to make things easier (such as the popular Beautiful Soup 4 for Python); a minimal example follows below. In the data-collection repository of WetSuite you can find some examples of crawlers for Dutch legal documents.
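
As a rough illustration of what such tailor-made scraping code can look like, the sketch below downloads all PDF files linked from a single index page, using requests and Beautiful Soup 4. The URL and the link selector are placeholders and will differ for every website; a real crawler would also need to respect robots.txt, handle pagination, and deal with errors.

    # Minimal scraping sketch with requests and Beautiful Soup 4.
    # The index URL and the CSS selector are placeholders.
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    INDEX_URL = "https://www.example.org/publications"  # placeholder

    response = requests.get(INDEX_URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect absolute URLs of the PDF documents linked from the index page.
    pdf_urls = [urljoin(INDEX_URL, a["href"]) for a in soup.select("a[href$='.pdf']")]

    for url in pdf_urls:
        filename = url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(requests.get(url, timeout=30).content)
        time.sleep(1)  # be polite: pause between requests so as not to overload the server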

Transforming a collection of documents into a machine-readable dataset

Once you have collected your documents, you will probably need to perform further processing before you can effectively apply NLP techniques. A pile of documents in formats such as PDF, Word (.doc, .docx), downloaded web content (HTML or possibly other formats), or other document formats (.odt) is hard to analyze automatically with software, because such document types are designed for human readers rather than for computers.

You will need to transform these documents into a machine-readable format. In this process, you might also choose to extract only the information from these documents that you are interested in. However, be cautious when selecting only parts of the data: once you have chosen to exclude information from your dataset, it is difficult to add it back without rebuilding the entire dataset.

Before you start, you need to choose an appropriate format for the documents in your dataset. A wide variety of machine-readable formats exist, but for now three are the most relevant: plaintext, JSON and XML. Plaintext files (.txt) are unstructured files containing only text (without formatting as in Word). JSON and XML are both structured file formats which allow for the easy inclusion of metadata and, in the case of XML, the structure of your documents. JSON or XML should be the preferred option if you want to enrich your collected documents with metadata. Besides these file formats, it is also possible to load your dataset into a (SQL) database such as DuckDB, SQLite or PostgreSQL. DuckDB in particular can be a very versatile tool, as it can import from and export to many different file formats, requires little effort to set up, and integrates well with Python and R for efficient interactive data analysis.
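
To make this a bit more concrete, the sketch below stores extracted documents as JSON records (one object per line) and then queries them with DuckDB. The field names (filename, date, text) are just an example schema, and the DuckDB call assumes a reasonably recent version with its built-in JSON support.

    # Minimal sketch: store documents with metadata as JSON Lines and query them with DuckDB.
    # The schema (filename, date, text) is only an example.
    import json

    import duckdb

    documents = [
        {"filename": "report_2021.pdf", "date": "2021-05-03", "text": "Full text of the first document..."},
        {"filename": "decision_17.pdf", "date": "2022-11-19", "text": "Full text of the second document..."},
    ]

    # One JSON object per line ("JSON Lines") keeps the dataset easy to append to and to stream.
    with open("dataset.jsonl", "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    # DuckDB can query the JSON Lines file directly.
    con = duckdb.connect()
    print(con.execute("SELECT filename, date FROM read_json_auto('dataset.jsonl')").fetchall())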

There is no single straightforward way to perform this conversion from human-readable to machine-readable documents, as it depends entirely on the structure of your documents and which information you want to extract from them. The following questions might help guide what exactly you need to build:

  1. What information do you want to extract from the documents? Identify the categories of information present in (all of) your documents that you want to keep. For example, you might want the original filename and the date mentioned in the document as metadata, and all the text within the document as the data. There might also be some information that you want to ignore, such as headers or footers.
  2. Are the source documents readily readable files (generally speaking: PDFs with text layers, docx/doc/odt, or scraped webpages), or are they images or scanned documents in PDFs without text layers? In the latter case, you will need to perform Optical Character Recognition (OCR) to extract the text from your files. While good software libraries exist and this can yield good results, it is not straightforward, and there can be quite a few caveats specific to your collection.
  3. What software libraries exist for the source file format that work with your programming language of choice? A brief sketch for PDFs with a text layer follows below.
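
As a starting point for the common case of PDFs that do have a text layer, the sketch below extracts the text of every PDF in a folder with the pypdf library (one of several options) and writes the result as JSON Lines. The folder name and record fields are placeholders; scanned PDFs without a text layer would need OCR (for example with Tesseract) instead, which is not shown here.

    # Minimal sketch: convert PDFs with a text layer into JSON Lines using pypdf.
    # Folder name and field names are placeholders; OCR for scanned PDFs is not covered.
    import json
    from pathlib import Path

    from pypdf import PdfReader

    with open("dataset.jsonl", "w", encoding="utf-8") as out:
        for pdf_path in Path("collected_pdfs").glob("*.pdf"):
            reader = PdfReader(pdf_path)
            # extract_text() may return None for pages without extractable text.
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            record = {"filename": pdf_path.name, "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")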

[TODO link to relevant notebook]

Last updated: 21-Nov-2024