Annotating data

Data annotation is the process of manually adding metadata to your dataset. This covers a wide range of tasks, such as segmenting court decisions into parts like procedure, facts, and conclusion; classifying complete documents into categories; or labeling individual paragraphs to categorize their contents. Manual annotation can serve several purposes and may therefore be needed in different phases of your research, for example:

You want to train a machine learning model to automate your task. In that case, you will need to create data to train and evaluate your model.
You want to study a phenomenon in your dataset for which automating the collection of the required data is hard or infeasible within the scope of your research.
You want data to evaluate how well you automated the task, for example when you did this with a rule-based approach or with an LLM.
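
The last purpose, evaluating an automated approach against manual annotations, can be made concrete with a small comparison script. The following is a minimal sketch in Python, not a prescribed method: the function name evaluate, the label names, and the example data are all hypothetical, and it assumes one label per item with the manual and automated labels listed in the same order.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Compare automated labels against manual (gold) annotations.

    Returns overall accuracy and per-label precision and recall.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("nothing to evaluate")

    correct = Counter()     # label was predicted and matches the manual label
    pred_count = Counter()  # how often each label was predicted
    gold_count = Counter()  # how often each label occurs in the manual annotations

    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            correct[g] += 1

    accuracy = sum(correct.values()) / len(gold)
    per_label = {}
    for label in sorted(set(gold_count) | set(pred_count)):
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        recall = correct[label] / gold_count[label] if gold_count[label] else 0.0
        per_label[label] = {"precision": precision, "recall": recall}
    return accuracy, per_label


# Hypothetical example: paragraph labels assigned by hand and by a rule-based script.
gold = ["facts", "facts", "procedure", "conclusion", "facts"]
predicted = ["facts", "procedure", "procedure", "conclusion", "facts"]
accuracy, per_label = evaluate(gold, predicted)
```

Per-label precision and recall are reported alongside accuracy because a single overall score can hide a label that the automated approach handles poorly.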

Labeling manual

Annotating data can be done in different ways, but the most important thing is to formulate clearly which data you will label and with which labels. Before you start annotating, it is best to write a manual for yourself (and possibly other annotators) that describes how you will perform the annotation task. You may need to alter the manual during annotation because you encounter cases you had not thought of before, but it is important to make explicit how you do your work. This helps you decide edge cases, and it is also essential for you and other researchers to interpret the annotated dataset appropriately. Your annotations are worthless if it is unknown how they should be interpreted. Furthermore, a labeling manual is essential when multiple annotators annotate the same dataset, to ensure that the data is annotated (relatively) consistently.
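
One way to keep annotations in line with the manual is to write the label set down in machine-readable form and check every annotation against it. The sketch below assumes a hypothetical label set with made-up descriptions and hypothetical paragraph identifiers; your own manual defines the real ones.

```python
# Hypothetical label set, as it might be defined in a labeling manual.
LABELS = {
    "procedure": "Describes the procedural history of the case.",
    "facts": "Describes the facts of the case.",
    "conclusion": "States the decision of the court.",
}


def find_invalid(annotations, labels=LABELS):
    """Return (item_id, label) pairs whose label is not defined in the manual."""
    return [(item_id, label) for item_id, label in annotations if label not in labels]


# Hypothetical annotations; "fact" is a typo for the label "facts".
annotations = [("par-1", "procedure"), ("par-2", "fact"), ("par-3", "conclusion")]
invalid = find_invalid(annotations)
```

A check like this only catches labels that do not exist in the manual; it cannot tell whether an existing label was applied correctly.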

Approaches to data annotation

Once you have decided which labels you want to give to your data, you must decide on a technical approach for labeling it. Here, too, there is a wide variety of options, ranging from a simple spreadsheet file to fully integrated annotation suites such as Lawnotation or Label Studio. Each option has its pros and cons, such as the ease of getting started (you might already be familiar with spreadsheet software such as Excel, but not yet with Label Studio), ease of use (annotation suites often let you interact directly with the dataset and support random sampling and collaboration between multiple annotators), and output format.
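
If you choose the spreadsheet route, you can do the random sampling yourself before you start annotating. The following Python sketch uses only the standard library; the function name write_annotation_sheet, the column names, and the assumption that your documents are held in a dictionary mapping document IDs to text are all illustrative, not part of any particular tool.

```python
import csv
import random


def write_annotation_sheet(documents, path, sample_size, seed=0):
    """Write a random sample of documents to a CSV file with an empty label column.

    documents maps a document ID to its text. Returns the sampled IDs.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    items = sorted(documents.items())  # stable order before sampling
    sample = rng.sample(items, min(sample_size, len(items)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["doc_id", "text", "label"])
        for doc_id, text in sample:
            writer.writerow([doc_id, text, ""])  # label is filled in by the annotator
    return [doc_id for doc_id, _ in sample]
```

Calling write_annotation_sheet(documents, "annotation_sample.csv", 100) would produce a file that opens in spreadsheet software such as Excel, with one row per sampled document and an empty label column to fill in.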

Last updated: 30-Jun-2025