Annotating data

Data annotation is the process of manually adding metadata to your dataset. This can include a wide range of tasks, such as segmenting court decisions into specific parts like procedure, facts, and conclusion; classifying complete documents into several categories; or labeling specific paragraphs in your document to categorize their contents. Manual annotation can serve multiple purposes, so it may be required in different phases of your research.

Labeling manual

Annotating data can be done in different ways, but the most important thing is to state clearly which data you will label and with which labels. Before you start annotating, it is best to write a manual for yourself (and possibly other annotators) that describes how you will perform the annotation task. You may need to revise the manual during annotation because you encounter cases you had not anticipated, but it remains important to make explicit how you are doing your work. This helps you decide edge cases, and it is also essential for you and other researchers to interpret the annotated dataset appropriately. Your annotations are worthless if it is unknown how they should be interpreted. Furthermore, a labeling manual is essential when multiple annotators annotate the same dataset together, to ensure that the data is annotated (relatively) consistently.
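To make a labeling manual easier to apply consistently, its label definitions can also be written down in a machine-readable form. The following is a minimal sketch, not a prescribed format: the label names are borrowed from the court-decision example above, while the definitions, the version string, and the function and field names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """One label from the manual, with the definition annotators should follow."""
    name: str
    definition: str


# Hypothetical manual version; bump it whenever the manual is revised.
MANUAL_VERSION = "0.1"

# Hypothetical label set and definitions, for illustration only.
LABELS = {
    label.name: label
    for label in [
        Label("procedure", "Text describing the procedural history of the case."),
        Label("facts", "Text describing the facts as established by the court."),
        Label("conclusion", "Text stating the court's final decision."),
    ]
}


def make_annotation(paragraph_id: str, label_name: str, annotator: str) -> dict:
    """Build one annotation record, rejecting labels the manual does not define."""
    if label_name not in LABELS:
        raise ValueError(
            f"Unknown label {label_name!r}; expected one of {sorted(LABELS)}"
        )
    return {
        "paragraph_id": paragraph_id,
        "label": label_name,
        "annotator": annotator,
        # Storing the manual version lets later readers see which rules applied.
        "manual_version": MANUAL_VERSION,
    }
```

Recording the manual version with every annotation is one way to keep the dataset interpretable after the manual has been revised during the annotation task.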

Approaches to data annotation

Once you have decided which labels you want to give your data, you must decide on a technical approach for applying them. There is a wide variety of options here as well, ranging from a simple spreadsheet file to fully integrated annotation suites such as Lawnotation or Label Studio. Each option has its pros and cons, such as ease of getting started (you might already be familiar with spreadsheet software such as Excel, but not yet with Label Studio), ease of use (annotation suites often let you interact directly with the dataset, support random sampling, and allow multiple annotators to work together), and output format.
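For the spreadsheet approach, random sampling has to be done by hand. The sketch below shows one possible way to do that with Python's standard library: it draws a reproducible random sample of documents and writes them to a CSV file that can be opened in spreadsheet software. The function name, the column names, and the assumption that your documents are available as a dictionary of identifiers and texts are all illustrative, not part of any particular tool.

```python
import csv
import random


def write_annotation_sheet(documents, path, sample_size, seed=0):
    """Write a random sample of documents to a CSV file with empty label columns.

    `documents` is assumed to be a dict mapping a document id to its text.
    Returns the sampled ids in the order they were written.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    ids = sorted(documents)    # sort first so the sample does not depend on dict order
    sample_ids = rng.sample(ids, k=min(sample_size, len(ids)))

    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["doc_id", "text", "label", "notes"]
        )
        writer.writeheader()
        for doc_id in sample_ids:
            # "label" and "notes" are left empty for the annotator to fill in.
            writer.writerow(
                {"doc_id": doc_id, "text": documents[doc_id], "label": "", "notes": ""}
            )
    return sample_ids
```

Calling, for example, `write_annotation_sheet(documents, "to_annotate.csv", sample_size=50)` would produce a file in which each row is one document to label. Keeping the seed in your notes allows others to reconstruct which documents were selected.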

Last updated: 30-Jun-2025