Breakthrough in archival access: Googling through archives within reach

(10 November 2016, Amsterdam) The ability to google through archival records is within reach, concludes the final report of the project Full Automatic Archival Access (FAAA). The project studied how new digital technologies can make paper-based archives searchable at document level. In the pilot, about four out of five words were correctly recognized by OCR and NER software.

A small selection from the Central Archive of Justice (CABR, National Archives of the Netherlands) was used in the pilot. The project partners, the Network of Dutch War Collections, the Centre for Language and Speech Technology, the National Archives of the Netherlands and the IMPACT Centre of Competence, are pleasantly surprised by the results.

Eighty-one percent of the words in the test documents were correctly recognized by the software. This means that typed or hybrid text documents with a standard layout can be made digitally searchable automatically, with an acceptable error rate. A standard layout consists of straight lines, a regular ink density and clear contrast between text and background.
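
To illustrate what a figure such as "81 percent of words correctly recognized" can mean in practice, the sketch below computes a word-level accuracy by aligning a hand-checked transcription with OCR output. It is only one possible definition: the function name `word_accuracy` and the sample sentences are invented for illustration, and the FAAA report may have measured accuracy differently.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference words that reappear, in order, in the OCR output.

    Assumption: accuracy = aligned matching words / words in the reference.
    This is an illustrative definition, not necessarily the one used in the
    FAAA report.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not ignored in longer texts.
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Invented sample: two of the eight words are misread by the OCR engine.
reference = "the court heard the witness in Amsterdam today"
ocr_output = "the c0urt heard the witness in Amsterdarn today"
accuracy = word_accuracy(reference, ocr_output)
print(f"{accuracy:.0%}")  # prints "75%"
```

Aligning the word sequences, rather than comparing position by position, keeps a single dropped or split word from marking every following word as wrong.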

The FAAA project consisted of two steps. First, approximately one hundred documents from the CABR archive were made machine-readable using Optical Character Recognition (OCR) software. Then, the quality of the OCR'ed text was improved using Named Entity Recognition (NER) software. This software can identify places, persons and organizations and correct them where necessary.
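
The correction part of the second step can be pictured with a much simpler stand-in: comparing a candidate name from the OCR text against a list of known names and replacing it when a close match exists. The sketch below is not NER and not the software used in the project; the name list `KNOWN_NAMES`, the function `correct_entity` and the similarity threshold are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical list of known place names; a real system would use a much
# larger, curated list of places, persons and organizations.
KNOWN_NAMES = ["Amsterdam", "Rotterdam", "Utrecht"]


def correct_entity(candidate: str, known_names=KNOWN_NAMES, cutoff: float = 0.8) -> str:
    """Return the closest known name, or the candidate unchanged if none is close.

    Assumption: a similarity cutoff of 0.8 is an arbitrary illustrative value.
    """
    matches = get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


print(correct_entity("Amsterdarn"))  # OCR misread "m" as "rn"
print(correct_entity("Groningen"))   # not in the list, so left unchanged
```

A cutoff that is too low would rewrite names that are simply absent from the list, so in this sketch the threshold trades missed corrections against false ones.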

This is a leap forward in the accessibility of archives, which are currently mostly described at collection or sub-collection level and are rarely accessible at document level. Puck Huitsing, program director of the Network of Dutch War Collections: “The ability to make archives automatically digitally searchable offers many new opportunities for researchers. Historical collections can be questioned in a way that has never been possible in the paper world.”

The Full Automatic Archival Access project was funded by Archief2020, BRAIN, VSBFonds, VFonds and the Ministry of Health, Welfare and Sport. The final report and the individual reports are published on the website of the Network of Dutch War Collections: http://oorlogsbronnen.nl/volauto.
