OCRerrcr is a small language model specialized for the detection of OCR errors.

OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1,000 documents with labelled OCR errors, drawn from open data (Finance Commons) and cultural heritage sources (Common Corpus).

To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, OCRoscope, which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.

The model was trained using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

Example

The following is a low-error example sentence taken from Common Corpus:

They did not approach cer, but turned away and passed irom her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the tho plains.

And the OCRerrcr detection (with formatting for clarity):

They did not approach <er>cer,</er> but turned away and passed <er>irom</er> her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the <er>tho</er> plains.
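
Since the detection is meant to feed an error rate estimate, the tagged output above can be turned into a rate with a few lines of Python. This is only a minimal sketch: it assumes the detections are available as text with `<er>...</er>` markers, which is the display formatting used in this example and not necessarily the model's raw output. It also counts whitespace-separated tokens, which is an illustrative choice. The names `ER_SPAN` and `error_rate` are hypothetical.

```python
import re

# Matches one flagged span; the <er> markup is an assumption taken from
# the formatted example above, not a documented output format.
ER_SPAN = re.compile(r"<er>(.*?)</er>", re.DOTALL)


def error_rate(tagged: str) -> float:
    """Return the share of whitespace-separated tokens inside <er> spans."""
    # Count the tokens that sit inside flagged spans.
    flagged = sum(len(span.split()) for span in ER_SPAN.findall(tagged))
    # Strip the markers to count every token of the underlying text.
    plain = ER_SPAN.sub(lambda m: m.group(1), tagged)
    total = len(plain.split())
    return flagged / total if total else 0.0


tagged = (
    "They did not approach <er>cer,</er> but turned away and passed "
    "<er>irom</er> her presence"
)
# 2 of the 13 tokens are flagged.
print(f"{error_rate(tagged):.3f}")  # 0.154
```

A token-level ratio like this is coarse: a flagged token counts as one error regardless of how many characters are wrong, so it is not directly comparable to a character error rate.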


Model size: 0.4B params (Safetensors, tensor type F32)
