---
license: apache-2.0
language:
  - en
  - fr
  - de
---

OCRerrcr is a small language model specialized for the detection of OCR errors.

OCRerrcr was trained by Elliot Jones for PleIAs on a sample of 1000 documents with labelled OCR errors from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).

To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, OCRoscope, which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.
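
To make the notion of an OCR error rate concrete, here is a minimal sketch of how such a rate could be derived once each word of a document has been flagged as correct or erroneous. It does not call OCRerrcr and does not reflect its actual output format, which this card does not specify. The function name `ocr_error_rate` and the per-word boolean flags are assumptions made for illustration only.

```python
from typing import Sequence


def ocr_error_rate(error_flags: Sequence[bool]) -> float:
    """Return the share of units (e.g. words) flagged as OCR errors.

    `error_flags` is a hypothetical input: one boolean per word,
    True when the word was labelled as an OCR error.
    """
    if not error_flags:
        # An empty document has no errors to count.
        return 0.0
    return sum(error_flags) / len(error_flags)


# Hypothetical flags for the four words of "Tbe quick brown f0x":
# the first and last words are misread.
flags = [True, False, False, True]
print(ocr_error_rate(flags))  # 0.5
```

A word-level ratio is only one possible definition. A character-level rate would follow the same pattern with one flag per character.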

The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

Example