---
license: openrail
---

# trocr-old-russian

## Info

The model is trained to recognize printed text in Old Russian (pre-reform orthography).

- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage

### Basic usage

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor (image preprocessing + tokenizer)
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# The processor expects a 3-channel image, so convert to RGB
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

## Usage for recognizing a page

1. Download the following files from the repository: recognize_page.py, requirements_page.txt
2. Install tesseract-ocr
   - Linux
     ```bash
     sudo apt install tesseract-ocr
     ```
   - MacOS
     ```bash
     brew install tesseract
     ```
   - Windows
     1. Download the tesseract installer (exe) from https://github.com/UB-Mannheim/tesseract/wiki.
     2. Install it to C:\Program Files (x86)\Tesseract-OCR
     3. Open a command prompt in your virtual environment, or an Anaconda Prompt.
3. Install poppler
   - Linux
     ```bash
     sudo apt-get install poppler-utils
     ```
   - MacOS
     ```bash
     brew install poppler
     ```
   - Windows
     1. Download the latest poppler package from @oschwartz10612's build, which is the most up to date.
     2. Move the extracted directory to the desired place on your system.
     3. Add the bin/ directory to your PATH.
     4. Check that everything works by opening cmd and making sure that you can call `pdftoppm -h`.
4. Install dependencies:
   ```bash
   pip install -r requirements_page.txt
   ```
5. Use this code
   ```python
   from recognize_page import recognize_page

   page_path = "cleaned_pages/C2V10/page11.png"

   # Recognize the page and also save the result to a text file
   text = recognize_page(page_path, text_output_path="output/file.txt")

   print("Page text:\n")
   print(text)
   ```

## Usage for recognizing a book from a PDF file

1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt
2. Install tesseract-ocr
   - Linux
     ```bash
     sudo apt install tesseract-ocr
     ```
   - MacOS
     ```bash
     brew install tesseract
     ```
   - Windows
     1. Download the tesseract installer (exe) from https://github.com/UB-Mannheim/tesseract/wiki.
     2. Install it to C:\Program Files (x86)\Tesseract-OCR
     3. Open a command prompt in your virtual environment, or an Anaconda Prompt.
3. Install poppler
   - Linux
     ```bash
     sudo apt-get install poppler-utils
     ```
   - MacOS
     ```bash
     brew install poppler
     ```
   - Windows
     1. Download the latest poppler package from @oschwartz10612's build, which is the most up to date.
     2. Move the extracted directory to the desired place on your system.
     3. Add the bin/ directory to your PATH.
     4. Check that everything works by opening cmd and making sure that you can call `pdftoppm -h`.
4. Install dependencies:
   ```bash
   pip install -r requirements_book.txt
   ```
5. Use this code
   ```python
   from recognize_book import recognize_book

   # Recognize every page of the PDF and write the text to a file
   recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
   ```

## Metrics on the test set

- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
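
For readers who want to compute comparable numbers on their own data, here is a minimal sketch of the usual definitions of CER and WER: edit distance between reference and recognized text, divided by the reference length in characters or words. The evaluation script behind the figures above is not shown in this card, so the helper names `levenshtein`, `cer`, and `wer`, the whitespace tokenization, and the absence of any text normalization are all assumptions made for illustration.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words.

    Assumption: words are split on whitespace, with no normalization.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one character recognized incorrectly (ѣ read as е)
reference = "въ началѣ было слово"
hypothesis = "въ начале было слово"
print(cer(reference, hypothesis))  # 0.05 (1 edit / 20 characters)
print(wer(reference, hypothesis))  # 0.25 (1 wrong word / 4 words)
```

A single wrong character makes the whole word count as an error, which is why WER is typically several times higher than CER for the same output.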