---
license: openrail
---
# trocr-old-russian
## Info
The model is trained to recognize printed text in Old Russian (pre-reform orthography).
- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage
### Basic usage
```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor (image preprocessing + tokenizer)
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# Convert to RGB, since the processor expects 3-channel input
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
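
When several cropped text images need to be recognized, they can be passed to the processor as a list instead of one at a time. The sketch below is a minimal illustration, not part of the repository: `recognize_batch` and its `batch_size` parameter are hypothetical names, and it assumes `model` and `processor` are loaded as in the snippet above.

```python
def recognize_batch(images, model, processor, batch_size=8):
    """Recognize a list of PIL images, batch_size images per generate() call.

    Hypothetical helper: `model` and `processor` are assumed to be the
    VisionEncoderDecoderModel and TrOCRProcessor loaded as shown above.
    """
    texts = []
    for start in range(0, len(images), batch_size):
        # Take the next slice of images and make sure each one is RGB
        batch = [img.convert("RGB") for img in images[start:start + batch_size]]
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        # batch_decode returns one string per image in the batch
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

A call might look like `texts = recognize_batch([Image.open(p) for p in paths], hf_model, processor)`, where `paths` is your own list of image files. A smaller `batch_size` reduces memory use at the cost of more `generate()` calls.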
## Usage for recognizing a page
1. Download the following files from the repository: recognize_page.py, requirements_page.txt
2. Install tesseract-ocr
    - Linux
    ```bash
    sudo apt install tesseract-ocr
    ```
    - macOS
    ```bash
    brew install tesseract
    ```
    - Windows
        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open the Windows Command Prompt or Anaconda Prompt.

3. Install the dependencies:
```bash
pip install -r requirements_page.txt
```
4. Run the following code:
```python
from recognize_page import recognize_page

# Recognize one page image; text_output_path is the file the text is written to
page_path = "cleaned_pages/C2V10/page11.png"
text = recognize_page(page_path, text_output_path="output/file.txt")

print("Page text:\n")
print(text)
```
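
If recognition fails right after installation, a quick check is whether the Tesseract binary can be found at all. The sketch below is only a diagnostic aid: `find_tesseract` is a hypothetical helper, and it assumes the scripts locate Tesseract through the `PATH` environment variable, which the repository does not confirm.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    # On Windows the install directory may need to be added to PATH manually
    print("Tesseract was not found on PATH.")
else:
    print(f"Tesseract found at {path}")
```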

## Usage for recognizing a book from a PDF file
1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt
2. Install tesseract-ocr
    - Linux
    ```bash
    sudo apt install tesseract-ocr
    ```
    - macOS
    ```bash
    brew install tesseract
    ```
    - Windows
        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open the Windows Command Prompt or Anaconda Prompt.

3. Install the dependencies:
```bash
pip install -r requirements_book.txt
```
4. Run the following code:
```python
from recognize_book import recognize_book

recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
```

## Metrics on the test set
- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
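
For readers unfamiliar with these metrics, the sketch below shows their usual definition: the edit (Levenshtein) distance between the recognized text and the reference, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. This card does not state how the reported numbers were computed, so the helper names `edit_distance`, `cer`, and `wer`, and any normalization choices, are assumptions for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # Minimum over deletion, insertion, and substitution/match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a CER of 0.095 corresponds to roughly one character edit per ten or eleven reference characters. WER is typically higher than CER because a single wrong character makes the whole word count as an error.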