---
license: openrail
---

# trocr-old-russian
## Info
The model recognizes printed text in Old Russian (pre-reform orthography).
- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage
### Basic usage
```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor used for preprocessing and decoding
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# The processor expects a 3-channel image, so convert to RGB
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
## Usage for recognizing a page
1. Download the following files from the repository: recognize_page.py, requirements_page.txt
2. Install tesseract-ocr
    - Linux
    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Windows command prompt or an Anaconda prompt.

3. Install poppler

    - Linux
    ```bash
    sudo apt-get install poppler-utils
    ```

    - macOS

    ```bash
    brew install poppler
    ```

    - Windows

        1. Download the latest poppler package from the @oschwartz10612 build, which is the most up-to-date.

        2. Move the extracted directory to the desired place on your system.

        3. Add the bin/ directory to your PATH.

        4. Check that everything works by opening cmd and making sure that you can call `pdftoppm -h`.


4. Install dependencies:
```bash
pip install -r requirements_page.txt
```
5. Run the following code
```python
from recognize_page import recognize_page

# Path to the page image; the recognized text is also saved to text_output_path
page_path = "cleaned_pages/C2V10/page11.png"
text = recognize_page(page_path, text_output_path="output/file.txt")

print("Page text:\n")
print(text)
```
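
If the page recognition fails right away, a frequent cause is that one of the external tools from steps 2 and 3 is not on your PATH. The following is a minimal sketch of a pre-flight check that uses only the Python standard library. The helper name `missing_tools` is hypothetical and is not part of this repository; the binary names `tesseract` and `pdftoppm` are the ones installed by the steps above.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the given command-line tools that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both available.")
```

On Windows, a tool reported as missing usually means that its install directory (for example the poppler bin/ directory) still has to be added to PATH.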

## Usage for recognizing a book from a PDF file
1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt

2. Install tesseract-ocr

    - Linux

    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Windows command prompt or an Anaconda prompt.



3. Install poppler

    - Linux

    ```bash
    sudo apt-get install poppler-utils
    ```

    - macOS

    ```bash
    brew install poppler
    ```

    - Windows

        1. Download the latest poppler package from the @oschwartz10612 build, which is the most up-to-date.

        2. Move the extracted directory to the desired place on your system.

        3. Add the bin/ directory to your PATH.

        4. Check that everything works by opening cmd and making sure that you can call `pdftoppm -h`.



4. Install dependencies:

```bash
pip install -r requirements_book.txt
```



5. Run the following code

```python
from recognize_book import recognize_book

# Recognize every page of the PDF and write the result to a text file
recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
```
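
To process several books in one go, the call above can be wrapped in a small loop. This is only a sketch: `recognize_books` is a hypothetical helper that is not part of this repository, and it assumes nothing about `recognize_book` beyond the two keyword arguments shown above. The recognizer is passed in as a parameter so the loop can be tried out with a stub before running the real model.

```python
from pathlib import Path


def recognize_books(pdf_dir, out_dir, recognize):
    """Run `recognize` on every *.pdf in pdf_dir, writing one .txt per book.

    `recognize` must accept `book_path` and `text_output_path` keyword
    arguments, like `recognize_book` in the example above.
    Returns the list of output paths, in processing order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # book.pdf -> <out_dir>/book.txt
        target = out_dir / (pdf.stem + ".txt")
        recognize(book_path=str(pdf), text_output_path=str(target))
        written.append(target)
    return written
```

With the repository files in place, this could be called as `recognize_books("books", "output", recognize_book)` after `from recognize_book import recognize_book`; the directory names here are placeholders.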

## Metrics on test
- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
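
For readers who want to compute comparable numbers on their own data, here is a minimal sketch of the usual definitions: CER is the character-level edit distance divided by the reference length in characters, and WER is the same ratio over whitespace-separated words. This is an assumption about how such metrics are commonly computed, not the evaluation script used for the figures above; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single dropped final "ъ" illustrates why WER is typically much higher than CER: one wrong character makes the whole word count as an error.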