---
license: openrail
---

# trocr-old-russian
## Info
The model is trained to recognize printed text in Old Russian (pre-reform orthography).
- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage
### Basic usage
```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor.
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# Convert to RGB so grayscale or RGBA scans are handled as well.
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text.
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
## Usage for recognizing a page
1. Download the following files from the repository: recognize_page.py, requirements_page.txt
2. Install tesseract-ocr
    - Linux
    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Windows Command Prompt or Anaconda Prompt for the remaining steps.

3. Install dependencies:
```bash
pip install -r requirements_page.txt
```
4. Run the following code:
```python
from recognize_page import recognize_page

page_path = "cleaned_pages/C2V10/page11.png"
text = recognize_page(page_path, text_output_path="output/file.txt")

print("Page text:\n")
print(text)
```
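
Before calling `recognize_page`, it can help to confirm that the Tesseract binary from step 2 is actually reachable. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of this repository, and it assumes the page script locates Tesseract through `PATH`.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version` (empty string if nothing is printed)."""
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    # Depending on the Tesseract release, the version banner goes to stdout or stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output.strip() else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; check the installation step above.")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

On Windows the installer may not add `C:\Program Files (x86)\Tesseract-OCR` to `PATH`, in which case this check reports that the binary is missing even though it is installed.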

## Usage for recognizing a book from a PDF file
1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt

2. Install tesseract-ocr

    - Linux

    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Windows Command Prompt or Anaconda Prompt for the remaining steps.



3. Install dependencies:

```bash
pip install -r requirements_book.txt
```

4. Run the following code:

```python
from recognize_book import recognize_book

recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
```
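
To process several books in one run, the call above can be wrapped in a loop over a folder of PDF files. This is a sketch only: `recognize_books_in_dir` is a hypothetical helper, not part of the repository, and it assumes the recognizer accepts the `book_path` and `text_output_path` keyword arguments shown above. The recognizer is passed in as an argument so the helper itself has no dependency on the model.

```python
from pathlib import Path


def recognize_books_in_dir(src_dir, out_dir, recognize):
    """Call `recognize` on every PDF in src_dir and return the list of output paths.

    `recognize` is any callable taking `book_path` and `text_output_path`
    keyword arguments, for example `recognize_book` from this repository.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    outputs = []
    # Sort for a stable processing order; only files ending in .pdf are picked up.
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        recognize(book_path=str(pdf), text_output_path=str(target))
        outputs.append(target)
    return outputs
```

With the repository files in place, a call such as `recognize_books_in_dir("books", "texts", recognize_book)` would write one `.txt` file per PDF into `texts/`; the folder names here are placeholders.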

## Metrics on the test set
- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
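
For readers unfamiliar with these metrics, the sketch below shows the usual definitions: the Levenshtein edit distance between prediction and reference, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. The model card does not state how the reported numbers were computed (for example, per-line averaging versus a corpus-level ratio), so this is an illustration of the standard formulas rather than the evaluation script; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item of ref
                curr[j - 1] + 1,     # insert an item of hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    reference = "миръ и война"   # ground truth in pre-reform spelling
    hypothesis = "мир и война"   # prediction that dropped the final "ъ"
    print(f"CER = {cer(reference, hypothesis):.3f}")
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

The example illustrates why WER is typically much higher than CER: one missing character out of twelve makes one word out of three wrong.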