---

license: openrail
---

# trocr-old-russian
## Info
The model recognizes printed text in Old Russian (pre-reform orthography).
- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage
### Basic usage
```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor (image preprocessing + tokenizer)
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# Convert to RGB so grayscale or RGBA inputs are handled as well
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
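When several cropped text images need to be recognized, the same calls can be applied to batches. The following is a minimal sketch, not part of the repository: `chunks` and `recognize_lines` are hypothetical helper names, `batch_size=8` is an arbitrary default, and `model` and `processor` are assumed to be the objects loaded in the example above.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def recognize_lines(image_paths, model, processor, batch_size=8):
    """Return one decoded string per image path, in input order.

    Hypothetical helper: `model` and `processor` are the objects
    created with `from_pretrained` in the basic usage example.
    """
    from PIL import Image  # imported here so `chunks` works without Pillow

    texts = []
    for batch in chunks(list(image_paths), batch_size):
        # The processor accepts a list of images and stacks them into one tensor
        images = [Image.open(path).convert("RGB") for path in batch]
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

A call would look like `recognize_lines(["line1.png", "line2.png"], hf_model, processor)`, where the file names are placeholders. A smaller batch size may be needed if memory is limited.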
## Usage for recognizing a page
1. Download the following files from the repository: `recognize_page.py`, `requirements_page.txt`
2. Install tesseract-ocr
    - Linux
    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Command Prompt or an Anaconda Prompt.

3. Install poppler

    - Linux
    ```bash
    sudo apt-get install poppler-utils
    ```

    - macOS

    ```bash
    brew install poppler
    ```

    - Windows

        1. Download the latest poppler package from @oschwartz10612, whose build is the most up to date.

        2. Move the extracted directory to the desired location on your system.

        3. Add the `bin/` directory to your PATH.

        4. Check that everything works by opening cmd and making sure you can run `pdftoppm -h`.

4. Install dependencies:
```bash
pip install -r requirements_page.txt
```
5. Run the following code
```python
from recognize_page import recognize_page

page_path = "cleaned_pages/C2V10/page11.png"
# The recognized text is returned and also written to text_output_path
text = recognize_page(page_path, text_output_path="output/file.txt")

print("Page text:\n")
print(text)
```
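If the page recognition fails because a system tool cannot be found, it can help to check that the executables installed in steps 2 and 3 are on your PATH. The sketch below assumes the tools are invoked as `tesseract` and `pdftoppm` (the poppler command mentioned above); `missing_tools` is a hypothetical helper name and is not part of the repository.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the executables in `tools` that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

On Windows, a tool reported as missing usually means its installation directory has not been added to PATH yet.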

## Usage for recognizing a book from a PDF file
1. Download the following files from the repository: `recognize_book.py`, `recognize_page.py`, `requirements_book.txt`

2. Install tesseract-ocr

    - Linux

    ```bash
    sudo apt install tesseract-ocr
    ```

    - macOS

    ```bash
    brew install tesseract
    ```

    - Windows

        1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

        2. Install it to C:\Program Files (x86)\Tesseract-OCR

        3. Open a Command Prompt or an Anaconda Prompt.



3. Install poppler

    - Linux

    ```bash
    sudo apt-get install poppler-utils
    ```

    - macOS

    ```bash
    brew install poppler
    ```

    - Windows

        1. Download the latest poppler package from @oschwartz10612, whose build is the most up to date.

        2. Move the extracted directory to the desired location on your system.

        3. Add the `bin/` directory to your PATH.

        4. Check that everything works by opening cmd and making sure you can run `pdftoppm -h`.



4. Install dependencies:

```bash
pip install -r requirements_book.txt
```



5. Run the following code

```python
from recognize_book import recognize_book

# Recognize the PDF and write the text to text_output_path
recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
```

## Metrics on the test set
- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
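
For reference, CER and WER are commonly defined as the Levenshtein (edit) distance between the recognized text and the reference, divided by the reference length in characters or words. The sketch below illustrates that standard definition only; this card does not state how the numbers above were computed, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a CER of 0.095 corresponds to roughly one character edit per ten reference characters.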