---
license: openrail
---
# trocr-old-russian
## Info
The model is trained to recognize printed text in Old Russian (pre-reform orthography).
- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr
## Usage
### Basic usage
```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor (image preprocessing + tokenizer)
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# Convert to RGB: the processor expects a 3-channel image
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into a string
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
## Usage for recognizing a page
1. Download the following files from the repository: recognize_page.py, requirements_page.txt
2. Install tesseract-ocr
- Linux
```bash
sudo apt install tesseract-ocr
```
- macOS
```bash
brew install tesseract
```
- Windows
  1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.
  2. Install it to C:\Program Files (x86)\Tesseract-OCR
  3. Open a Windows Command Prompt or an Anaconda Prompt to run the remaining steps.
3. Install poppler
- Linux
```bash
sudo apt-get install poppler-utils
```
- macOS
```bash
brew install poppler
```
- Windows
  1. Download the latest poppler package for Windows from @oschwartz10612, whose builds are the most up to date.
  2. Extract the archive and move the extracted directory to the desired place on your system.
  3. Add its `bin/` directory to your `PATH`.
  4. Test that everything works by opening cmd and checking that `pdftoppm -h` runs.
4. Install dependencies:
```bash
pip install -r requirements_page.txt
```
5. Run the following code:
```python
from recognize_page import recognize_page

page_path = "cleaned_pages/C2V10/page11.png"
# Returns the recognized text; text_output_path names the output text file
text = recognize_page(page_path, text_output_path="output/file.txt")
print("Page text:\n")
print(text)
```
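
The installation steps above put two external command-line tools in place: `tesseract` and poppler's `pdftoppm`. As a minimal sketch, the check below reports which of them cannot be found on `PATH`. The helper name `missing_tools` is hypothetical and not part of the repository, and the sketch assumes `recognize_page.py` locates both tools through `PATH`; on Windows, Tesseract may instead be found through its install directory.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("tesseract and pdftoppm are both available.")
```

If a tool is reported as missing right after installation, opening a new terminal usually picks up the updated `PATH`.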
## Usage for recognizing a book from a PDF file
1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt
2. Install tesseract-ocr
- Linux
```bash
sudo apt install tesseract-ocr
```
- macOS
```bash
brew install tesseract
```
- Windows
  1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.
  2. Install it to C:\Program Files (x86)\Tesseract-OCR
  3. Open a Windows Command Prompt or an Anaconda Prompt to run the remaining steps.
3. Install poppler
- Linux
```bash
sudo apt-get install poppler-utils
```
- macOS
```bash
brew install poppler
```
- Windows
  1. Download the latest poppler package for Windows from @oschwartz10612, whose builds are the most up to date.
  2. Extract the archive and move the extracted directory to the desired place on your system.
  3. Add its `bin/` directory to your `PATH`.
  4. Test that everything works by opening cmd and checking that `pdftoppm -h` runs.
4. Install dependencies:
```bash
pip install -r requirements_book.txt
```
5. Run the following code:
```python
from recognize_book import recognize_book
recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
```
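
To process several books in one run, the call above can be wrapped in a loop over a directory of PDFs. The sketch below is one possible way to do that. `plan_outputs` and `recognize_all` are hypothetical helpers, not part of the repository, and the directory names in the usage comment are placeholders. The recognizer is passed in as an argument and is called with the same `book_path` and `text_output_path` keywords shown above.

```python
from pathlib import Path


def plan_outputs(pdf_dir, out_dir):
    """Pair every PDF in pdf_dir with a .txt path of the same stem in out_dir."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    return [(pdf, out_dir / f"{pdf.stem}.txt") for pdf in sorted(pdf_dir.glob("*.pdf"))]


def recognize_all(pdf_dir, out_dir, recognize):
    """Call `recognize` for each PDF whose output file does not exist yet."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf, txt in plan_outputs(pdf_dir, out_dir):
        if txt.exists():
            continue  # already processed in an earlier run
        recognize(book_path=str(pdf), text_output_path=str(txt))
        written.append(txt)
    return written


# Usage (assumes recognize_book.py is importable and a "books" directory exists):
# from recognize_book import recognize_book
# recognize_all("books", "book_texts", recognize_book)
```

Skipping outputs that already exist lets an interrupted run be resumed without recognizing finished books again.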
## Metrics on the test set
- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
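
The card does not state how these metrics were computed. As a sketch under the common definition, both are an edit (Levenshtein) distance between the reference and the recognized text, divided by the reference length: counted in characters for CER and in whitespace-separated words for WER. The function names below are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a CER of 0.095 corresponds to roughly one character edit per ten reference characters. WER is typically higher because a single wrong character makes the whole word count as an error.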