---
license: openrail
---

# trocr-old-russian

## Info

The model is trained to recognize printed text in Russian written in the old (pre-reform) orthography.

- Base model for fine-tuning: microsoft/trocr-small-printed.
- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

## Usage

### Basic usage

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned model and the processor
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# TrOCR works on an RGB image of a single text line
image = Image.open("./path/to/your/image").convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

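When a page has already been cut into line images, the same three calls can handle them as one batch. The following is a minimal sketch, not part of the repository: `recognize_lines` is a hypothetical helper name, and `model` and `processor` are assumed to be the objects loaded in the example above.

```python
def recognize_lines(images, model, processor):
    """Recognize a batch of text-line images and return one string per image."""
    images = list(images)
    if not images:
        return []
    # The processor accepts a list of images and returns them as one batch.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    # batch_decode returns one decoded string per row of generated_ids.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)
```

With the objects from the example above, `recognize_lines([image], hf_model, processor)` would return a list with one string. For large pages it may be worth splitting the lines into smaller batches to limit memory use.
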
## Usage for recognizing a page

1. Download the following files from the repository: recognize_page.py, requirements_page.txt

2. Install tesseract-ocr

   - Linux

     ```bash
     sudo apt install tesseract-ocr
     ```

   - MacOS

     ```bash
     brew install tesseract
     ```

   - Windows

     1. Download the tesseract exe from https://github.com/UB-Mannheim/tesseract/wiki.
     2. Install it in C:\Program Files (x86)\Tesseract-OCR
     3. Open the Windows command prompt or Anaconda Prompt.

3. Install poppler

   - Linux

     ```bash
     sudo apt-get install poppler-utils
     ```

   - MacOS

     ```bash
     brew install poppler
     ```

   - Windows

     1. Download the latest poppler package from the @oschwartz10612 build, which is the most up-to-date.
     2. Move the extracted directory to the desired place on your system.
     3. Add the bin/ directory to your PATH.
     4. Test that all went well by opening cmd and making sure that you can call `pdftoppm -h`.

4. Install dependencies:

   ```bash
   pip install -r requirements_page.txt
   ```

5. Use this code:

   ```python
   from recognize_page import recognize_page

   # Path to the page image
   page_path = "cleaned_pages/C2V10/page11.png"
   text = recognize_page(page_path, text_output_path="output/file.txt")

   print("Page text:\n")
   print(text)
   ```

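The example path `cleaned_pages/C2V10/page11.png` suggests a folder of numbered page images. If you want to process such a folder, plain alphabetical sorting would put `page11.png` before `page2.png`. The sketch below sorts pages in natural order and applies a recognition function to each one. It is only an illustration: `natural_key` and `recognize_folder` are hypothetical names, and the recognition function is passed in as a parameter rather than assumed to have a particular signature.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders page2.png before page11.png."""
    # re.split with a capture group alternates text and digit runs,
    # so items at the same position always have comparable types.
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", Path(path).name)]


def recognize_folder(folder, recognize, pattern="*.png"):
    """Run `recognize` on every matching image in `folder`, in page order."""
    pages = sorted(Path(folder).glob(pattern), key=natural_key)
    return [recognize(str(page)) for page in pages]
```

One possible call, assuming you want one text file per page (the output naming is an assumption), is `recognize_folder("cleaned_pages/C2V10", lambda p: recognize_page(p, text_output_path=p + ".txt"))`.
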
## Usage for recognizing a book from a PDF file

1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt

2. Install tesseract-ocr

   - Linux

     ```bash
     sudo apt install tesseract-ocr
     ```

   - MacOS

     ```bash
     brew install tesseract
     ```

   - Windows

     1. Download the tesseract exe from https://github.com/UB-Mannheim/tesseract/wiki.
     2. Install it in C:\Program Files (x86)\Tesseract-OCR
     3. Open the Windows command prompt or Anaconda Prompt.

3. Install poppler

   - Linux

     ```bash
     sudo apt-get install poppler-utils
     ```

   - MacOS

     ```bash
     brew install poppler
     ```

   - Windows

     1. Download the latest poppler package from the @oschwartz10612 build, which is the most up-to-date.
     2. Move the extracted directory to the desired place on your system.
     3. Add the bin/ directory to your PATH.
     4. Test that all went well by opening cmd and making sure that you can call `pdftoppm -h`.

4. Install dependencies:

   ```bash
   pip install -r requirements_book.txt
   ```

5. Use this code:

   ```python
   from recognize_book import recognize_book

   recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
   ```

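Both workflows above depend on the external tools installed in steps 2 and 3. Before running the scripts, a quick check like the one below can tell you whether `tesseract` and `pdftoppm` (part of poppler) are reachable from Python. This is a sketch using only the standard library; `missing_tools` is a hypothetical helper, and on Windows the scripts may locate tesseract through its install directory instead, so a missing PATH entry is not necessarily an error there.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are available")
```
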
## Metrics on test

- CER (Character Error Rate) = 0.095
- WER (Word Error Rate) = 0.298
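
This card does not state how the scores were computed or aggregated over the test set. For orientation, a common definition divides the edit (Levenshtein) distance between the recognized text and the reference by the reference length, counted in characters for CER and in words for WER. The sketch below follows that definition; the function names are illustrative and the references are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; `reference` must be non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate; `reference` must contain at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, with the reference `"война и миръ"` and the output `"война и мир"`, one of 12 characters is wrong (CER = 1/12) while one of 3 words is wrong (WER = 1/3), which is why WER is usually much higher than CER.
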