	---
	license: openrail
	---
	# trocr-old-russian
	## Info
	The model is trained to recognize printed Russian text in pre-reform (old) orthography.
	- Base model for fine-tuning: microsoft/trocr-small-printed.
	- Fine-tuned on 636k text images from the dataset https://huggingface.co/datasets/nevmenandr/russian-old-orthography-ocr

	## Usage
	### Basic usage
	```python
	from PIL import Image
	from transformers import TrOCRProcessor, VisionEncoderDecoderModel

	# Load the fine-tuned model and the processor (image preprocessing + tokenizer).
	hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
	processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

	# TrOCR expects a 3-channel image of a single text line.
	image = Image.open("./path/to/your/image").convert("RGB")
	pixel_values = processor(images=image, return_tensors="pt").pixel_values

	# Generate token ids and decode them into text.
	generated_ids = hf_model.generate(pixel_values)
	generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(generated_text)
	```
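
When there are many line images, it is usually faster to pass them to the model in batches. The sketch below extends the snippet above under a few assumptions: `batched` and `recognize_lines` are hypothetical helper names, the batch size of 8 is arbitrary, and each image is assumed to contain a single text line.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def recognize_lines(image_paths: Sequence[str], batch_size: int = 8) -> List[str]:
    """Recognize several single-line images, one batch at a time."""
    # Heavy imports stay local so `batched` can be used without them.
    import torch
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model.eval()

    texts: List[str] = []
    for chunk in batched(image_paths, batch_size):
        images = [Image.open(path).convert("RGB") for path in chunk]
        # The processor accepts a list of images and stacks them into one tensor.
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        with torch.no_grad():
            generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

Calling `recognize_lines(["line1.png", "line2.png"])` (placeholder file names) would return one string per image, in input order.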
	## Usage for recognizing a page
	1. Download the following files from the repository: recognize_page.py, requirements_page.txt
	2. Install tesseract-ocr
	- Linux
	```bash
	sudo apt install tesseract-ocr
	```
	- macOS
	```bash
	brew install tesseract
	```
	- Windows
	1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

	2. Run the installer and install to C:\Program Files (x86)\Tesseract-OCR

	3. Open a Command Prompt or an Anaconda Prompt.

	3. Install Poppler

	- Linux
	```bash
	sudo apt-get install poppler-utils
	```
	- macOS
	```bash
	brew install poppler
	```
	- Windows
	1. Download the latest Poppler package for Windows from @oschwartz10612's release, which is the most up to date.
	2. Move the extracted directory to the desired location on your system.
	3. Add its bin/ directory to your PATH.
	4. Verify the installation by opening cmd and checking that `pdftoppm -h` runs.

	4. Install dependencies:
	```bash
	pip install -r requirements_page.txt
	```
	5. Run the following code:
	```python
	from recognize_page import recognize_page

	page_path = "cleaned_pages/C2V10/page11.png"
	text = recognize_page(page_path, text_output_path="output/file.txt")

	print("Page text:\n")
	print(text)
	```
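
The card does not describe how `recognize_page.py` works internally. TrOCR models read one text line at a time, and Tesseract is a prerequisite, so one plausible arrangement is that Tesseract finds the text lines on the page and each line crop is then passed to the model. The sketch below shows only the line-grouping step under that assumption. `line_boxes` is a hypothetical helper, and its input follows the dictionary layout returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def line_boxes(data: Dict[str, list]) -> List[Box]:
    """Merge word boxes that share a (block, paragraph, line) key into line boxes."""
    lines: Dict[tuple, Box] = {}
    for i, text in enumerate(data["text"]):
        if not str(text).strip():
            continue  # rows without text (page, block, line entries) are skipped
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        left, top = data["left"][i], data["top"][i]
        right, bottom = left + data["width"][i], top + data["height"][i]
        if key in lines:
            l, t, r, b = lines[key]
            # Grow the existing line box so it also covers this word.
            lines[key] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
        else:
            lines[key] = (left, top, right, bottom)
    # Dicts preserve insertion order, so lines come back in reading order.
    return list(lines.values())
```

Each returned box is in the `(left, upper, right, lower)` order that `PIL.Image.crop` expects, so the crops could be fed to the model one line at a time.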

	## Usage for recognizing a book from a PDF file
	1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt
	2. Install tesseract-ocr
	- Linux
	```bash
	sudo apt install tesseract-ocr
	```
	- macOS
	```bash
	brew install tesseract
	```
	- Windows
	1. Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.

	2. Run the installer and install to C:\Program Files (x86)\Tesseract-OCR

	3. Open a Command Prompt or an Anaconda Prompt.

	3. Install Poppler

	- Linux
	```bash
	sudo apt-get install poppler-utils
	```
	- macOS
	```bash
	brew install poppler
	```
	- Windows
	1. Download the latest Poppler package for Windows from @oschwartz10612's release, which is the most up to date.
	2. Move the extracted directory to the desired location on your system.
	3. Add its bin/ directory to your PATH.
	4. Verify the installation by opening cmd and checking that `pdftoppm -h` runs.

	4. Install dependencies:
	```bash
	pip install -r requirements_book.txt
	```

	5. Run the following code:
	```python
	from recognize_book import recognize_book

	recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
	```
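
Before running either script, it can help to confirm that the external tools from the installation steps are reachable from Python. This is a minimal sketch: `missing_tools` is a hypothetical helper, and it assumes the executables are named `tesseract` and `pdftoppm` (the latter ships with Poppler) and are expected on PATH.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("tesseract", "pdftoppm"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both available.")
```

The `which` parameter exists only so the lookup can be replaced in tests. In normal use, call `missing_tools()` with no arguments.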

	## Metrics on the test set
	- CER (Character Error Rate) = 0.095
	- WER (Word Error Rate) = 0.298
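
For reference, CER and WER are conventionally defined as the edit distance between the recognized text and the reference text, divided by the reference length in characters or words. The card does not say how its numbers were computed (per-line averaging, text normalization, or the library used), so the sketch below shows only the standard definition. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # delete ref_item
                current[j - 1] + 1,      # insert hyp_item
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a CER of 0.095 corresponds to roughly 9.5 character edits per 100 reference characters.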