---
language:
- ru
---
|
|
|
This is a base Longformer model for the Russian language.
|
It was initialized from [blinoff/roberta-base-russian-v0](https://huggingface.co/blinoff/roberta-base-russian-v0) weights and modified to support a context length of up to 4096 tokens.
|
We fine-tuned it on a dataset of Russian books. For detailed information, check out our post on Habr.
|
|
|
Model attributes: |
|
* 12 attention heads
* 12 hidden layers
* context length of 4096 tokens
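
These values can be checked directly from the published config (a quick sanity check using the standard `transformers` API; note that `max_position_embeddings` may be slightly larger than 4096 to account for special-token positions):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('kazzand/ru-longformer-base-4096')
print(config.num_attention_heads)      # attention heads
print(config.num_hidden_layers)        # hidden layers
print(config.max_position_embeddings)  # maximum input positions
```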
|
|
|
The model can be used as is to produce text embeddings, or it can be further fine-tuned for a specific downstream task (see the sketch below).
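
For the fine-tuning route, a minimal setup could look like the following sketch. It is illustrative and not part of the original card: the classification task and `num_labels=2` are placeholders for your own downstream problem.

```python
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

# Pretrained encoder plus a freshly initialized classification head;
# num_labels=2 is a placeholder for your own task.
clf_model = LongformerForSequenceClassification.from_pretrained(
    'kazzand/ru-longformer-base-4096',
    num_labels=2,
)
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

# From here, train as usual, e.g. with transformers.Trainer or a manual PyTorch loop.
```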
|
|
|
Text embeddings can be produced as follows: |
|
|
|
```python
# pip install transformers sentencepiece
import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    # set global attention on the CLS token
    global_attention_mask = [
        [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch["input_ids"]
    ]

    # add the global attention mask to the batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))
    return output.last_hidden_state[:, 0, :]
```
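
Example usage (the sample text is illustrative; for a base-sized model the returned tensor has shape `[1, hidden_size]`):

```python
text = "Пример текста на русском языке."
embedding = get_cls_embedding(text, model, tokenizer, device='cuda')
print(embedding.shape)  # e.g. torch.Size([1, 768])
```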
|