---
language:
- ru
---
This is a base Longformer model for the Russian language.
It was initialized from [blinoff/roberta-base-russian-v0](https://huggingface.co/blinoff/roberta-base-russian-v0) weights and has been modified to support a context length of up to 4096 tokens.
We fine-tuned it on a dataset of Russian books. For details, check out our post on Habr.
Model attributes:
* 12 attention heads
* 12 hidden layers
* context length of 4096 tokens
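
These attributes can be sanity-checked against the published configuration. A minimal sketch using the standard `transformers` config fields (the printed values should match the list above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('kazzand/ru-longformer-base-4096')
print(config.num_attention_heads)      # attention heads
print(config.num_hidden_layers)        # hidden layers
print(config.max_position_embeddings)  # maximum positions (4096 usable tokens plus special-token positions)
```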
The model can be used as-is to produce text embeddings or it can be further fine-tuned for a specific downstream task.
Text embeddings can be produced as follows:
```python
# pip install transformers sentencepiece
import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')
def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    # set global attention on the CLS token so it attends to the whole sequence
    global_attention_mask = [
        [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch["input_ids"]
    ]

    # add the global attention mask to the batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))

    # the hidden state of the CLS token is used as the text embedding
    return output.last_hidden_state[:, 0, :]
```
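
For example, with an arbitrary Russian sentence (any input up to 4096 tokens works):

```python
embedding = get_cls_embedding("Привет, мир!", model, tokenizer, device='cuda')
print(embedding.shape)  # torch.Size([1, 768]), the hidden size of the base model
```

For the downstream fine-tuning mentioned above, the same checkpoint can be loaded under a task head instead. A minimal sketch for sequence classification (the `num_labels` value is a placeholder for an arbitrary binary task, not something defined by this model):

```python
from transformers import LongformerForSequenceClassification

# hypothetical setup: attach a classification head to the pretrained encoder
clf = LongformerForSequenceClassification.from_pretrained(
    'kazzand/ru-longformer-base-4096',
    num_labels=2,  # placeholder label count for an arbitrary binary task
)
# train as usual, e.g. with transformers.Trainer; the 4096-token context
# mainly pays off for documents longer than a standard 512-token limit
```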
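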