---
language:
- ru
---
|
|
|
This is a base Longformer model for the Russian language.
|
It was initialized from [blinoff/roberta-base-russian-v0](https://huggingface.co/blinoff/roberta-base-russian-v0) weights and modified to support a context length of up to 4096 tokens.
|
We fine-tuned it on a dataset of Russian books. For detailed information, check out our post on Habr.
|
|
|
Model attributes: |
|
* 12 attention heads
* 12 hidden layers
* context length of 4096 tokens
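
These values can be checked directly from the published config (a quick sanity check using the standard `transformers` API; note that `max_position_embeddings` may be slightly larger than 4096 to account for special-token positions):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('kazzand/ru-longformer-base-4096')
print(config.num_attention_heads)      # attention heads
print(config.num_hidden_layers)        # hidden layers
print(config.max_position_embeddings)  # maximum input positions
```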
|
|
|
The model can be used as is to produce text embeddings, or it can be further fine-tuned for a specific downstream task (see the sketch below).
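
For the fine-tuning route, a minimal setup could look like the following sketch. It is illustrative and not part of the original card: the classification task and `num_labels=2` are placeholders for your own downstream problem.

```python
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

# Pretrained encoder plus a freshly initialized classification head;
# num_labels=2 is a placeholder for your own task.
clf_model = LongformerForSequenceClassification.from_pretrained(
    'kazzand/ru-longformer-base-4096',
    num_labels=2,
)
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

# From here, train as usual, e.g. with transformers.Trainer or a manual PyTorch loop.
```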
|
|
|
Text embeddings can be produced as follows: |
|
|
|
```python
# pip install transformers sentencepiece
import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    # set global attention on the CLS token
    global_attention_mask = [
        [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch["input_ids"]
    ]

    # add the global attention mask to the batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))
    return output.last_hidden_state[:, 0, :]
```
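
Example usage (the sample text is illustrative; for a base-sized model the returned tensor has shape `[1, hidden_size]`):

```python
text = "Пример текста на русском языке."
embedding = get_cls_embedding(text, model, tokenizer, device='cuda')
print(embedding.shape)  # e.g. torch.Size([1, 768])
```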
|