TavBERT base model

An Arabic BERT-style masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).
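As a rough illustration of what masking a span of characters means, the sketch below replaces a contiguous run of characters with one mask token each and keeps the original characters as prediction targets. It is not the authors' pre-training code: the helper names `mask_char_span` and `random_span` are hypothetical, the `[MASK]` string is assumed from the usage example in this card, and the actual span-length distribution used in pre-training is not specified here.

```python
import random

MASK = "[MASK]"  # assumed mask token string, as used in this card's usage example


def mask_char_span(text, start, span_len):
    """Replace span_len characters of text, beginning at start, with one mask token each.

    Returns the masked character sequence and the list of original characters
    that a model would be asked to recover.
    """
    if span_len <= 0 or start < 0 or start + span_len > len(text):
        raise ValueError("span must lie inside the text")
    chars = list(text)
    masked = chars[:start] + [MASK] * span_len + chars[start + span_len:]
    targets = chars[start:start + span_len]
    return masked, targets


def random_span(text, span_len, rng=random):
    """Mask a span of span_len characters at a uniformly random position (hypothetical helper)."""
    # randrange excludes the upper bound, so + 1 lets the span end on the last character.
    start = rng.randrange(0, len(text) - span_len + 1)
    return mask_char_span(text, start, span_len)
```

Calling `mask_char_span("hello world", 2, 3)` masks the characters at positions 2 to 4 and returns them as the targets; each masked character corresponds to exactly one mask token, which is what lets the usage code index predictions by character position.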

How to use

import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("tau/tavbert-ar")
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-ar")

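def mask_sentence(sent, span_len=5):
    # Pick a random start so the masked span fits inside the sentence
    # (np.random.randint excludes the upper bound, hence the + 1).
    start_pos = np.random.randint(0, len(sent) - span_len + 1)
    # One [MASK] token per masked character.
    masked_sent = sent[:start_pos] + '[MASK]' * span_len + sent[start_pos + span_len:]
    print("Masked sentence:", masked_sent)
    inputs = tokenizer(masked_sent, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        # Drop the first and last positions (special tokens added by the
        # tokenizer) so that indices line up with character positions.
        logits = model(**inputs).logits[0][1:-1]
    # Most likely token id at each masked position.
    preds = logits[start_pos:start_pos + span_len].argmax(dim=-1).tolist()
    pred_sent = sent[:start_pos] + ''.join(tokenizer.convert_ids_to_tokens(preds)) + sent[start_pos + span_len:]
    print("Model's prediction:", pred_sent)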

Training data

OSCAR (Ortiz, 2019) Arabic section (32 GB text, 67 million sentences).
