rrivera1849/LUAR-MUD

Author Style Representations using LUAR.

The LUAR training and evaluation repository can be found here.

This model was trained on the Reddit Million User Dataset (MUD) found here.

Usage

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rrivera1849/LUAR-MUD")
model = AutoModel.from_pretrained("rrivera1849/LUAR-MUD")

# we embed `episodes`, a colletion of documents presumed to come from an author
# NOTE: make sure that `episode_length` consistent across `episode`
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text, 
    max_length=32,
    padding="max_length", 
    truncation=True,
    return_tensors="pt"
)
# inputs size: (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size())       # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size())  # torch.Size([3, 16, 32])

out = model(**tokenized_text)
print(out.size())   # torch.Size([3, 512])

# to get the Transformer attentions:
out, attentions = model(**tokenized_text, output_attentions=True)
print(attentions[0].size())     # torch.Size([48, 12, 32, 32])

Citing & Authors

If you find this model helpful, feel free to cite our publication.

@inproceedings{uar-emnlp2021,
  author    = {Rafael A. Rivera Soto and Olivia Miano and Juanita Ordonez and Barry Chen and Aleem Khan and Marcus Bishop and Nicholas Andrews},
  title     = {Learning Universal Authorship Representations},
  booktitle = {EMNLP},
  year      = {2021},
}

License

LUAR is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 licenses.

Downloads last month
617
Safetensors
Model size
82.5M params
Tensor type
F32
·
Inference Examples
Inference API (serverless) does not yet support model repos that contain custom code.

Model tree for rrivera1849/LUAR-MUD

Finetunes
1 model

Collection including rrivera1849/LUAR-MUD