---
license: apache-2.0
datasets:
- financial_phrasebank
- pauri32/fiqa-2018
- zeroshot/twitter-financial-news-sentiment
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- finance
---


We collect financial domain terms from Investopedia's Financial Terms Dictionary, NYSSCPA's Accounting Terminology Guide,
and Harvey's Hypertextual Finance Glossary to expand RoBERTa's vocabulary.
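
A minimal sketch of this vocabulary-expansion step, assuming the collected glossary terms have been cleaned into a plain-text file (`financial_terms.txt` is a hypothetical name); this is not the exact script used:

```python
# Expand RoBERTa's vocabulary with collected financial terms.
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Hypothetical file with one financial term per line.
with open("financial_terms.txt") as f:
    new_terms = [line.strip() for line in f if line.strip()]

# add_tokens returns how many terms were not already in the vocabulary.
num_added = tokenizer.add_tokens(new_terms)
print(f"Added {num_added} financial terms")

# Resize the embedding matrix so the new token ids have rows;
# the new rows are randomly initialized and learned during continual pretraining.
model.resize_token_embeddings(len(tokenizer))
```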

Based on this financial-term-augmented RoBERTa, we continued pretraining on multiple financial corpora (a loading sketch for the Hub-hosted datasets follows the list):

- Financial Terms
  - [Investopedia's Financial Terms Dictionary](https://www.investopedia.com/financial-term-dictionary-4769738)
  - [NYSSCPA's accounting terminology guide](https://www.nysscpa.org/professional-resources/accounting-terminology-guide)
  - [Harvey's Hypertextual Finance Glossary](https://people.duke.edu/~charvey/Classes/wpg/glossary.htm)
- Financial Datasets
  - [FPB](https://huggingface.co/datasets/financial_phrasebank)
  - [FiQA SA](https://huggingface.co/datasets/pauri32/fiqa-2018)
  - [SemEval2017 Task5](https://aclanthology.org/S17-2089/)
  - [Twitter Financial News Sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)
- Earnings Call Transcripts
  Earnings call transcripts of NASDAQ-100 component stocks from 2016 to 2023.
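
For reference, the three Hub-hosted datasets can be pulled with the `datasets` library; the `financial_phrasebank` configuration shown is an assumption, since the card does not state which annotator-agreement split was used:

```python
# Load the public financial sentiment datasets from the Hugging Face Hub.
# SemEval-2017 Task 5 and the earnings-call transcripts are not on the Hub
# and are therefore omitted from this sketch.
from datasets import load_dataset

fpb = load_dataset("financial_phrasebank", "sentences_allagree")  # config is an assumption
fiqa = load_dataset("pauri32/fiqa-2018")
twitter = load_dataset("zeroshot/twitter-financial-news-sentiment")

print(fpb, fiqa, twitter, sep="\n")
```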


In the continual pretraining step, we apply the following experimental settings to achieve better fine-tuned results on the four financial datasets (a minimal Trainer sketch follows the list):

1. Masking Probability: 0.4 (instead of the default 0.15)
2. Warmup Steps: 0 (yields better results than using warmup)
3. Epochs: 1 (sufficient; more epochs risk overfitting)
4. weight_decay: 0.01
5. Train Batch Size: 64
6. FP16
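
A minimal `Trainer` sketch with these settings, assuming the vocabulary-expanded model from above; `financial_phrasebank` stands in for the full pretraining corpus, and the batch size is assumed to be per device:

```python
# Continual MLM pretraining sketch with the settings listed above.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
# Resize for the added financial terms (a no-op in this self-contained sketch).
model.resize_token_embeddings(len(tokenizer))

# Stand-in corpus; the real run used the corpora listed above.
raw = load_dataset("financial_phrasebank", "sentences_allagree", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Masking probability raised from the default 0.15 to 0.4.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.4
)

args = TrainingArguments(
    output_dir="./roberta-financial",   # placeholder path
    num_train_epochs=1,                 # a single epoch
    warmup_steps=0,                     # no warmup
    weight_decay=0.01,
    per_device_train_batch_size=64,     # assumed to be per device
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```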