|
---
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-finetuned-finance
  results: []
license: apache-2.0
datasets:
- causal-lm/finance
- gbharti/finance-alpaca
- PaulAdversarial/all_news_finance_sm_1h2023
- winddude/reddit_finance_43_250k
language:
- en
---
|
|
|
|
|
|
# distilgpt2-finetuned-finance |
|
|
|
This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2) on the combination of four finance datasets:
|
- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance) |
|
- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca) |
|
- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023) |
|
- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k) |
|
|
|
|
|
## Training and evaluation data |
|
|
|
One can reproduce the combined dataset with the following code:
|
|
|
```python
from datasets import load_dataset, concatenate_datasets

# load the four finance datasets from the Hugging Face Hub
dataset_1 = load_dataset("gbharti/finance-alpaca")
dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
dataset_4 = load_dataset("causal-lm/finance")

# merge each dataset's fields into a single "text" column
dataset_1 = dataset_1.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])

dataset_2 = dataset_2.map(
    lambda example: {"text": example["title"] + " " + example["description"]},
    num_proc=4,
)
dataset_2 = dataset_2.remove_columns(
    ["_id", "main_domain", "title", "description", "created_at"]
)

dataset_3 = dataset_3.map(
    lambda example: {
        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
    },
    num_proc=4,
)
dataset_3 = dataset_3.remove_columns(
    [
        "id",
        "title",
        "selftext",
        "z_score",
        "normalized_score",
        "subreddit",
        "body",
        "comment_normalized_score",
        "combined_score",
    ]
)

dataset_4 = dataset_4.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])

# concatenate all available splits, then carve out a 20% held-out test set
combined_dataset = concatenate_datasets(
    [
        dataset_1["train"],
        dataset_2["train"],
        dataset_3["train"],
        dataset_4["train"],
        dataset_4["validation"],
    ]
)

datasets = combined_dataset.train_test_split(test_size=0.2)
```
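Before training, the `text` column still has to be tokenized and packed into fixed-length blocks for causal language modeling. The exact preprocessing lives in the training notebook linked under *Training procedure*; as a rough guide, a minimal sketch in the style of the standard Hugging Face `run_clm` recipe might look like this (the `block_size` of 512 and the helper names are illustrative assumptions, not values taken from the actual run):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# hypothetical block size; the actual run may have used a different value
block_size = 512

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # concatenate all token ids in the batch, then split into fixed-length blocks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # for causal LM, the labels are the input ids themselves
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = datasets.map(tokenize_function, batched=True, remove_columns=["text"])
lm_datasets = tokenized.map(group_texts, batched=True)
```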
|
|
|
## Inference example |
|
|
|
```python
from transformers import pipeline

generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")

generator(
    "Tesla is",
    pad_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
    num_return_sequences=2,
)

>>> [{'generated_text':
'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not
sure how this is going to affect the world. \n\nHowever, I would like to see this play out
as a global financial crisis. With US interest rates already high, a crash in global real
estate prices means that people are likely to feel pressure on assets that are less well
served by the assets the US government gives them. \n\nWould these things help you in your
retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a
bit more control over your assets (I’m not super involved in stock picking, but I’ve heard
many times that governments can help their citizens), right? As another commenter has put
it: there\'s something called a market crash that could occur in the second world country
for most markets (I don\'t know how that would fit under US laws if I had done all of the
above. \n\n'
},
{'generated_text':
"Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far
in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s
growth, earnings also aren’t expected to be high at all, the company's annual earnings could
be an indication that investors have a strong plan to boost sales by the end of the year if
earnings season continues.\n\nThe latest financials showed earnings as of the end of July,
followed by the earnings guidance from analysts at the Canadian Real Estate Association, which
showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was
expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June
2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for
2020, which would lead to Amazon’"
}]
```
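A note on the arguments: GPT-2-style models have no dedicated padding token, so the tokenizer's EOS token is passed as `pad_token_id` to silence the corresponding warning. Generation is sampled whenever `do_sample` is enabled (the upstream GPT-2 configs turn it on for text generation by default), so the outputs above are illustrative samples and will differ from run to run.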
|
|
|
## Training procedure |
|
|
|
Training notebook: [finetune_distilgpt2_language_model_on_finance_dataset.ipynb](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a code sketch of these settings follows the list):
|
- learning_rate: 2e-05 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 4 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 64 |
|
- total_train_batch_size: 256 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 50 |
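
Expressed as code, the settings above correspond roughly to the `TrainingArguments` below. This is a sketch for orientation, not the exact configuration from the run: `output_dir` is an illustrative choice, and the linked notebook remains the authoritative reference.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilgpt2-finetuned-finance",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=42,
    gradient_accumulation_steps=64,  # 4 x 64 = 256 effective train batch size
    lr_scheduler_type="linear",
    num_train_epochs=50,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-8 matches the Trainer defaults
)
```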
|
|
|
### Framework versions |
|
|
|
- Transformers 4.30.2 |
|
- Pytorch 2.0.1+cu117 |
|
- Datasets 2.13.1 |
|
- Tokenizers 0.13.3 |