---
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-finetuned-finance
  results: []
license: apache-2.0
datasets:
- causal-lm/finance
- gbharti/finance-alpaca
- PaulAdversarial/all_news_finance_sm_1h2023
- winddude/reddit_finance_43_250k
language:
- en
---


# distilgpt2-finetuned-finance

This model is a fine-tuned version of distilgpt2 on a combination of four finance datasets:
- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance)
- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023)
- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k)


## Training and evaluation data

One can reproduce the dataset using the following code:

```python
from datasets import concatenate_datasets, load_dataset

# load dataset
dataset_1 = load_dataset("gbharti/finance-alpaca")
dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
dataset_4 = load_dataset("causal-lm/finance")

# create a column called text
dataset_1 = dataset_1.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])

dataset_2 = dataset_2.map(
    lambda example: {"text": example["title"] + " " + example["description"]},
    num_proc=4,
)
dataset_2 = dataset_2.remove_columns(
    ["_id", "main_domain", "title", "description", "created_at"]
)

dataset_3 = dataset_3.map(
    lambda example: {
        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
    },
    num_proc=4,
)
dataset_3 = dataset_3.remove_columns(
    [
        "id",
        "title",
        "selftext",
        "z_score",
        "normalized_score",
        "subreddit",
        "body",
        "comment_normalized_score",
        "combined_score",
    ]
)

dataset_4 = dataset_4.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])

# combine and split train test sets
combined_dataset = concatenate_datasets(
    [
        dataset_1["train"],
        dataset_2["train"],
        dataset_3["train"],
        dataset_4["train"],
        dataset_4["validation"],
    ]
)

datasets = combined_dataset.train_test_split(test_size=0.2)

```
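
The `text` columns above are built with plain string concatenation, which raises a `TypeError` if any of the joined fields happens to be `None`. The following is a minimal sketch of a more tolerant variant, assuming that missing or empty fields should simply be skipped. The helper name `join_fields` is hypothetical and not part of the original notebook.

```python
def join_fields(example, fields, sep=" "):
    """Join the named fields into one string, skipping None or empty values."""
    parts = [example[name] for name in fields]
    # Unlike the original lambdas, this also strips surrounding whitespace.
    return {"text": sep.join(p.strip() for p in parts if p)}


# A hand-written row standing in for one Reddit example (illustrative only).
row = {"title": "Rates rise", "selftext": None, "body": "Markets react."}
print(join_fields(row, ["title", "selftext", "body"]))
# {'text': 'Rates rise Markets react.'}
```

It could be passed to `map` in place of the lambdas, for example `dataset_3.map(lambda ex: join_fields(ex, ["title", "selftext", "body"]), num_proc=4)`. Note also that the original `train_test_split` call does not set a seed, so passing `seed=` would be needed to get the same split on every run.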

## Inference example

```python
from transformers import pipeline

generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")

generator("Tesla is",
  pad_token_id=generator.tokenizer.eos_token_id,
  max_new_tokens=200,
  num_return_sequences=2
)

# Example output (sampled, so it will differ between runs; wrapped here for readability):
{'generated_text':
  'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not
  sure how this is going to affect the world. \n\nHowever, I would like to see this play out
  as a global financial crisis. With US interest rates already high, a crash in global real
  estate prices means that people are likely to feel pressure on assets that are less well
  served by the assets the US government gives them. \n\nWould these things help you in your
  retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a
  bit more control over your assets (I’m not super involved in stock picking, but I’ve heard
  many times that governments can help their citizens), right? As another commenter has put
  it: there\'s something called a market crash that could occur in the second world country
  for most markets (I don\'t know how that would fit under US laws if I had done all of the
  above. \n\n'
},
{'generated_text':
  "Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far
  in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s
  growth, earnings also aren’t expected to be high at all, the company's annual earnings could
  be an indication that investors have a strong plan to boost sales by the end of the year if
  earnings season continues.\n\nThe latest financials showed earnings as of the end of July,
  followed by the earnings guidance from analysts at the Canadian Real Estate Association, which
  showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was
  expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June
  2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for
  2020, which would lead to Amazon’"
}
```
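
For a single prompt, the text-generation pipeline returns a list of dictionaries, each with a `generated_text` key, as in the output above. The sketch below shows one way to pull out the strings and to fix the random seed so that sampling is repeatable. The function names `extract_texts` and `generate_samples` and the default seed are assumptions made for illustration; they do not come from the original notebook.

```python
def extract_texts(outputs):
    """Return the generated strings from a text-generation pipeline result."""
    return [item["generated_text"] for item in outputs]


def generate_samples(prompt, seed=42, max_new_tokens=200, n=2):
    """Generate n continuations of prompt with a fixed seed (downloads the model)."""
    # Imported lazily so extract_texts can be used without transformers installed.
    from transformers import pipeline, set_seed

    set_seed(seed)
    generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")
    outputs = generator(
        prompt,
        pad_token_id=generator.tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
    )
    return extract_texts(outputs)


# Hand-written stand-in for a pipeline result, used to show the output shape.
fake_outputs = [{"generated_text": "Tesla is A"}, {"generated_text": "Tesla is B"}]
print(extract_texts(fake_outputs))
# ['Tesla is A', 'Tesla is B']
```

Calling `generate_samples("Tesla is")` would reproduce the example above in spirit, although the exact text depends on the seed and library versions.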

## Training procedure

Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 64
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50
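
The `total_train_batch_size` listed above is consistent with the per-device batch size multiplied by the number of gradient accumulation steps. A small sketch of that arithmetic follows, assuming training ran on a single device (the card does not state the device count, so `num_devices` is an assumption):

```python
per_device_train_batch_size = 4   # train_batch_size from the list above
gradient_accumulation_steps = 64  # from the list above
num_devices = 1                   # assumption: not stated in the model card

# Number of examples that contribute to each optimizer step.
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)
# 256
```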

### Framework versions

- Transformers 4.30.2
- PyTorch 2.0.1+cu117
- Datasets 2.13.1
- Tokenizers 0.13.3