---
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-finetuned-finance
  results: []
license: apache-2.0
datasets:
- causal-lm/finance
- gbharti/finance-alpaca
- PaulAdversarial/all_news_finance_sm_1h2023
- winddude/reddit_finance_43_250k
language:
- en
---

# distilgpt2-finetuned-finance

This model is a fine-tuned version of distilgpt2 on a combination of four finance datasets:
- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance)
- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023)
- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k)

## Training and evaluation data

The combined dataset can be reproduced with the following code:

```python
from datasets import concatenate_datasets, load_dataset

# load dataset
dataset_1 = load_dataset("gbharti/finance-alpaca")
dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
dataset_4 = load_dataset("causal-lm/finance")

# create a column called text
# (each dataset is reduced to a single "text" column so they can be concatenated)
dataset_1 = dataset_1.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])

dataset_2 = dataset_2.map(
    lambda example: {"text": example["title"] + " " + example["description"]},
    num_proc=4,
)
dataset_2 = dataset_2.remove_columns(
    ["_id", "main_domain", "title", "description", "created_at"]
)

dataset_3 = dataset_3.map(
    lambda example: {
        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
    },
    num_proc=4,
)
dataset_3 = dataset_3.remove_columns(
    [
        "id",
        "title",
        "selftext",
        "z_score",
        "normalized_score",
        "subreddit",
        "body",
        "comment_normalized_score",
        "combined_score",
    ]
)

dataset_4 = dataset_4.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])

# combine and split train test sets
combined_dataset = concatenate_datasets(
    [
        dataset_1["train"],
        dataset_2["train"],
        dataset_3["train"],
        dataset_4["train"],
        dataset_4["validation"],
    ]
)

# hold out 20% of the combined examples as the test split
datasets = combined_dataset.train_test_split(test_size=0.2)
```

## Inference example

```python
from transformers import pipeline

generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")

generator(
    "Tesla is",
    pad_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
    num_return_sequences=2,
)

>>>
{'generated_text': 'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not sure how this is going to affect the world. \n\nHowever, I would like to see this play out as a global financial crisis. With US interest rates already high, a crash in global real estate prices means that people are likely to feel pressure on assets that are less well served by the assets the US government gives them. \n\nWould these things help you in your retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a bit more control over your assets (I’m not super involved in stock picking, but I’ve heard many times that governments can help their citizens), right? As another commenter has put it: there\'s something called a market crash that could occur in the second world country for most markets (I don\'t know how that would fit under US laws if I had done all of the above. \n\n'
},
{'generated_text': "Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s growth, earnings also aren’t expected to be high at all, the company's annual earnings could be an indication that investors have a strong plan to boost sales by the end of the year if earnings season continues.\n\nThe latest financials showed earnings as of the end of July, followed by the earnings guidance from analysts at the Canadian Real Estate Association, which showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June 2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for 2020, which would lead to Amazon’"
}
```

## Training procedure

Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 64
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50

### Framework versions

- Transformers 4.30.2
- Pytorch 2.0.1+cu117
- Datasets 2.13.1
- Tokenizers 0.13.3
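
### Effective batch size and learning-rate schedule

The sketch below illustrates how the training hyperparameters listed in this card relate to one another. It shows that `total_train_batch_size` (256) is the product of `train_batch_size` (4) and `gradient_accumulation_steps` (64), and what a `linear` schedule does to the learning rate over time. The single-device setting, the absence of warmup, and the helper names `effective_batch_size` and `linear_lr` are assumptions made for illustration; none of them is stated in this card.

```python
# Values copied from the hyperparameter list in this card.
train_batch_size = 4
gradient_accumulation_steps = 64
learning_rate = 2e-05

# Assumption: training ran on a single device (not stated in this card).
num_devices = 1


def effective_batch_size(per_device: int, accumulation: int, devices: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device * accumulation * devices


def linear_lr(step: int, total_steps: int, base_lr: float = learning_rate) -> float:
    """Learning rate under a linear decay to zero.

    Assumption: no warmup phase; `total_steps` is a hypothetical value
    supplied by the caller, not a number taken from this card.
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


total = effective_batch_size(
    train_batch_size, gradient_accumulation_steps, num_devices
)
print(total)  # 256
```

Under these assumptions, gradients from 64 micro-batches of 4 examples are accumulated before each weight update, which is where the total of 256 comes from.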