Financbase Financial QA Dataset
Dataset Description
The Financbase Financial QA Dataset is a curated collection of financial question-answering examples for training large language models on financial domain tasks. It covers three task types: question answering, sentiment analysis, and document summarization.
Dataset Summary
- Total Examples: 1,000+ financial Q&A pairs
- Format: JSONL (JSON Lines)
- Language: English
- Domain: Financial services, SEC filings, investment analysis
- Tasks: Question answering, sentiment classification, summarization
Dataset Structure
Each example follows the instruction-tuning format with three fields:
{
  "instruction": "Answer the question clearly for a retail investor.",
  "input": "What is EBITDA?",
  "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
}
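Since the files are JSONL, each line is one JSON object with these three fields. The following is a minimal sketch of a per-line check, not part of the dataset tooling: `parse_record` and `REQUIRED_FIELDS` are hypothetical names introduced here, and the sample line is a shortened version of the example above.

```python
import json

# Hypothetical helper names; the dataset itself ships no validation code.
REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it has the three string fields."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each JSONL line must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    return record


# A shortened sample line in the same shape as the example above.
sample_line = json.dumps({
    "instruction": "Answer the question clearly for a retail investor.",
    "input": "What is EBITDA?",
    "output": "EBITDA stands for Earnings Before Interest, Taxes, "
              "Depreciation, and Amortization.",
})
record = parse_record(sample_line)
print(sorted(record))
```

Running a check like this over a local copy of the file can catch truncated or malformed lines before training starts.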
Supported Tasks
Financial Question Answering
- Basic financial concepts (EBITDA, P/E ratio, etc.)
- Investment terminology
- Market analysis questions
Sentiment Analysis
- Financial news sentiment classification
- Earnings report sentiment
- Market outlook analysis
Document Summarization
- SEC filing summaries
- Earnings call summaries
- Financial report abstracts
Usage
Loading the Dataset
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
# Access examples
for example in dataset:
    print(f"Instruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Output: {example['output']}")
Training with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# Load dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
# Format for training
def format_example(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
# Apply formatting
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})
Using with PEFT/LoRA
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# Load base model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# Apply LoRA
model = get_peft_model(model, lora_config)
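To get a feel for what this configuration trains, the adapter size can be estimated by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below applies that to the four target modules. The layer shapes (hidden size 4096, key/value projection width 1024, 32 layers) are assumptions taken from the published Mistral-7B configuration, so check them against `model.config` before relying on the number.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Mistral-7B shapes; verify against model.config.
HIDDEN = 4096      # hidden_size
KV_DIM = 1024      # num_key_value_heads * head_dim (grouped-query attention)
NUM_LAYERS = 32    # num_hidden_layers

R = 16             # r from the LoraConfig above
LORA_ALPHA = 32    # lora_alpha from the LoraConfig above

# (in_features, out_features) for each targeted projection.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA weights: {total:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumed shapes the adapters add roughly 13.6 million weights, a small fraction of the base model. After `get_peft_model`, calling `model.print_trainable_parameters()` reports the actual figure.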
Data Fields
Field | Type | Description |
---|---|---|
instruction | string | The task instruction or prompt |
input | string | The input context or question |
output | string | The expected response or answer |
Data Splits
- train: 1,000+ examples for training (the only split currently available)
- validation: 100+ examples for validation (planned for a future release)
- test: 100+ examples for testing (planned for a future release)
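Until the validation and test splits are released, a held-out set has to be carved out of `train`. The following is a minimal, library-free sketch of a seeded split over a list of examples. `holdout_split` is a hypothetical helper, and the 10% fraction and seed are arbitrary choices for illustration, not values used by the dataset authors.

```python
import random


def holdout_split(examples, eval_fraction=0.1, seed=42):
    """Split examples into (train, held_out) with a reproducible shuffle."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_eval = max(1, round(len(examples) * eval_fraction))
    eval_indices = set(indices[:n_eval])
    train = [ex for i, ex in enumerate(examples) if i not in eval_indices]
    held_out = [ex for i, ex in enumerate(examples) if i in eval_indices]
    return train, held_out


# Placeholder records standing in for real dataset rows.
examples = [
    {"instruction": "Answer the question.", "input": f"q{i}", "output": f"a{i}"}
    for i in range(1000)
]
train, held_out = holdout_split(examples)
print(len(train), len(held_out))
```

When working with a `datasets.Dataset` object directly, `dataset.train_test_split(test_size=0.1, seed=42)` does the same job and returns a `DatasetDict` with `train` and `test` keys.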
Data Collection
Sources
- SEC 10-K filings (processed and chunked)
- Financial news articles
- Investment research reports
- Financial education materials
- Curated financial Q&A pairs
Preprocessing
- Document Chunking: Long documents split into ≤1800 token chunks
- Section Preservation: Maintains document structure and headings
- Quality Filtering: Removes low-quality or irrelevant examples
- Format Standardization: Ensures consistent instruction/input/output format
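The chunking step can be illustrated with a short sketch. This is not the pipeline used to build the dataset: it assumes whitespace-separated words as a stand-in for tokenizer tokens, it ignores section boundaries, and `chunk_tokens` is a hypothetical helper name. A real pipeline would count tokens with the target model's tokenizer.

```python
def chunk_tokens(tokens, max_tokens=1800):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        tokens[start:start + max_tokens]
        for start in range(0, len(tokens), max_tokens)
    ]


# Assumption: whitespace words approximate tokens for this illustration.
document = "revenue " * 4000
tokens = document.split()

chunks = chunk_tokens(tokens)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} tokens")
```

A fixed-size split like this can cut a sentence or a filing section in half, which is why the section-preservation step matters in practice.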
Compliance and Safety
Financial Compliance
- No Investment Advice: Dataset does not contain personalized investment recommendations
- Educational Purpose: Designed for educational and research use
- Source Attribution: All examples traceable to original sources
- Regulatory Compliance: Follows financial data handling best practices
Content Filtering
- Removed personally identifiable information (PII)
- Filtered out actionable trading directives
- Excluded copyrighted material
- Sanitized sensitive financial data
Evaluation
Metrics
- Perplexity: How well the model predicts financial text (lower is better)
- BLEU Score: N-gram overlap between generated and reference text for summarization tasks
- Accuracy: Classification accuracy for sentiment analysis
- ROUGE Score: Overlap between generated and reference summaries
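Two of these metrics are simple enough to compute by hand. The sketch below assumes per-token natural-log probabilities are already available from the model, and that sentiment predictions and labels are plain strings. The function names are illustrative and not part of any evaluation harness shipped with the dataset.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean log-probability per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among four options.
ppl = perplexity([math.log(0.25)] * 4)

acc = accuracy(
    ["positive", "negative", "neutral", "neutral"],
    ["positive", "negative", "neutral", "positive"],
)
print(f"perplexity={ppl:.2f} accuracy={acc:.2f}")
```

BLEU and ROUGE need n-gram matching against reference texts and are better taken from an established implementation than rewritten.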
Benchmark Tasks
- Financial QA: Answer financial questions accurately
- Sentiment Analysis: Classify financial sentiment (positive/negative/neutral)
- Summarization: Summarize financial documents concisely
Limitations
- Language: English only
- Domain: Primarily US financial markets
- Temporal: Data from 2020-2024 (may become outdated)
- Bias: Reflects the biases and limitations of its source material
Citation
@dataset{financbase_financial_qa_2024,
title={Financbase Financial QA Dataset},
author={Financbase Team},
year={2024},
url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
license={MIT}
}
License
This dataset is released under the MIT License. See the LICENSE file for details.
Contact
- Organization: Financbase
- Repository: https://huggingface.co/datasets/Financbase/financbase-10k-jsonl
- Issues: Report issues via the Hugging Face Hub
Changelog
- v0.1 (2024-12-19): Initial release with 1,000+ financial Q&A examples
- v0.2 (Planned): Add validation and test splits
- v0.3 (Planned): Expand to 10,000+ examples with more diverse sources