Financbase Financial QA Dataset

Dataset Description

The Financbase Financial QA Dataset is a curated collection of financial question-answering examples designed for training large language models on financial domain tasks. This dataset supports multiple financial AI tasks including question answering, sentiment analysis, and document summarization.

Dataset Summary

  • Total Examples: 1,000+ financial Q&A pairs
  • Format: JSONL (JSON Lines)
  • Language: English
  • Domain: Financial services, SEC filings, investment analysis
  • Tasks: Question answering, sentiment classification, summarization

Dataset Structure

Each example follows the instruction-tuning format with three fields:

{
  "instruction": "Answer the question clearly for a retail investor.",
  "input": "What is EBITDA?",
  "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
}
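Since the dataset ships as JSONL, each line is one standalone JSON object with exactly these three fields. A minimal sketch of reading a record (field values abbreviated from the example above):

```python
import json

# One JSONL line from the dataset (output value abbreviated)
line = (
    '{"instruction": "Answer the question clearly for a retail investor.", '
    '"input": "What is EBITDA?", '
    '"output": "EBITDA stands for Earnings Before Interest, Taxes, '
    'Depreciation, and Amortization."}'
)

record = json.loads(line)

# Every record carries the same three keys
assert set(record) == {"instruction", "input", "output"}
print(record["input"])  # What is EBITDA?
```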

Supported Tasks

  1. Financial Question Answering

    • Basic financial concepts (EBITDA, P/E ratio, etc.)
    • Investment terminology
    • Market analysis questions
  2. Sentiment Analysis

    • Financial news sentiment classification
    • Earnings report sentiment
    • Market outlook analysis
  3. Document Summarization

    • SEC filing summaries
    • Earnings call summaries
    • Financial report abstracts

Usage

Loading the Dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Access examples
for example in dataset:
    print(f"Instruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Output: {example['output']}")

Training with Transformers

from transformers import AutoTokenizer
from datasets import load_dataset

# Load dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Format for training
def format_example(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"

# Apply formatting
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})

# Tokenize the formatted text
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenized_dataset = formatted_dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
    batched=True,
)

Using with PEFT/LoRA

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Configure LoRA adapters on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA and report how many parameters remain trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Data Fields

Field        Type    Description
-----------  ------  --------------------------------
instruction  string  The task instruction or prompt
input        string  The input context or question
output       string  The expected response or answer

Data Splits

  • train: 1,000+ examples for training
  • validation: 100+ examples for validation (future release)
  • test: 100+ examples for testing (future release)

Data Collection

Sources

  • SEC 10-K filings (processed and chunked)
  • Financial news articles
  • Investment research reports
  • Financial education materials
  • Curated financial Q&A pairs

Preprocessing

  1. Document Chunking: Long documents split into chunks of ≤1,800 tokens
  2. Section Preservation: Maintains document structure and headings
  3. Quality Filtering: Removes low-quality or irrelevant examples
  4. Format Standardization: Ensures consistent instruction/input/output format
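The chunking step above can be sketched as follows. The tokenizer used by the actual pipeline is not specified here, so whitespace tokens stand in for illustration; `chunk_document` is a hypothetical name.

```python
def chunk_document(text: str, max_tokens: int = 1800) -> list[str]:
    """Split a document into chunks of at most max_tokens tokens.

    Whitespace splitting stands in for the pipeline's real tokenizer;
    a production version would also respect section boundaries.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```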

Compliance and Safety

Financial Compliance

  • No Investment Advice: Dataset does not contain personalized investment recommendations
  • Educational Purpose: Designed for educational and research use
  • Source Attribution: All examples traceable to original sources
  • Regulatory Compliance: Follows financial data handling best practices

Content Filtering

  • Removed personally identifiable information (PII)
  • Filtered out actionable trading directives
  • Excluded copyrighted material
  • Sanitized sensitive financial data
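As a rough illustration of the PII pass (the project's actual filters are not published; these two patterns and the `scrub_pii` name are assumptions for the sketch):

```python
import re

# Illustrative patterns only; a real pipeline would cover many more PII types
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def scrub_pii(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace matches of each PII pattern with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```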

Evaluation

Metrics

  • Perplexity: How well the model predicts held-out financial text (lower is better)
  • BLEU Score: Response quality for summarization tasks
  • Accuracy: Classification accuracy for sentiment analysis
  • ROUGE Score: Summarization quality metrics
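Of these, perplexity is simply the exponential of the model's mean per-token negative log-likelihood on held-out text; a minimal sketch:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Exponential of the mean per-token negative log-likelihood.

    Lower is better; a model with zero loss scores exactly 1.0.
    """
    return math.exp(mean_nll)

print(perplexity(0.0))  # 1.0
```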

Benchmark Tasks

  1. Financial QA: Answer financial questions accurately
  2. Sentiment Analysis: Classify financial sentiment (positive/negative/neutral)
  3. Summarization: Summarize financial documents concisely
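For the sentiment benchmark, accuracy over the three labels reduces to an exact-match comparison between predicted and gold labels (a minimal sketch; `sentiment_accuracy` is a hypothetical name, and the label set matches the task description above):

```python
def sentiment_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    assert len(predictions) == len(labels) and labels
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

preds = ["positive", "negative", "neutral", "positive"]
gold = ["positive", "negative", "negative", "positive"]
print(sentiment_accuracy(preds, gold))  # 0.75
```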

Limitations

  • Language: English only
  • Domain: Primarily US financial markets
  • Temporal: Data from 2020-2024 (may become outdated)
  • Bias: Reflects training data biases and limitations

Citation

@dataset{financbase_financial_qa_2024,
  title={Financbase Financial QA Dataset},
  author={Financbase Team},
  year={2024},
  url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
  license={MIT}
}

License

This dataset is released under the MIT License. See LICENSE file for details.

Changelog

  • v0.1 (2024-12-19): Initial release with 1,000+ financial Q&A examples
  • v0.2 (Planned): Add validation and test splits
  • v0.3 (Planned): Expand to 10,000+ examples with more diverse sources