title: QuantumLLMInstruct
emoji: 🦀
colorFrom: green
colorTo: indigo
sdk: docker
pinned: false
short_description: 'QuantumLLMInstruct: A 500k LLM Instruction-Tuning Dataset'

QuantumLLMInstruct: A 500k LLM Instruction-Tuning Dataset with Problem-Solution Pairs for Quantum Computing

Dataset Overview

QuantumLLMInstruct (QLMMI) is a dataset for fine-tuning and evaluating Large Language Models (LLMs) in the domain of quantum computing. It spans 90 primary quantum computing domains and contains over 500,000 curated instruction-following problem-solution pairs.

The dataset focuses on enhancing reasoning capabilities in LLMs for quantum-specific tasks, including Hamiltonian dynamics, quantum circuit optimization, and Yang-Baxter solvability.

Each entry consists of:

  • A quantum computing problem expressed in natural language and/or LaTeX.
  • A detailed step-by-step solution, designed for precision and clarity.
  • Domain-specific metadata, such as the problem's main domain, sub-domain, and associated tags.

QuantumLLMInstruct Workflow

Data Sources

Problems and solutions are generated through three complementary methods:

  1. Predefined Templates: Problems crafted using robust templates to ensure domain specificity and mathematical rigor.
  2. LLM-Generated Problems: Models such as Qwen-2.5-Coder autonomously generate complex problems across diverse quantum topics, including:
    • Synthetic Hamiltonians
    • QASM code
    • Jordan-Wigner transformations
    • Trotter-Suzuki decompositions
    • Quantum phase estimation
    • Variational Quantum Eigensolvers (VQE)
    • Gibbs state preparation
  3. Advanced Reasoning Techniques: Leveraging Chain-of-Thought (CoT) and Task-Oriented Reasoning and Action (ToRA) frameworks to refine problem-solution pairs.
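
As a rough illustration of the template-based route, the sketch below fills a problem template with randomly drawn parameters. The template text, the parameter ranges, the tag values, and the `make_problem` helper are all hypothetical; the README does not show the dataset's actual templates.

```python
import random
from string import Template

# Hypothetical template: the dataset's real templates are not shown in this README.
ISING_TEMPLATE = Template(
    r"Consider a transverse-field Ising chain of $n qubits with Hamiltonian "
    r"\( H = -J \sum_{i=1}^{$n_minus_1} Z_i Z_{i+1} - h \sum_{i=1}^{$n} X_i \), "
    r"where \( J = $J \) and \( h = $h \). "
    r"Write the first-order Trotter-Suzuki approximation of \( e^{-iHt} \) "
    r"for \( t = $t \) using $steps steps."
)


def make_problem(rng: random.Random) -> dict:
    """Draw parameters and fill the template; all ranges are illustrative."""
    n = rng.randint(2, 6)
    text = ISING_TEMPLATE.substitute(
        n=n,
        n_minus_1=n - 1,
        J=rng.choice([0.5, 1.0, 2.0]),
        h=rng.choice([0.5, 1.0, 1.5]),
        t=rng.choice([0.1, 0.5, 1.0]),
        steps=rng.choice([2, 4, 8]),
    )
    return {
        "problem_text": text,
        "main_domain": "Quantum Spin Chains",
        "sub_domain": "Ising Models",
        "tags": ["ising", "trotterization"],  # hypothetical tag values
    }


if __name__ == "__main__":
    # A fixed seed makes the generated problem reproducible.
    print(make_problem(random.Random(0))["problem_text"])
```

Passing an explicit `random.Random` instance keeps generation reproducible, which helps when regenerated problems need to be matched against stored solutions.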

Structure

The dataset contains the following fields:

  • images: Optional multimodal inputs, such as visualizations of quantum circuits or spin models.
  • problem_text: The quantum computing problem, formatted in plain text or LaTeX.
  • solution: A detailed step-by-step solution generated by an LLM.
  • main_domain: The primary quantum domain, e.g., "Quantum Spin Chains" or "Hamiltonian Dynamics."
  • sub_domain: Specific subtopics, e.g., "Ising Models" or "Trotterization."
  • tags: Relevant tags for classification and retrieval.
  • model_name: The name of the model used to generate the problem or solution.
  • timestamp: The date and time of creation.
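
The fields above can be modeled as a small record type. This is a minimal sketch, not the dataset's official schema: the Python types, the ISO 8601 timestamp format, the representation of `images` as a list of paths, and the example values are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class QuantumRecord:
    """One dataset entry; the types are assumptions based on the field list."""
    problem_text: str
    solution: str
    main_domain: str
    sub_domain: str
    tags: list[str] = field(default_factory=list)
    model_name: str = ""
    timestamp: str = ""  # assumed to be ISO 8601
    images: list[str] = field(default_factory=list)  # assumed: paths or URLs

    def validate(self) -> None:
        """Raise ValueError if a required text field is empty or the timestamp is malformed."""
        for name in ("problem_text", "solution", "main_domain", "sub_domain"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be non-empty")
        if self.timestamp:
            datetime.fromisoformat(self.timestamp)  # raises ValueError if malformed


# Illustrative record; the problem and solution text are made up for this sketch.
example = QuantumRecord(
    problem_text=r"Compute the eigenvalues of \( H = X + Z \).",
    solution=r"Since \( H^2 = 2I \), the eigenvalues are \( \pm\sqrt{2} \).",
    main_domain="Hamiltonian Dynamics",
    sub_domain="Trotterization",
    tags=["hamiltonian", "eigenvalues"],
    model_name="Qwen-2.5-Coder",
    timestamp="2025-01-01T12:00:00+00:00",
)
example.validate()
serialized = json.dumps(asdict(example))
```

Serializing through `asdict` and `json.dumps` gives one JSON object per record, which suits line-delimited storage.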

Key Features

  • Comprehensive Coverage: Spanning 90 primary domains and hundreds of subdomains.
  • High Quality: Problems and solutions validated through advanced reasoning frameworks and Judge LLMs.
  • Open Access: Designed to support researchers, educators, and developers in the field of quantum computing.
  • Scalable Infrastructure: Metadata and structure optimized for efficient querying and usage.
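
One way to use the metadata for querying is to index records by `main_domain` and then filter by tag. The sketch below uses plain dictionaries and in-memory records; the helper names and the sample records are hypothetical, and only the field names come from this README.

```python
from collections import defaultdict


def build_domain_index(records):
    """Map each main_domain value to the positions of its records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record["main_domain"]].append(position)
    return index


def query(records, index, main_domain, tag=None):
    """Return records in a domain, optionally restricted to those carrying a tag."""
    hits = [records[i] for i in index.get(main_domain, [])]
    if tag is not None:
        hits = [r for r in hits if tag in r.get("tags", [])]
    return hits


# Made-up sample records that use the field names listed under Structure.
records = [
    {"main_domain": "Quantum Spin Chains", "sub_domain": "Ising Models", "tags": ["ising"]},
    {"main_domain": "Hamiltonian Dynamics", "sub_domain": "Trotterization", "tags": ["trotter"]},
    {"main_domain": "Quantum Spin Chains", "sub_domain": "Ising Models", "tags": ["heisenberg"]},
]
index = build_domain_index(records)
spin_chain_ising = query(records, index, "Quantum Spin Chains", tag="ising")
```

Building the index once costs a single pass over the data, after which each domain lookup avoids rescanning all records.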

Example Domains

Some of the key domains covered in the dataset include:

  • Synthetic Hamiltonians: Energy computations and time evolution.
  • Quantum Spin Chains: Ising, Heisenberg, and advanced integrable models.
  • Yang-Baxter Solvability: Determining whether quantum models are integrable via the Yang-Baxter equation.
  • Trotter-Suzuki Decompositions: Efficient simulation of Hamiltonian dynamics.
  • Quantum Phase Estimation: Foundational in quantum algorithms.
  • Variational Quantum Eigensolvers (VQE): Optimization for quantum chemistry.
  • Randomized Circuit Optimization: Enhancing algorithm robustness in noisy conditions.
  • Quantum Thermodynamics: Gibbs state preparation and entropy calculations.
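
To give a flavor of the Trotter-Suzuki domain, the sketch below compares a first-order Trotter product with the exact evolution for the single-qubit Hamiltonian H = X + Z. The Hamiltonian and function names are chosen for this illustration and are not taken from the dataset. Because (X + Z)^2 = 2I, the exact propagator has the closed form cos(√2 t) I - i sin(√2 t) (X + Z)/√2, so no matrix-exponential library is needed.

```python
import cmath
import math


def matmul(a, b):
    """Multiply two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def exp_x(angle):
    """exp(-i * angle * X) for the Pauli X matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -1j * s], [-1j * s, c]]


def exp_z(angle):
    """exp(-i * angle * Z) for the Pauli Z matrix."""
    return [[cmath.exp(-1j * angle), 0], [0, cmath.exp(1j * angle)]]


def exact_evolution(t):
    """exp(-i t (X + Z)), using (X + Z)^2 = 2I."""
    w = math.sqrt(2.0)
    c, s = math.cos(w * t), math.sin(w * t) / w
    return [[c - 1j * s, -1j * s], [-1j * s, c + 1j * s]]


def trotter_evolution(t, n_steps):
    """First-order Trotter product (exp(-i X dt) exp(-i Z dt))^n_steps."""
    dt = t / n_steps
    step = matmul(exp_x(dt), exp_z(dt))
    u = [[1, 0], [0, 1]]
    for _ in range(n_steps):
        u = matmul(step, u)
    return u


def max_abs_error(t, n_steps):
    """Largest entrywise deviation between the Trotter and exact propagators."""
    u, v = trotter_evolution(t, n_steps), exact_evolution(t)
    return max(abs(u[i][j] - v[i][j]) for i in range(2) for j in range(2))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, max_abs_error(1.0, n))
```

For a first-order product formula the error is expected to shrink roughly in proportion to 1/n_steps, which the loop at the bottom lets a reader check numerically.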

Contributions

This dataset represents a collaborative effort to advance quantum computing research through the use of large-scale LLMs. It offers:

  1. A scalable and comprehensive dataset for fine-tuning LLMs.
  2. Rigorous methodologies for generating and validating quantum problem-solving tasks.
  3. Open-access resources to foster collaboration and innovation in the quantum computing community.

Cite:

@dataset{quantumllm_instruct,
  title={QuantumLLMInstruct: A 500k LLM Instruction-Tuning Dataset with Problem-Solution Pairs for Quantum Computing},
  author={Shlomo Kashani},
  year={2025},
  url={https://huggingface.co/datasets/QuantumLLMInstruct}
}