Apply for community grant: Academic project (GPU and storage)

by BoltzmannEntropy
We present **QuantumLLMInstruct (QLMMI)**, an innovative dataset of over 500,000 meticulously curated instruction-following problem-solution pairs designed specifically for quantum computing: **the largest and most comprehensive dataset of its kind**. Originating from over 90 primary seed domains and encompassing hundreds of subdomains autonomously generated by large language models (LLMs), QLMMI marks a transformative step in the diversity and richness of quantum computing datasets.

Designed for **instruction fine-tuning**, QLMMI aims to significantly improve LLM performance on complex quantum computing challenges across a wide range of quantum physics topics. While LLMs have propelled advances in computational science through datasets such as **Omni-MATH** and **OpenMathInstruct**, those efforts primarily target Olympiad-level mathematics, leaving quantum computing largely unexplored.

The creation of QLMMI follows a rigorous **four-stage methodology**. First, foundational problems are developed using predefined templates, focusing on critical areas such as synthetic Hamiltonians, QASM code generation, Jordan-Wigner transformations, and Trotter-Suzuki decompositions for well-known models like the Ising and Heisenberg spin chains. Second, detailed, domain-specific solutions are crafted to ensure accuracy and relevance. Third, the dataset is enriched through advanced reasoning techniques, including Chain-of-Thought (CoT) and Task-Oriented Reasoning and Action (ToRA), which enhance problem-solution diversity while adhering to strict mathematical standards. Finally, a zero-shot Judge LLM performs self-assessments to validate the dataset's quality and reliability, minimizing the need for human oversight.
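To make the template stage concrete, here is a minimal sketch of one such foundational problem: building a transverse-field Ising Hamiltonian and checking a first-order Trotter-Suzuki step against the exact propagator. This is illustrative only, not the QLMMI pipeline code; the helper names are hypothetical, and only NumPy and SciPy are assumed.

```python
# Minimal sketch of a template-generated problem: construct a
# transverse-field Ising Hamiltonian and compare a first-order
# Trotter-Suzuki approximation with the exact propagator.
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def op_on(op, site, n):
    """Embed a single-qubit operator acting on `site` into an n-qubit space."""
    out = np.array([[1.0 + 0j]])
    for k in range(n):
        out = np.kron(out, op if k == site else I2)
    return out

def ising_parts(n, J=1.0, h=0.5):
    """Return (H_zz, H_x) with H = -J sum_i Z_i Z_{i+1} - h sum_i X_i."""
    dim = 2 ** n
    H_zz = np.zeros((dim, dim), dtype=complex)
    H_x = np.zeros((dim, dim), dtype=complex)
    for i in range(n - 1):
        H_zz -= J * op_on(Z, i, n) @ op_on(Z, i + 1, n)
    for i in range(n):
        H_x -= h * op_on(X, i, n)
    return H_zz, H_x

n, t, steps = 3, 1.0, 50
H_zz, H_x = ising_parts(n)
dt = t / steps

U_exact = expm(-1j * t * (H_zz + H_x))
U_step = expm(-1j * dt * H_zz) @ expm(-1j * dt * H_x)
U_trotter = np.linalg.matrix_power(U_step, steps)

# First-order Trotter error shrinks as O(1/steps) at fixed t.
print("operator-norm error:", np.linalg.norm(U_exact - U_trotter, 2))
```

Dense matrices are fine here because template problems of this kind target small, pedagogical instances; a generated solution can embed exactly this sort of numerical self-check alongside the symbolic derivation.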

To foster collaboration in this evolving field, we use the **Qwen-2.5-Coder** family of models, selected for their strong mathematical reasoning capabilities and permissive licensing. **All associated code and the complete dataset are openly available**, providing a foundation for future breakthroughs in quantum computing applications of LLMs. Our pipeline is generic and can easily be adapted to generate instruction-tuning datasets in any domain.
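As a sketch of how the generation stage might call these models, the snippet below prompts a Qwen2.5-Coder checkpoint for a solution and packs the result into an instruction-tuning record. The model ID is a real Hugging Face checkpoint, but the prompt, system message, and record fields are illustrative assumptions, not the published QLMMI schema.

```python
# Sketch of one generation step with Qwen2.5-Coder via Hugging Face
# transformers. Prompt text and record fields are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

problem = ("Apply the Jordan-Wigner transformation to the two-site "
           "transverse-field Ising Hamiltonian and give the resulting "
           "fermionic operators.")

messages = [
    {"role": "system",
     "content": "You are a quantum computing assistant. Reason step by step."},
    {"role": "user", "content": problem},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
solution = tokenizer.decode(output[0, inputs.shape[-1]:],
                            skip_special_tokens=True)

# One instruction-tuning record (fields assumed for illustration).
record = {"domain": "jordan_wigner", "instruction": problem,
          "solution": solution}
print(record)
```

A fourth-stage judge pass could then re-prompt the same model zero-shot to score each record before it enters the dataset.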
