syncIAL🍏
A Multi-Purpose Synthetic Debate and Argument Mapping Corpus
- 🛢️ Dataset at HF Hub
- 👩‍💻 Python Code Repo
- 🏋️‍♀️ Distilled ML Dataset
What exactly is syncIALO?
syncIALO is a collection of synthetic argument mapping datasets. Its first and primary corpus (uninspiringly called synthetic_corpus-001) contains
- >600k claims (aka arguments), which are organized in
- >1000 argument maps.
syncIALO argument maps are directed graphs: nodes represent claims and labeled edges indicate that one claim supports or attacks another one.
These argument maps can be easily loaded and processed with networkx.
from huggingface_hub import hf_hub_download
import json
import networkx as nx
from pathlib import Path
path = Path(hf_hub_download(
    repo_id="DebateLabKIT/syncialo-raw",
    filename="data/synthetic_corpus-001/eval/debate-eval-0001/node_link_data-debate-eval-0001.json"))
argmap = nx.node_link_graph(json.loads(path.read_text()))
type(argmap)
# >>> networkx.classes.digraph.DiGraph
argmap.number_of_nodes()
# >>> 511
argmap.number_of_edges()
# >>> 510
next(iter(argmap.nodes.data()))[1]
# >>> {'claim': 'Governments should provide substantial financial
# >>> incentives to families with children to counteract declining
# >>> population growth and mitigate the long-term consequences on
# >>> societal stability and progress.',
# >>> 'label': 'Pay to Populate'}
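The support and attack relations live on the edges. You can inspect the edge data in the same way; we don't hard-code attribute names here, since they are best read off from the actual output:

```python
# Peek at one edge and its attributes; the edge data records whether
# the source claim supports or attacks the target claim.
source, target, attrs = next(iter(argmap.edges.data()))
print(source, "->", target)
print(attrs)
```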
Let me show you a randomly sampled subgraph from a syncIALO debate in the train split, rendered with Argdown:
[Learning Over Leisure]: Schools should restrict students' access to fan fiction and social media to protect the integrity of education.
    <- <Restriction Infringes on Freedom of Expression>: Restricting access to fan fiction and social media unconstitutionally limits students' right to freedom of expression and stifles their creativity.
        <+ <Lifelong Learning>: By exercising their freedom of expression, students develop essential skills in critical thinking, problem-solving, and effective communication, preparing them for success in their future careers and personal lives.
            <- <Echo Chamber Effect>: Exercising freedom of expression in an unstructured environment can create an echo chamber where students only communicate with like-minded individuals, failing to develop the skills to engage with diverse perspectives and opposing views.
                <- <Silent Observer>: Developing skills to engage with diverse perspectives and opposing views is not essential for effective communication in situations where listening and observing, rather than actively engaging, is the most effective strategy.
        <- <Fan Fiction Distortion>: Fan fiction and social media often distort students' creativity by promoting unoriginal and copyrighted content, rather than fostering genuine artistic expression.
            <- <Artistic Evolution>: The value of artistic expression lies in its ability to evoke emotions and spark new ideas, regardless of whether it is original or builds upon existing works, making the distinction between original and unoriginal content irrelevant.
        <+ <Innovation Incubator>: Unrestricted freedom of expression enables students to develop critical thinking, problem-solving, and communication skills, essential for academic and professional success.
    <+ <Focus on Fundamentals-1>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
        <+ <Focus on Fundamentals-2>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
            <+ <Knowledge Pyramid>: A strong grasp of foundational subjects allows students to recognize relationships between different ideas and concepts, creating a hierarchical structure of knowledge that enhances retention and recall of critical information.
What can I do with it?
Raw syncIALO is great for "distilling" more specific datasets.
- You can use syncIALO to build datasets for pretraining, SFT, DPO or RLVR.
- You can create challenging benchmarks to probe reasoning skills of LLMs.
- You can create tailored few-shot examples for generating argument maps with LLMs.
- You can use syncIALO data as seeds for multi-agent deliberation and personalization of LLMs.
For any of this, you will have to transform the syncIALO debates and distill more specific tasks.
To start with, you could sample submaps and simply verbalize them as dialogues, which then serve as training texts... But we can do better by exploiting the rich information contained in syncIALO.
The following recipe describes a more interesting procedure for distilling reasoning tasks (a code sketch follows further below):
- Sample a submap (serves as answer).
- Process or distort submap (serves as input_args).
- Ask the model to create an argument map given input_args (serves as prompt).
For example:
| Prompt | Answer |
|---|---|
| Here's a list of statements ... Reconstruct these as an argument map! | Argument map |
| Consider these three maps ... Merge them into a single argument map! | Argument map |
| Here's a flawed reconstruction ... Revise and improve! | Argument map |
The deep-argmap-conversations dataset has been distilled accordingly and illustrates further argument mapping tasks that can be created from raw syncIALO.
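As a rough sketch (not the actual pipeline behind deep-argmap-conversations), the first row of the recipe above could be implemented along the following lines. The helper names, the sampling strategy, and the prompt template are illustrative assumptions; how you serialize the answer (YAML, Argdown, ...) is up to you:

```python
import random

import networkx as nx


def sample_submap(argmap: nx.DiGraph, n_nodes: int = 6, seed: int | None = None) -> nx.DiGraph:
    """Sample a weakly connected subgraph with (up to) n_nodes claims."""
    rng = random.Random(seed)
    root = rng.choice(list(argmap.nodes))
    # Grow the submap breadth-first around the root, ignoring edge
    # directions so that we pick up both pros and cons.
    nodes = [root]
    frontier = [root]
    while frontier and len(nodes) < n_nodes:
        current = frontier.pop(0)
        for neighbor in nx.all_neighbors(argmap, current):
            if neighbor not in nodes and len(nodes) < n_nodes:
                nodes.append(neighbor)
                frontier.append(neighbor)
    return argmap.subgraph(nodes).copy()


def make_sft_example(argmap: nx.DiGraph, seed: int | None = None) -> dict:
    """Turn a sampled submap into a (prompt, answer) pair."""
    submap = sample_submap(argmap, seed=seed)
    # Distortion step: present the claims in random order, so that the
    # model has to recover the argumentative structure.
    claims = [data["claim"] for _, data in submap.nodes(data=True)]
    random.Random(seed).shuffle(claims)
    statements = "\n".join(f"- {claim}" for claim in claims)
    prompt = (
        "Here's a list of statements:\n"
        f"{statements}\n"
        "Reconstruct these as an argument map!"
    )
    return {"prompt": prompt, "answer": submap}  # serialize the answer as needed


example = make_sft_example(argmap, seed=42)
print(example["prompt"])
```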
Similarly, you can distill DPO data:
| Prompt | Chosen | Rejected |
|---|---|---|
| Here's a list of statements ... Reconstruct these as an argument map! | Argument map | Shuffled argument map |
| ... | ... | ... |
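A minimal way to obtain the "rejected" answer, assuming the chosen answer is the correct submap, is to rewire its edges at random so that the claims stay the same but the argumentative structure becomes wrong (the function below is a sketch, not part of the syncialo package):

```python
import random

import networkx as nx


def shuffle_argmap(submap: nx.DiGraph, seed: int | None = None) -> nx.DiGraph:
    """Copy a submap, keeping all claims but rewiring the edges at random."""
    rng = random.Random(seed)
    shuffled = nx.DiGraph()
    shuffled.add_nodes_from(submap.nodes(data=True))
    nodes = list(submap.nodes)
    for _, _, data in submap.edges(data=True):
        source, target = rng.sample(nodes, 2)  # random distinct source/target pair
        shuffled.add_edge(source, target, **data)  # keep the support/attack label, misplace the edge
    return shuffled
```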
If you instruct the LLM to generate argument maps in a parsable format (like YAML, Mermaid or Argdown), you have virtually unlimited possibilities to verify solutions and create RLVR data:
| Prompt | Reward |
|---|---|
| Here's a list of claims ... Reconstruct these as a YAML argument map! | Valid YAML? |
| Here's a list of claims ... Reconstruct these as an argument map with k nodes! | Valid YAML with k nodes? |
| ... | ... |
(Granted, syncIALO is not strictly necessary for such RLVR training, which might nonetheless benefit from diverse and well-designed syncIALO prompts.)
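A verifier for the second row of the table might look like this; the expected YAML schema (a mapping with a nodes list and an edges list) is an assumption you would fix in the prompt:

```python
import yaml  # PyYAML


def reward_valid_yaml_with_k_nodes(completion: str, k: int) -> float:
    """Binary reward: 1.0 iff the completion is valid YAML with exactly k nodes.

    Assumes the prompt asked for a mapping with a 'nodes' list and an
    'edges' list, e.g.:

        nodes:
          - Claim A
          - Claim B
        edges:
          - {from: Claim B, to: Claim A, relation: support}
    """
    try:
        parsed = yaml.safe_load(completion)
    except yaml.YAMLError:
        return 0.0
    if not isinstance(parsed, dict) or not isinstance(parsed.get("nodes"), list):
        return 0.0
    return 1.0 if len(parsed["nodes"]) == k else 0.0
```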
Moreover, multiple-choice tasks can easily be created like so:
| Prompt | Options |
|---|---|
| Consider this argument map ... What is x (SOME_GRAPH_PROPERTY)? | x=a, x=b ... |
| Here's a list of statements ... Which map adequately captures the argumentation? | a) argument map, b) shuffled map ... |
| ... | ... |
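For instance, a multiple-choice item about a simple graph property can be generated directly from a sampled submap; the property chosen here (number of claims) and the distractor scheme are just one possibility:

```python
import random

import networkx as nx


def make_mc_item(submap: nx.DiGraph, seed: int | None = None) -> dict:
    """Build a multiple-choice question about the number of claims in the map."""
    rng = random.Random(seed)
    correct = submap.number_of_nodes()
    # Distractors: nearby but wrong node counts.
    candidates = {correct + delta for delta in (-2, -1, 1, 2) if correct + delta > 0}
    options = list(candidates | {correct})
    rng.shuffle(options)
    return {
        "question": "Consider this argument map ... How many claims does it contain?",
        "options": options,
        "answer_index": options.index(correct),
    }
```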
This is cool for improving CoT / reasoning quality with RL and verifiable rewards, and of course for benchmarking LLMs.
But syncIALO can help during inference, too. Suppose you want your model to reconstruct a given text as an argument map of a certain size, say in Argdown. If the model struggles, you can generate as many diverse few-shot examples tailored to the problem at hand as you like, and thus guide the model.
Personas datasets are helpful for increasing the diversity in synthetic datasets, for broad solution space exploration during inference, and for calibrating agentic AI systems. syncIALO can play a similar role and complement existing personas datasets: For example, one may additionally characterize a persona through a stance they adopt in a debate, or an argument they have put forth, endorsed or criticized.
So, syncIALO is really multi-purpose. Let's explore together what you can do with it!
How did you build it?
We've set up a dynamic pipeline that mimics a comprehensive argument mapping process. An LLM-based agent simulates a critical thinker who seeks, assesses and stores novel arguments.
The argument map is built recursively by adding pros and cons to the leaf nodes until a maximum depth has been reached. The AI agent identifies the premises of a target argument A before conceiving further arguments that either support or attack A. It selects candidate arguments for salience and diversity, and checks for duplicates (via semantic similarity) before adding them to the argument map.
To increase diversity, we sample topics and motions by randomly choosing tags from a diverse tag cloud. We also let the AI critical thinker adopt a randomly drawn persona whenever it generates a new candidate argument.
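In rough, pseudocode-ish Python, the construction loop looks like this. The helper functions and their names are ours (trivial stand-ins for LLM calls and embedding-based checks), not the actual syncialo package API, and the edge attribute name is likewise an assumption:

```python
import networkx as nx

# Placeholder helpers: in the real pipeline these are LLM calls and
# embedding-based checks, not the trivial stand-ins shown here.
def draw_random_persona():
    return "a skeptical economist"

def identify_premises(claim):
    return [claim]

def generate_pro_and_con_candidates(claim, premises, persona):
    return []  # list of (new_claim, "supports" | "attacks") pairs

def rank_by_salience_and_diversity(candidates):
    return candidates

def is_duplicate(claim, argmap):
    return claim in argmap


def build_argument_map(root_claim: str, max_depth: int, per_node: int = 3) -> nx.DiGraph:
    """Grow an argument map by recursively adding pros and cons to leaf nodes."""
    argmap = nx.DiGraph()
    argmap.add_node(root_claim)
    frontier = [(root_claim, 0)]
    while frontier:
        target, depth = frontier.pop()
        if depth >= max_depth:
            continue
        persona = draw_random_persona()
        premises = identify_premises(target)
        candidates = generate_pro_and_con_candidates(target, premises, persona)
        for claim, valence in rank_by_salience_and_diversity(candidates)[:per_node]:
            if is_duplicate(claim, argmap):
                continue
            argmap.add_node(claim)
            # 'valence' records whether the new claim supports or attacks its target.
            argmap.add_edge(claim, target, valence=valence)
            frontier.append((claim, depth + 1))
    return argmap
```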
The LLM-based agent is powered by different ❤️ open models, depending on the workflow step. We've used meta-llama/Llama-3.1-405B for generating and assessing arguments, and a finetuned Llama-3.1-8B model for less demanding generative tasks such as formatting. MoritzLaurer/deberta-v3-large-zeroshot-v2.0 serves as our multi-purpose classifier, and we use sentence-transformers/all-MiniLM-L6-v2 to generate sentence embeddings.
The pipeline is built on top of ❤️ open source frameworks. We're releasing the syncIALO dataset together with the identically named Python package, which we have used to build the synthetic dataset.
What is the broader background?
Philosophically, syncIALO is inspired by the Rylean idea that epistemic competence is closely tied to argumentative language. One's epistemic competence consists, in large part, in the ability to produce utterances in accordance with the norms of logical, evidential or scientific reasoning. Argument mapping and critical thinking may help one to excel in this domain. That's why they might provide useful resources for training and probing AI systems.
The wonderful kialo.com project can be credited with having solved the problem of designing intuitive yet effective online collaborative debating / argument mapping platforms. It's a pleasure to see how successful they are.
The informal argument maps amassed on the Kialo site are a gold mine for NLP researchers, AI engineers, computational sociologists, and Critical Thinking scholars. Yet, the mine is legally barred (for them): Debate data downloaded or scraped from the website may not be used for research or commercial purposes in the absence of explicit permission or license agreement.
That has been a further motivation for creating the syncIALO corpora, which may serve as a drop-in replacement for the Kialo data. (But it's clear that syncIALO is no universal substitute: A cognitive scientist, for example, who studies empirically how humans actually argue might find syncIALO of little help.)
Who's behind this?
syncIALO has been conceived and built by the DebateLab Team at KIT. You can find us on Hugging Face and GitHub, or follow our blog.
🤗 Hugging Face has sponsored the syncIALO project through inference time / compute credits. 🙏 We gratefully acknowledge the generous support. 🫶
How can I get involved?
You can help to improve syncIALO and to overcome its current limitations, for example by contributing pipelines to
- check the data (argumentative relations, wording, appropriate labeling)
- measure local and global diversity (claim embeddings)
- spot and remove claim duplicates
- build improved versions through argumentative refinement and re-wiring
You might also want to
- create new corpora (varying LLMs, topic tags, graph configs)
- translate an existing debate corpus (we already have a pipeline for this)
Yet, most importantly, we invite you to
- build with syncIALO and share your work.
Don't hesitate to reach out!