
Mistral-7B-Instruct-v0.3 Claims Extraction Model

Model Description

This model is a fine-tuned version of Mistral-7B-Instruct-v0.3, specifically designed for the task of claims extraction from Reddit comments. It has been fine-tuned to identify and rephrase claims within a given comment, making them self-contained and understandable without needing additional context. This model is part of SAGESSE, a pipeline for processing data from Reddit threads to create argument maps.

Training and Fine-Tuning

  • Base Model: Mistral-7B-Instruct-v0.3
  • Training Data:
    • Claims Extraction Dataset: 981 annotated comments from the political domain.
    • Annotation Source: Annotated using GPT-4-turbo with few-shot prompting.
  • Fine-Tuning Approach:
    • Technique: Parameter Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA).
    • Training Configuration:
      • Epochs: 2
      • Learning Rate: 2e-4
      • Batch Size: 1
      • Gradient Accumulation Steps: 8
      • Hardware: Single Nvidia A100 GPU with 40GB memory
      • LoRA Rank: 16
      • LoRA Alpha: 16
  • Libraries Used: HuggingFace’s Transformers library
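The configuration above can be sketched with HuggingFace's peft and transformers APIs. Note that the target modules, dropout, and output path below are assumptions for illustration; only the rank, alpha, epochs, learning rate, batch size, and accumulation steps are reported in this card:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration (rank and alpha from the list above;
# target_modules and lora_dropout are assumptions, not reported in this card)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,  # assumption
    task_type="CAUSAL_LM",
)

# Training hyperparameters from the list above
training_args = TrainingArguments(
    output_dir="mistral-7b-claims-extraction",  # placeholder path
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
)
```

With batch size 1 and 8 accumulation steps, gradients are accumulated over 8 comments before each optimizer step, which keeps memory within the 40GB of a single A100.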

Performance

  • Evaluation Metric: BERTScore (Precision, Recall, F1)
  • Performance:
    • Precision: 0.874
    • Recall: 0.874
    • F1 Score: 0.874
  • Efficiency:
    • Average Inference Time: 0.32 seconds per comment

Usage

This model can be used to extract and rephrase claims from comments in Reddit threads. It is particularly useful for creating argument maps and understanding debates in online forums.

Input Format

The model was used with the following prompt template:

### Instruction:
You are a helpful assistant for claim extraction from Reddit comments. A claim is a statement that expresses a point of view or a conclusion
that a speaker believes is true, it can be presented as a pure opinion or supported by evidence. You'll get a Reddit comment as input text,
and you have to extract all the claims that the comment contains. Together with the comment, you will also have the submission that opened
the thread and the parent comment (if different from the submission), to provide you with more context, but you should focus on the comment
itself to extract the claims and make sure that the extracted claim conveys the same opinion expressed in the comment.
If the claim is referring to something or someone implicitly, you should use the information from the comment and its context to rephrase
the claim in a more explicit and clear way, so that one can understand it as it is, without needing additional context. Make sure to remove
from the extracted claims any irrelevant information, like greetings or expressions like 'I think' or 'I believe that'.
You should put the output in a JSON object with the key 'claims' referring to a list of strings, each string being a claim extracted from the input text.
The list should be empty if you haven't found any claim in the input text.

### Context:
Submission:
<submission>

Parent comment:
<parent-comment>

### Comment:
<input-comment>

### Response:
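At inference time, the template above can be filled programmatically. A minimal sketch, where the function name and example texts are illustrative and not part of the released code:

```python
def build_prompt(instruction: str, submission: str, parent_comment: str, comment: str) -> str:
    """Fill the prompt template shown above; `instruction` is the full
    claim-extraction instruction text from this card."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Context:\nSubmission:\n{submission}\n\n"
        f"Parent comment:\n{parent_comment}\n\n"
        f"### Comment:\n{comment}\n\n"
        "### Response:\n"
    )

# Illustrative example (not from the training data)
prompt = build_prompt(
    "You are a helpful assistant for claim extraction from Reddit comments. ...",
    "Should the voting age be lowered?",
    "I doubt 16-year-olds follow politics.",
    "Plenty of teenagers are politically engaged these days.",
)
```

The resulting string is passed to the model as-is; generation then continues after the "### Response:" marker.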

Output Format

The model outputs a JSON object containing the extracted claims:

{
  "claims": [
    "Extracted claim 1",
    "Extracted claim 2",
    ...
  ]
}
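Because the model emits free text, the JSON object is typically recovered from the generated string before use. A defensive parsing sketch (the helper name is illustrative):

```python
import json

def parse_claims(generated_text: str) -> list[str]:
    """Extract the 'claims' list from the model output, returning an
    empty list if no valid JSON object with string claims is found."""
    start = generated_text.find("{")
    end = generated_text.rfind("}")
    if start == -1 or end < start:
        return []
    try:
        payload = json.loads(generated_text[start : end + 1])
    except json.JSONDecodeError:
        return []
    claims = payload.get("claims", [])
    if not isinstance(claims, list):
        return []
    return [c for c in claims if isinstance(c, str)]
```

Returning an empty list on malformed output mirrors the prompt's convention that an empty "claims" list means no claims were found.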

Limitations

  • Context Dependency: The model performs best when provided with sufficient context to resolve implicit subjects.
  • Model Hallucination: There may be instances where the model rephrases claims incorrectly, particularly with sarcastic or highly nuanced comments.
  • Dataset Specificity: The model is trained on political discussions from Reddit and may not generalize well to other domains.

Citation

If you use this model in your research, please cite the following paper:

TBD

Contact

For questions or issues, please contact Matteo Santelmo at matteo.santelmo@epfl.ch.

Model Details

  • Format: Safetensors
  • Model size: 7.25B params
  • Tensor type: F32