---
tags:
- transformers
- text-classification
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large
language:
- ru
pipeline_tag: text-classification
widget:
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
  example_title: "Positive example"
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Мы хорошо поработали."
  example_title: "Negative example"
- text: "passage: мягко говоря, Cl[Sep]query: Мягко говоря, это была ошибка."
  example_title: "Positive example"
---

# Russian Constructicon Classifier

A binary classification model for determining whether a Russian Constructicon pattern is present in a given text example. Fine-tuned from [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) in two stages: first as a semantic model on Russian Constructicon data, then for binary classification.

## Model Details

- **Base model:** intfloat/multilingual-e5-large
- **Task:** Binary text classification
- **Language:** Russian
- **Training:** Two-stage fine-tuning on Russian Constructicon data

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed for use with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library:

```python
from ruscxnpipe import ConstructionClassifier

classifier = ConstructionClassifier(
    model_name="Futyn-Maker/ruscxn-classifier"
)

# Classify candidates (output from semantic search)
queries = ["Петр так и замер."]
candidates = [[{"id": "pattern1", "pattern": "NP-Nom так и VP-Pfv"}]]

results = classifier.classify_candidates(queries, candidates)
print(results[0][0]['is_present'])  # 1 if present, 0 if absent
```

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("Futyn-Maker/ruscxn-classifier")
tokenizer = AutoTokenizer.from_pretrained("Futyn-Maker/ruscxn-classifier")

# Format: "passage: [pattern][Sep]query: [example]"
text = "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.softmax(outputs.logits, dim=-1)
is_present = torch.argmax(prediction, dim=-1).item()
print(f"Construction present: {is_present}")  # 1 = present, 0 = absent
```

## Input Format

The model expects input in the format: `"passage: [pattern][Sep]query: [example]"`

- **passage:** The constructicon pattern to check for
- **query:** The Russian text to analyze

## Training

1. **Stage 1:** Semantic embedding training on Russian Constructicon examples and patterns
2. **Stage 2:** Binary classification fine-tuning to predict construction presence

## Output

- **Label 0:** Construction is NOT present in the text
- **Label 1:** Construction IS present in the text

## Framework Versions

- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Python: 3.10.12
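When classifying many pattern/example pairs directly (without RusCxnPipe), it helps to build the `"passage: [pattern][Sep]query: [example]"` strings programmatically rather than by hand. A minimal sketch is below; the helper name `format_input` is our own and not part of the model's or library's API:

```python
def format_input(pattern: str, example: str) -> str:
    """Build the input string the classifier expects.

    The pattern goes in the "passage" slot and the text to analyze
    in the "query" slot, joined by the literal "[Sep]" marker.
    """
    return f"passage: {pattern}[Sep]query: {example}"


# Batch of (pattern, example) pairs to check
pairs = [
    ("NP-Nom так и VP-Pfv", "Петр так и замер."),
    ("мягко говоря, Cl", "Мягко говоря, это была ошибка."),
]

texts = [format_input(p, e) for p, e in pairs]
# These strings can then be passed to the tokenizer as a batch, e.g.
# tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
for t in texts:
    print(t)
```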