Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of google/vit-base-patch16-224-in21k for classifying house conditions into 4 categories.

Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:

  • good (dobre)
  • unknown (nepoznato)
  • ruined (oronule)
  • medium (srednje)

Training Details

Training Data

  • Total dataset: 935 images
  • Training set: 776 images
  • Validation set: 80 images
  • Test set: 79 images
  • Classes: 4 (dobre, nepoznato, oronule, srednje)
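
The split can be reproduced in outline with the datasets library. A minimal sketch, assuming the images live in a local "houses" folder with one subdirectory per class; the folder name and split calls are illustrative, only the sizes and the seed come from the setup described here:

from datasets import load_dataset

# 935 labeled images in an imagefolder layout (one subfolder per class)
dataset = load_dataset("imagefolder", data_dir="houses")["train"]

# Carve off 79 test images, then 80 validation images (935 -> 776/80/79)
split = dataset.train_test_split(test_size=79, seed=42)
test_ds = split["test"]
split = split["train"].train_test_split(test_size=80, seed=42)
train_ds, val_ds = split["train"], split["test"]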

Training Hyperparameters

  • Epochs: 10
  • Batch size: 16 per device
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Seed: 42 (for reproducibility)
  • Training time: 5m 45s
  • Samples per second: 22.43
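
These settings map onto TrainingArguments as follows. A minimal sketch assuming the standard Trainer API; the output directory is a placeholder, and the evaluation/checkpoint cadence is taken from the Training Procedure section below:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-house-condition",  # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
    eval_strategy="steps",             # evaluate every 50 steps
    eval_steps=50,
    save_steps=50,                     # checkpoint at the same cadence
    load_best_model_at_end=True,       # keep the best validation checkpoint
)
# AdamW is the Trainer's default optimizer, so no extra flag is needed.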

Evaluation Results

Validation Set Performance

  • Accuracy: 81.2%
  • Loss: 0.5629

Training Set Performance

  • Final Training Loss: 0.5295

Per-Class Metrics (Validation)

Class      Precision  Recall  F1-Score  Support
good          0.78     0.70     0.74       10
unknown       1.00     0.83     0.91       24
ruined        0.62     1.00     0.77       15
medium        0.85     0.74     0.79       31

Overall Metrics:

  • Accuracy: 81.2% (65/80 correct)
  • Macro Average: Precision=0.81, Recall=0.82, F1=0.80
  • Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

Confusion Matrix (Validation)

              Predicted →
           good  unknown  ruined  medium
good       [  7      0      0      3 ]
unknown    [  1     20      2      1 ]
ruined     [  0      0     15      0 ]
medium     [  1      0      7     23 ]

Key Insights:

  • 'unknown' class has perfect precision (1.00) - no false positives
  • 'ruined' class has perfect recall (1.00) - catches all ruined houses
  • Main confusion: 'medium' condition sometimes mistaken for 'ruined' (7 cases)
  • 'good' houses occasionally misclassified as 'medium' (3 cases)
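
Both the per-class table and the confusion matrix can be regenerated with scikit-learn. A minimal sketch; y_true and y_pred stand in for the integer class ids collected during the validation loop (the short lists below are placeholders, not the real predictions):

from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 2, 3, 3]  # placeholder ids: 0=dobre, 1=nepoznato, 2=oronule, 3=srednje
y_pred = [0, 1, 2, 2, 3]

class_names = ["dobre", "nepoznato", "oronule", "srednje"]
print(classification_report(y_true, y_pred, target_names=class_names))
print(confusion_matrix(y_true, y_pred))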

Usage

from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
# id2label keys are ints after loading, so index with the int directly
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]  # int key, not str
    print(f"{label}: {prob.item():.2%}")

Limitations and Bias

  • The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
  • Performance varies by class - see validation metrics for details
  • The model may have difficulty distinguishing between similar condition categories
  • Dataset size: 935 images (relatively small for deep learning)
  • Images are from a specific geographical/architectural context

Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach (a code sketch follows the list):

  1. Pre-trained weights: Initialized from google/vit-base-patch16-224-in21k
  2. Classification head: Replaced with a new 4-class classifier
  3. Fine-tuning: All model parameters were fine-tuned on the custom dataset
  4. Data preprocessing: Images converted to RGB to ensure consistent 3-channel input
  5. Evaluation strategy: Evaluated every 50 steps with checkpoint saving
  6. Best model selection: Best model automatically loaded based on validation performance
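
A minimal sketch of steps 1-4 wired together, reusing training_args, train_ds, and val_ds from the sketches above; the variable names and the collator are illustrative, while the label order matches the class list:

import torch
from transformers import ViTForImageClassification, ViTImageProcessor, Trainer

class_names = ["dobre", "nepoznato", "oronule", "srednje"]

# Steps 1-2: pre-trained backbone with a fresh 4-class classification head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(class_names),
    id2label=dict(enumerate(class_names)),
    label2id={name: i for i, name in enumerate(class_names)},
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Step 4: convert to RGB so every image is a consistent 3-channel input
def transform(batch):
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate_fn(items):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in items]),
            "labels": torch.tensor([x["labels"] for x in items])}

# Steps 3, 5, 6: full fine-tuning; eval/checkpoint cadence and best-model
# selection come from training_args
trainer = Trainer(model=model, args=training_args, data_collator=collate_fn,
                  train_dataset=train_ds.with_transform(transform),
                  eval_dataset=val_ds.with_transform(transform))
trainer.train()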

Base Model

google/vit-base-patch16-224-in21k

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.

Framework Versions

  • Transformers: 4.57.1
  • PyTorch: 2.x
  • Datasets: 3.x
  • Python: 3.13

Citation

If you use this model, please cite:

@misc{house-condition-vit,
  author = {Your Name},
  title = {Fine-tuned ViT for House Condition Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}

Model Card Authors

This model card was created by the model author.

Additional Information

  • Repository: [GitHub Repository URL]
  • Contact: [Your Email or Contact]