Fine-tuned ViT for House Condition Classification

This model is a fine-tuned version of google/vit-base-patch16-224-in21k for classifying house conditions into 4 categories.

Model Description

This Vision Transformer (ViT) model has been fine-tuned to classify house images into four condition categories:

  • good (dobre)
  • unknown (nepoznato)
  • ruined (oronule)
  • medium (srednje)

Training Details

Training Data

  • Total dataset: 935 images
  • Training set: 776 images
  • Validation set: 80 images
  • Test set: 79 images
  • Classes: 4 (dobre, nepoznato, oronule, srednje)
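
The split can be reproduced in outline with the datasets library. A minimal sketch, assuming the images live in a local "houses" folder with one subdirectory per class; the folder name and split calls are illustrative, only the sizes and the seed come from the setup described here:

from datasets import load_dataset

# 935 labeled images in an imagefolder layout (one subfolder per class)
dataset = load_dataset("imagefolder", data_dir="houses")["train"]

# Carve off 79 test images, then 80 validation images (935 -> 776/80/79)
split = dataset.train_test_split(test_size=79, seed=42)
test_ds = split["test"]
split = split["train"].train_test_split(test_size=80, seed=42)
train_ds, val_ds = split["train"], split["test"]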

Training Hyperparameters

  • Epochs: 10
  • Batch size: 16 per device
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Seed: 42 (for reproducibility)
  • Training time: 5m 45s
  • Samples per second: 22.43
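
These settings map onto TrainingArguments as follows. A minimal sketch assuming the standard Trainer API; the output directory is a placeholder, and the evaluation/checkpoint cadence is taken from the Training Procedure section below:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-house-condition",  # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    seed=42,
    eval_strategy="steps",             # evaluate every 50 steps
    eval_steps=50,
    save_steps=50,                     # checkpoint at the same cadence
    load_best_model_at_end=True,       # keep the best validation checkpoint
)
# AdamW is the Trainer's default optimizer, so no extra flag is needed.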

Evaluation Results

Validation Set Performance

  • Accuracy: 81.2%
  • Loss: 0.5629

Training Set Performance

  • Final Training Loss: 0.5295

Per-Class Metrics (Validation)

Class      Precision  Recall  F1-Score  Support
good          0.78     0.70     0.74       10
unknown       1.00     0.83     0.91       24
ruined        0.62     1.00     0.77       15
medium        0.85     0.74     0.79       31

Overall Metrics:

  • Accuracy: 81.2% (65/80 correct)
  • Macro Average: Precision=0.81, Recall=0.82, F1=0.80
  • Weighted Average: Precision=0.84, Recall=0.81, F1=0.82

Confusion Matrix (Validation)

              Predicted →
           good  unknown  ruined  medium
good       [  7      0      0      3 ]
unknown    [  1     20      2      1 ]
ruined     [  0      0     15      0 ]
medium     [  1      0      7     23 ]

Key Insights:

  • 'unknown' class has perfect precision (1.00) - no false positives
  • 'ruined' class has perfect recall (1.00) - catches all ruined houses
  • Main confusion: 'medium' condition sometimes mistaken for 'ruined' (7 cases)
  • 'good' houses occasionally misclassified as 'medium' (3 cases)
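
Both the per-class table and the confusion matrix can be regenerated with scikit-learn. A minimal sketch; y_true and y_pred stand in for the integer class ids collected during the validation loop (the short lists below are placeholders, not the real predictions):

from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 2, 3, 3]  # placeholder ids: 0=dobre, 1=nepoznato, 2=oronule, 3=srednje
y_pred = [0, 1, 2, 2, 3]

class_names = ["dobre", "nepoznato", "oronule", "srednje"]
print(classification_report(y_true, y_pred, target_names=class_names))
print(confusion_matrix(y_true, y_pred))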

Usage

from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load model and processor
model = ViTForImageClassification.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")
processor = ViTImageProcessor.from_pretrained("YOUR_USERNAME/YOUR_MODEL_NAME")

# Load and preprocess image
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
# id2label keys are ints after loading, so index with the int directly
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")

# Get probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
for idx, prob in enumerate(probs):
    label = model.config.id2label[idx]  # int key, not str
    print(f"{label}: {prob.item():.2%}")

Limitations and Bias

  • The model was trained on a specific dataset of house images and may not generalize well to different architectural styles or regions
  • Performance varies by class - see validation metrics for details
  • The model may have difficulty distinguishing between similar condition categories
  • Dataset size: 935 images (relatively small for deep learning)
  • Images are from a specific geographical/architectural context

Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with the following approach (a code sketch follows the list):

  1. Pre-trained weights: Initialized from google/vit-base-patch16-224-in21k
  2. Classification head: Replaced with a new 4-class classifier
  3. Fine-tuning: All model parameters were fine-tuned on the custom dataset
  4. Data preprocessing: Images converted to RGB to ensure consistent 3-channel input
  5. Evaluation strategy: Evaluated every 50 steps with checkpoint saving
  6. Best model selection: Best model automatically loaded based on validation performance
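
A minimal sketch of steps 1-4 wired together, reusing training_args, train_ds, and val_ds from the sketches above; the variable names and the collator are illustrative, while the label order matches the class list:

import torch
from transformers import ViTForImageClassification, ViTImageProcessor, Trainer

class_names = ["dobre", "nepoznato", "oronule", "srednje"]

# Steps 1-2: pre-trained backbone with a fresh 4-class classification head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(class_names),
    id2label=dict(enumerate(class_names)),
    label2id={name: i for i, name in enumerate(class_names)},
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Step 4: convert to RGB so every image is a consistent 3-channel input
def transform(batch):
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate_fn(items):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in items]),
            "labels": torch.tensor([x["labels"] for x in items])}

# Steps 3, 5, 6: full fine-tuning; eval/checkpoint cadence and best-model
# selection come from training_args
trainer = Trainer(model=model, args=training_args, data_collator=collate_fn,
                  train_dataset=train_ds.with_transform(transform),
                  eval_dataset=val_ds.with_transform(transform))
trainer.train()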

Base Model

google/vit-base-patch16-224-in21k

Vision Transformer (ViT) model pre-trained on ImageNet-21k at resolution 224x224.

Framework Versions

  • Transformers: 4.57.1
  • PyTorch: 2.x
  • Datasets: 3.x
  • Python: 3.13

Citation

If you use this model, please cite:

@misc{house-condition-vit,
  author = {Your Name},
  title = {Fine-tuned ViT for House Condition Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/YOUR_MODEL_NAME}}
}

Model Card Authors

This model card was created by the model author.

Additional Information

  • Repository: [GitHub Repository URL]
  • Contact: [Your Email or Contact]