Kesheratmex committed · Commit 98eefdf · Parent(s): 2cf4b9b
**Add Grounding DINO zero‑shot detection fallback and logging**
Implemented a Grounding DINO fallback for zero‑shot object detection in `GPTOSSWrapper`, added detailed debug prints, updated comments, and introduced the necessary imports. Updated `app.py` to use the new detection logic and added a README for vision‑model usage.
- README_VISION_MODELS.md +151 -0
- app.py +9 -4
- gptoss_wrapper.py +270 -9
- requirements.txt +9 -3
README_VISION_MODELS.md
ADDED
@@ -0,0 +1,151 @@
# 🎯 KESHERAT AI - Zero-Shot Detection with OWL-V2 + Grounding DINO

## 🚀 **New Detection System**

We have migrated from YOLO to a **zero-shot detection** system that can find any defect you describe in text, with no prior training required.

### **🔧 Models Used:**

#### **1. Grounding DINO (Primary)**
- **Model**: `IDEA-Research/grounding-dino-base`
- **Advantages**: Excellent at zero-shot detection
- **Use**: Finds defects from natural-language text descriptions

#### **2. OWL-V2 (Fallback)**
- **Model**: `google/owlv2-large-patch14-ensemble`
- **Advantages**: Robust and reliable
- **Use**: Takes over if Grounding DINO fails

#### **3. GPT Vision (Analysis)**
- **Models**: GPT-4 Vision or BLIP/LLaVA
- **Use**: Detailed visual analysis in Spanish

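A minimal usage sketch of this fallback chain (assuming `GPTOSSWrapper` is imported from `gptoss_wrapper.py`; the image path and the short query list here are illustrative only):

```python
from gptoss_wrapper import GPTOSSWrapper

wrapper = GPTOSSWrapper()

# Grounding DINO is tried first; OWL-V2 takes over if it fails.
result = wrapper.detect_objects_owlv2(
    "blade.jpg",                   # illustrative image path
    ["crack", "erosion", "dirt"],  # defects described in plain text
    threshold=0.1,                 # minimum confidence to keep a detection
)

for det in result["detections"]:
    # Each detection has a label, a confidence score and an [x1, y1, x2, y2] box.
    print(det["label"], det["confidence"], det["bbox"])
```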
## 🎯 **Detection Queries**

The system automatically searches for these defects:

```python
DEFECT_QUERIES = [
    "crack", "grieta", "fisura",            # Cracks
    "erosion", "erosión", "desgaste",       # Erosion
    "dirt", "suciedad", "mancha",           # Dirt
    "damage", "daño", "impacto",            # Damage
    "corrosion", "corrosión", "oxidación",  # Corrosion
    "hole", "agujero", "perforación",       # Holes
    "stain", "mancha", "decoloración",      # Stains
    "wear", "desgaste", "deterioro",        # Wear
    "lightning damage", "daño por rayo",    # Lightning damage
    "bird strike", "impacto de ave"         # Bird strikes
]
```

## 🛠️ **Configuration in the HF Space**

### **Environment Variables (Optional):**

```bash
# For GPT Vision (optional)
HUGGINGFACE_API_TOKEN=your_hf_token
VISION_MODEL_ID=Salesforce/blip-image-captioning-base

# For OpenAI GPT-4 Vision (optional)
OPENAI_API_KEY=your_openai_key
```

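A minimal sketch of reading these variables at startup (standard library only; the attribute names actually used inside `GPTOSSWrapper` may differ):

```python
import os

# Optional credentials: local zero-shot detection does not need them,
# they only enable the GPT Vision analysis step.
hf_token = os.getenv("HUGGINGFACE_API_TOKEN")
vision_model = os.getenv("VISION_MODEL_ID", "Salesforce/blip-image-captioning-base")
openai_key = os.getenv("OPENAI_API_KEY")

if not hf_token and not openai_key:
    print("No API credentials set; the GPT Vision analysis step may be unavailable.")
```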
### **Required Dependencies:**

```
transformers>=4.35.0
torch==2.2.0
torchvision
accelerate
sentencepiece
Pillow
```

## 🔍 **Workflow**

1. **The user uploads an image/video**
2. **Grounding DINO** searches for the defects described in text
3. **OWL-V2** (fallback) takes over if Grounding DINO fails
4. **GPT Vision** analyzes the whole image
5. **The system** combines detections + analysis
6. **The user** receives the result in Spanish

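A minimal sketch of drawing the returned detections onto the uploaded image (OpenCV is already a dependency of the app; the helper name and the annotation code actually used in `app.py` may differ):

```python
import cv2

def annotate(image_path: str, detections: list, out_path: str = "annotated.jpg") -> str:
    """Draw each detection as a labeled box; `detections` follows the wrapper's format."""
    img = cv2.imread(image_path)
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        label = f'{det["label"]} {det["confidence"]:.2f}'
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 0, 255), 2)
        cv2.putText(img, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
    cv2.imwrite(out_path, img)
    return out_path
```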
## 💡 **Advantages of the New System**

### **vs YOLO:**
- ✅ **Zero-shot**: No training required
- ✅ **Flexible**: Finds any defect you can describe
- ✅ **Multilingual**: Works in Spanish and English
- ✅ **Easy to update**: Adding new defects is simple

### **Capabilities:**
- 🔍 **Precise detection** of specific defects
- 🎯 **Text-based search** ("crack on the edge")
- 🌍 **Multilingual** (Spanish/English)
- 🧠 **Intelligent analysis** with GPT
- 📊 **Detailed reports** in PDF/MD/JSON

## 🚀 **Using the HF Space**

### **1. Upload an Image/Video**
- Formats: JPG, PNG, MP4, AVI, MOV

### **2. Detect Defects**
- Click "Detectar defectos con OWL-V2 + GPT"
- The system automatically:
  - Searches for every defect in the list
  - Analyzes the image visually with GPT
  - Generates a complete report

### **3. View Results**
- **Annotated image** with the detections marked
- **GPT analysis** in Spanish
- **Downloadable reports** (PDF/MD/JSON)

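A minimal sketch of what the JSON report could contain (the real report schema generated by the app is not part of this commit, so the helper name and field names are illustrative):

```python
import json
from collections import Counter
from datetime import datetime, timezone

def export_json_report(detections: list, analysis_text: str, out_path: str = "report.json") -> str:
    """Write detections plus the GPT analysis to a JSON file (illustrative schema)."""
    report = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "counts": dict(Counter(d["label"] for d in detections)),  # e.g. {"dirt": 2, "erosion": 1}
        "detections": detections,   # [{"label", "confidence", "bbox"}, ...]
        "analysis": analysis_text,  # free-text GPT Vision analysis
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(report, f, ensure_ascii=False, indent=2)
    return out_path
```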
## 🔧 **Customization**

### **Adding New Defects:**
Edit `DEFECT_QUERIES` in `app.py`:

```python
DEFECT_QUERIES = [
    # Existing defects...
    "nuevo_defecto", "new defect",
    "otro_problema", "another issue"
]
```

### **Adjusting Sensitivity:**
Change the detection threshold:

```python
# More sensitive (more detections)
threshold = 0.05

# Less sensitive (fewer detections)
threshold = 0.2
```

## 🎯 **Expected Result**

```markdown
## 🔍 Direct Visual Analysis of the Blade

**Overall Condition:** Good, with minor maintenance required

**Automatic Detections:**
- Dirt (suciedad): 2 areas detected
- Erosion (erosión): 1 area on the leading edge

**GPT Analysis:**
The surface is in generally good condition, with two clearly
visible areas of accumulated dirt...

**Recommendations:**
- Scheduled cleaning within 2 weeks
- Erosion inspection in 3 months
```

The system is now much more powerful and flexible! 🎉

app.py
CHANGED

@@ -146,15 +146,17 @@ def infer_media(media_path, conf=0.1, out_res="720p"):
     writer = None
     counts = {}
 
-    # Configurar OWL-V2
+    # Configurar modelos de detección (OWL-V2 + Grounding DINO)
     try:
         GPTClass = _load_gptoss_wrapper()
         if GPTClass:
             wrapper = GPTClass()
+            print("Wrapper de detección configurado correctamente")
         else:
             wrapper = None
+            print("No se pudo cargar el wrapper de detección")
     except Exception as e:
-        print(f"Error configurando
+        print(f"Error configurando modelos de detección: {e}")
         wrapper = None
 
     # Procesar frames con OWL-V2 (cada 30 frames para eficiencia)

@@ -226,17 +228,20 @@ def infer_media(media_path, conf=0.1, out_res="720p"):
     elif ext in [".jpg", ".jpeg", ".png", ".bmp"]:
         img = cv2.imread(media_path)
 
-        # Usar
+        # Usar modelos de detección zero-shot (Grounding DINO + OWL-V2)
         try:
             GPTClass = _load_gptoss_wrapper()
             if GPTClass:
                 wrapper = GPTClass()
+                print(f"Iniciando detección zero-shot en imagen: {media_path}")
                 detection_result = wrapper.detect_objects_owlv2(media_path, DEFECT_QUERIES, threshold=conf)
                 detections = detection_result.get("detections", [])
+                print(f"Detecciones encontradas: {len(detections)}")
             else:
+                print("Wrapper no disponible, sin detecciones")
                 detections = []
         except Exception as e:
-            print(f"Error en detección
+            print(f"Error en detección zero-shot: {e}")
             detections = []
 
         counts = {}
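The retained comment "Procesar frames con OWL-V2 (cada 30 frames para eficiencia)" (process every 30th frame for efficiency) refers to the video branch of `infer_media`, which this diff does not show. A minimal sketch of that sampling pattern, with an illustrative helper name and temporary frame file:

```python
import cv2

def detect_on_video(video_path: str, wrapper, queries: list, conf: float = 0.1, stride: int = 30):
    """Run zero-shot detection on every `stride`-th frame; returns per-frame detections."""
    cap = cv2.VideoCapture(video_path)
    results, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % stride == 0:
            cv2.imwrite("_frame.jpg", frame)  # the wrapper expects an image path
            dets = wrapper.detect_objects_owlv2("_frame.jpg", queries, threshold=conf)
            results.append((frame_idx, dets.get("detections", [])))
        frame_idx += 1
    cap.release()
    return results
```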
gptoss_wrapper.py
CHANGED

@@ -23,6 +23,8 @@ import os
 import time
 import requests
 import base64
+import torch
+from PIL import Image
 from typing import Optional
 
 

@@ -115,7 +117,8 @@ class GPTOSSWrapper:
 
     def detect_objects_owlv2(self, image_path: str, text_queries: list, threshold: float = 0.1) -> dict:
         """
-        Detect objects in image using OWL-V2 zero-shot detection with text queries.
+        Detect objects in image using OWL-V2 or Grounding DINO zero-shot detection with text queries.
+        Runs on HF GPU when available.
 
         Args:
             image_path: Path to the image file

@@ -126,12 +129,24 @@ class GPTOSSWrapper:
         Dictionary with detections: {"detections": [{"label": str, "confidence": float, "bbox": [x1,y1,x2,y2]}, ...]}
 
         Raises:
-            RuntimeError if
+            RuntimeError if models not available or detection fails
         """
-
-        raise RuntimeError("OWL-V2 detection requires Hugging Face token. Set HUGGINGFACE_API_TOKEN.")
+        print(f"Starting zero-shot detection with {len(text_queries)} queries")
 
-
+        # Try Grounding DINO first (usually better for zero-shot), then OWL-V2 as fallback
+        try:
+            print("Attempting Grounding DINO detection...")
+            return self._detect_grounding_dino(image_path, text_queries, threshold)
+        except Exception as e:
+            print(f"Grounding DINO failed: {e}")
+            print("Falling back to OWL-V2...")
+            try:
+                return self._detect_owlv2_local(image_path, text_queries, threshold)
+            except Exception as e2:
+                print(f"OWL-V2 also failed: {e2}")
+                # Return empty detections instead of failing completely
+                print("Both models failed, returning empty detections")
+                return {"detections": []}
 
     def _generate_openai(self, prompt: str, max_tokens: int, temperature: float) -> str:
         if not self.openai_key:

@@ -435,6 +450,252 @@ class GPTOSSWrapper:
         except Exception as e:
             raise RuntimeError(f"Hugging Face Vision API call failed: {e}")
 
+    def _detect_grounding_dino(self, image_path: str, text_queries: list, threshold: float) -> dict:
+        """
+        Detect objects using Grounding DINO running on HF GPU.
+        """
+        try:
+            from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
+
+            # Load Grounding DINO model (will use HF GPU)
+            model_id = "IDEA-Research/grounding-dino-base"
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+
+            print(f"Loading Grounding DINO on device: {device}")
+            processor = AutoProcessor.from_pretrained(model_id)
+            model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
+
+            # Load image
+            image = Image.open(image_path)
+
+            # Prepare text queries (VERY important: lowercase + end with dot)
+            text = ". ".join([query.lower() for query in text_queries]) + "."
+            print(f"Grounding DINO text query: {text}")
+
+            # Process inputs
+            inputs = processor(images=image, text=text, return_tensors="pt").to(device)
+
+            # Run inference
+            with torch.no_grad():
+                outputs = model(**inputs)
+
+            # Post-process results
+            results = processor.post_process_grounded_object_detection(
+                outputs,
+                inputs.input_ids,
+                box_threshold=threshold,
+                text_threshold=0.3,
+                target_sizes=[image.size[::-1]]
+            )
+
+            # Convert to our format
+            detections = []
+            if results and len(results) > 0:
+                result = results[0]
+                boxes = result.get("boxes", [])
+                scores = result.get("scores", [])
+                labels = result.get("labels", [])
+
+                print(f"Grounding DINO found {len(boxes)} detections")
+
+                for box, score, label_idx in zip(boxes, scores, labels):
+                    if score >= threshold:
+                        x1, y1, x2, y2 = box.tolist()
+                        label = text_queries[label_idx] if label_idx < len(text_queries) else "unknown"
+
+                        detections.append({
+                            "label": label,
+                            "confidence": float(score),
+                            "bbox": [int(x1), int(y1), int(x2), int(y2)]
+                        })
+
+            return {"detections": detections}
+
+        except Exception as e:
+            raise RuntimeError(f"Grounding DINO detection failed: {e}")
+
+    def _detect_owlv2_local(self, image_path: str, text_queries: list, threshold: float) -> dict:
+        """
+        Detect objects using OWL-V2 running on HF GPU.
+        """
+        try:
+            from transformers import Owlv2Processor, Owlv2ForObjectDetection
+
+            # Load OWL-V2 model (will use HF GPU)
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+            print(f"Loading OWL-V2 on device: {device}")
+
+            processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-ensemble")
+            model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-ensemble").to(device)
+
+            # Load image
+            image = Image.open(image_path)
+
+            # Prepare text queries (format: [["query1", "query2", ...]])
+            texts = [text_queries]
+            print(f"OWL-V2 text queries: {texts}")
+
+            # Process inputs
+            inputs = processor(text=texts, images=image, return_tensors="pt").to(device)
+
+            # Run inference
+            with torch.no_grad():
+                outputs = model(**inputs)
+
+            # Target image sizes for rescaling
+            target_sizes = torch.Tensor([image.size[::-1]])
+
+            # Post-process results
+            results = processor.post_process_object_detection(
+                outputs=outputs,
+                target_sizes=target_sizes,
+                threshold=threshold
+            )
+
+            # Convert to our format
+            detections = []
+            if results and len(results) > 0:
+                result = results[0]
+                boxes = result.get("boxes", [])
+                scores = result.get("scores", [])
+                labels = result.get("labels", [])
+
+                print(f"OWL-V2 found {len(boxes)} detections")
+
+                for box, score, label_idx in zip(boxes, scores, labels):
+                    if score >= threshold:
+                        x1, y1, x2, y2 = box.tolist()
+                        label = text_queries[label_idx] if label_idx < len(text_queries) else "unknown"
+
+                        detections.append({
+                            "label": label,
+                            "confidence": float(score),
+                            "bbox": [int(x1), int(y1), int(x2), int(y2)]
+                        })
+
+            return {"detections": detections}
+
+        except Exception as e:
+            raise RuntimeError(f"OWL-V2 detection failed: {e}")
+
+    def _detect_grounding_dino(self, image_path: str, text_queries: list, threshold: float) -> dict:
+        """
+        Detect objects using Grounding DINO locally.
+        """
+        try:
+            from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
+
+            # Load Grounding DINO model
+            model_id = "IDEA-Research/grounding-dino-base"
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+
+            processor = AutoProcessor.from_pretrained(model_id)
+            model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
+
+            # Load image
+            image = Image.open(image_path)
+
+            # Prepare text queries (VERY important: lowercase + end with dot)
+            text = ". ".join([query.lower() for query in text_queries]) + "."
+
+            # Process inputs
+            inputs = processor(images=image, text=text, return_tensors="pt").to(device)
+
+            # Run inference
+            with torch.no_grad():
+                outputs = model(**inputs)
+
+            # Post-process results
+            results = processor.post_process_grounded_object_detection(
+                outputs,
+                inputs.input_ids,
+                box_threshold=threshold,
+                text_threshold=0.3,
+                target_sizes=[image.size[::-1]]
+            )
+
+            # Convert to our format
+            detections = []
+            if results and len(results) > 0:
+                result = results[0]
+                boxes = result.get("boxes", [])
+                scores = result.get("scores", [])
+                labels = result.get("labels", [])
+
+                for box, score, label_idx in zip(boxes, scores, labels):
+                    if score >= threshold:
+                        x1, y1, x2, y2 = box.tolist()
+                        label = text_queries[label_idx] if label_idx < len(text_queries) else "unknown"
+
+                        detections.append({
+                            "label": label,
+                            "confidence": float(score),
+                            "bbox": [int(x1), int(y1), int(x2), int(y2)]
+                        })
+
+            return {"detections": detections}
+
+        except Exception as e:
+            raise RuntimeError(f"Grounding DINO detection failed: {e}")
+
+    def _detect_owlv2_local(self, image_path: str, text_queries: list, threshold: float) -> dict:
+        """
+        Detect objects using OWL-V2 locally.
+        """
+        try:
+            from transformers import Owlv2Processor, Owlv2ForObjectDetection
+
+            # Load OWL-V2 model
+            processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-ensemble")
+            model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-ensemble")
+
+            # Load image
+            image = Image.open(image_path)
+
+            # Prepare text queries (format: [["query1", "query2", ...]])
+            texts = [text_queries]
+
+            # Process inputs
+            inputs = processor(text=texts, images=image, return_tensors="pt")
+
+            # Run inference
+            with torch.no_grad():
+                outputs = model(**inputs)
+
+            # Target image sizes for rescaling
+            target_sizes = torch.Tensor([image.size[::-1]])
+
+            # Post-process results
+            results = processor.post_process_object_detection(
+                outputs=outputs,
+                target_sizes=target_sizes,
+                threshold=threshold
+            )
+
+            # Convert to our format
+            detections = []
+            if results and len(results) > 0:
+                result = results[0]
+                boxes = result.get("boxes", [])
+                scores = result.get("scores", [])
+                labels = result.get("labels", [])
+
+                for box, score, label_idx in zip(boxes, scores, labels):
+                    if score >= threshold:
+                        x1, y1, x2, y2 = box.tolist()
+                        label = text_queries[label_idx] if label_idx < len(text_queries) else "unknown"
+
+                        detections.append({
+                            "label": label,
+                            "confidence": float(score),
+                            "bbox": [int(x1), int(y1), int(x2), int(y2)]
+                        })
+
+            return {"detections": detections}
+
+        except Exception as e:
+            raise RuntimeError(f"OWL-V2 detection failed: {e}")
+
     def _detect_owlv2_hf(self, image_path: str, text_queries: list, threshold: float) -> dict:
         """
         Detect objects using OWL-V2 via Hugging Face Inference API.

@@ -445,12 +706,12 @@
         except Exception as e:
             raise RuntimeError(f"Failed to read image file {image_path}: {e}")
 
-        #
-
-        url = f"https://api-inference.huggingface.co/models/{
+        # DETR model endpoint (object detection)
+        detr_model = os.getenv("DETR_MODEL_ID", "facebook/detr-resnet-101")
+        url = f"https://api-inference.huggingface.co/models/{detr_model}"
         headers = {"Authorization": f"Bearer {self.hf_token}"}
 
-        # Prepare payload for
+        # Prepare payload for DETR
         # OWL-V2 expects image as binary data and text queries as parameters
         payload = {
             "parameters": {
requirements.txt
CHANGED

@@ -1,9 +1,15 @@
-
+# Vision models for zero-shot detection
+transformers>=4.35.0
+torch==2.2.0
+torchvision
+# UI and processing
 gradio==4.36.1  # 4.x permite auth=
 opencv-python-headless
 reportlab==3.6.13
 requests  # For GPT-OSS API calls
+Pillow
 # Fijar NumPy 1.x para compatibilidad con PyTorch 2.2 en ZeroGPU
 numpy==1.26.4
-#
-
+# Additional dependencies for vision models
+accelerate
+sentencepiece