🔬 STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments


🎯 Model Description

STELLA-VLM-FineBio-7B (Scientific Tool for Experiment Lab Learning and Analysis - Vision Language Model) is a vision-language model fine-tuned specifically for biological and laboratory experiment analysis. Building on the STELLA framework for multimodal molecular and materials science understanding, it was trained in two stages on two specialized datasets:

  1. JoVE (Journal of Visualized Experiments): Laboratory protocol extraction from scientific videos
  2. FineBio: Biological experiment video protocol prediction and analysis (95,690 samples)

The model excels at understanding complex laboratory procedures, predicting experimental steps, and producing detailed protocol analyses from visual content, extending STELLA's capabilities into the biological experimentation domain.

Key Capabilities

  • 🔬 Protocol Extraction: Generate detailed step-by-step laboratory protocols from videos
  • 🔮 Procedure Prediction: Predict next experimental steps based on video context
  • 🧪 Biological Expertise: Specialized understanding of biological experiments
  • 📸 Image Analysis: Comprehensive analysis of laboratory images and equipment
  • ⚠️ Error Detection: Identify experimental errors and safety concerns
  • 🛡️ Safety Assessment: Evaluate laboratory safety compliance
  • 🧬 Molecular Understanding: Leverage STELLA's molecular science capabilities
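
Each capability is invoked with ordinary natural-language prompting; the snippet below collects illustrative prompts (the wording is our suggestion, not an official prompt set):

# Illustrative prompts for each capability (suggested wording, not an official prompt set)
CAPABILITY_PROMPTS = {
    "protocol_extraction":  "Write a numbered, step-by-step protocol for the procedure shown in this video.",
    "procedure_prediction": "Based on the steps shown so far, predict the next experimental steps.",
    "image_analysis":       "Describe the laboratory setup, equipment, and reagents visible in this image.",
    "error_detection":      "Identify any procedural errors or deviations from standard technique.",
    "safety_assessment":    "Assess this setup for laboratory safety compliance (PPE, containment, labeling).",
}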

🚀 Quick Start

Installation

pip install torch transformers accelerate pillow opencv-python

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    trust_remote_code=True
)

# Analyze laboratory image
image = Image.open("lab_experiment.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the biological experiment shown in this image and predict the next steps:"},
        {"type": "image", "image": image}
    ]
}]

# Process and generate
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the echoed prompt
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)

Using STELLA-VLM Tools

The model includes specialized tools for laboratory analysis:

from stella_vlm_tool import (
    extract_protocol_from_video,
    analyze_lab_image,
    detect_experimental_errors,
    generate_safety_assessment
)

# Extract protocol from laboratory video
protocol = extract_protocol_from_video("experiment.mp4")

# Detect errors in experimental procedure
errors = detect_experimental_errors("lab_procedure.mp4")

# Generate safety assessment
safety_report = generate_safety_assessment("lab_setup.jpg")
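
If the stella_vlm_tool helpers are not available in your environment, a similar workflow can be approximated with plain transformers and OpenCV. The sketch below is an unofficial approximation that reuses model, processor, and torch from the Basic Usage section; the frame count, sampling strategy, and prompt wording are our own choices:

import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample num_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB for PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("experiment.mp4")
messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f} for f in frames]
             + [{"type": "text", "text": "Extract a step-by-step protocol from these video frames."}],
}]
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=frames, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024)
protocol = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(protocol)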

📊 Training Details

Two-Stage Fine-tuning

Stage 1: JoVE Training

  • Dataset: Journal of Visualized Experiments videos
  • Focus: Laboratory protocol extraction
  • Task: Converting visual demonstrations to written protocols

Stage 2: FineBio Training

  • Dataset: FineBio (95,690 samples from biological experiment videos)
  • Focus: Experiment prediction and analysis
  • Task: Predicting future experimental steps from video context

Training Configuration

  • Base Model: nvidia/Cosmos-Reason1-7B
  • Architecture: Qwen2.5-VL
  • Training Method: LoRA fine-tuning followed by merge
  • Hardware: 8x NVIDIA H100 GPUs
  • Precision: bfloat16
  • Framework: Built on STELLA multimodal foundation
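
For reference, merging a LoRA adapter back into the base model with PEFT typically looks like the sketch below; the adapter path is hypothetical, and the published checkpoint is already merged, so this is only illustrative:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# Load the base model, attach a LoRA adapter, fold the weights in, and save
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason1-7B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "path/to/finebio-lora-adapter")  # hypothetical adapter path
merged = model.merge_and_unload()  # merges LoRA deltas into the base weights
merged.save_pretrained("STELLA-VLM-FineBio-7B-merged")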

🔬 Example Outputs

Biological Protocol Extraction

Input: [Video of cell culture procedure]
Output:
Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Remove culture medium from the T75 flask
Step 3: Add 5 mL PBS to wash cells, swirl gently
Step 4: Aspirate PBS and add 2 mL of 0.25% trypsin-EDTA
Step 5: Incubate at 37°C for 3-5 minutes
Step 6: Add 8 mL complete medium to neutralize trypsin
...

Experiment Prediction

Input: [Video showing PCR setup]
Output:
Current step: Adding primers to PCR reaction mix
Next predicted steps:
1. Add template DNA (1-2 µL)
2. Add polymerase enzyme (0.5 µL)
3. Mix gently by pipetting
4. Quick spin in microcentrifuge
5. Place in thermal cycler with appropriate program

🎯 Use Cases

  1. Laboratory Training: Generate training protocols from demonstration videos
  2. Protocol Documentation: Automatically document experimental procedures
  3. Quality Control: Verify correct experimental procedures
  4. Safety Compliance: Assess laboratory safety practices
  5. Education: Create educational content from research videos
  6. Research Acceleration: Speed up protocol development and optimization

⚡ Performance

  • Inference Speed: ~2-3 seconds per image/frame on A100
  • Memory Requirements: ~16GB VRAM for inference
  • Supported Formats: Images (JPG, PNG) and Videos (MP4, AVI)
  • Batch Processing: Supported for multiple frames/images
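
A minimal sketch of batched image analysis, reusing model and processor from the Basic Usage section (file names and the prompt are illustrative, and we assume the processor pads batched inputs):

# Analyze several images in a single forward pass
image_paths = ["frame1.jpg", "frame2.jpg"]  # illustrative file names
images = [Image.open(p) for p in image_paths]
prompt = "Describe the experimental step shown in this image."

texts = []
for img in images:
    msgs = [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image", "image": img},
    ]}]
    texts.append(processor.apply_chat_template(msgs, add_generation_prompt=True, tokenize=False))

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
responses = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)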

📝 Limitations

  • Optimized for biological/laboratory content; accuracy may degrade on out-of-domain imagery
  • Best performance with clear, well-lit laboratory videos
  • English language only
  • May require domain expertise to validate outputs

🙏 Acknowledgments

  • NVIDIA for the Cosmos-Reason base model
  • JoVE for visualized experiment protocols
  • FineBio dataset contributors
  • STELLA framework developers for multimodal foundation
  • Open-source community for transformers library

📖 Citation

If you use this model in your research, please cite both STELLA and this work:

@article{shen2024stella,
  title={STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment},
  author={Shen, Jaewoo and Tian, Yuxuan and Liang, Heng and Jin, Zitian and Liu, Shiqi and Tegmark, Max and Jaakkola, Tommi S. and others},
  journal={arXiv preprint arXiv:2507.02004},
  year={2024},
  url={https://arxiv.org/abs/2507.02004}
}

@software{stella_vlm_finebio_2024,
  title={STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments},
  author={Zhang, Zaixi},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Zaixi/STELLA-VLM-FineBio-7B}
}

📄 License

MIT License - See LICENSE file for details
