🔬 STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments


🎯 Model Description

STELLA-VLM-FineBio-7B (Scientific Tool for Experiment Lab Learning and Analysis - Vision Language Model) is a vision-language model fine-tuned specifically for biological and laboratory experiment analysis. Building on the STELLA framework for multimodal molecular and materials science understanding, it was trained in two stages on two specialized datasets:

  1. JoVE (Journal of Visualized Experiments): Laboratory protocol extraction from scientific videos
  2. FineBio: Biological experiment video protocol prediction and analysis (95,690 samples)

The model excels at understanding complex laboratory procedures, predicting experimental steps, and producing detailed protocol analyses from visual content, extending STELLA's capabilities into the biological experimentation domain.

Key Capabilities

  • 🔬 Protocol Extraction: Generate detailed step-by-step laboratory protocols from videos
  • 🔮 Procedure Prediction: Predict next experimental steps based on video context
  • 🧪 Biological Expertise: Specialized understanding of biological experiments
  • 📸 Image Analysis: Comprehensive analysis of laboratory images and equipment
  • ⚠️ Error Detection: Identify experimental errors and safety concerns
  • 🛡️ Safety Assessment: Evaluate laboratory safety compliance
  • 🧬 Molecular Understanding: Leverage STELLA's molecular science capabilities
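
Each capability is invoked with ordinary natural-language prompting; the snippet below collects illustrative prompts (the wording is our suggestion, not an official prompt set):

# Illustrative prompts for each capability (suggested wording, not an official prompt set)
CAPABILITY_PROMPTS = {
    "protocol_extraction":  "Write a numbered, step-by-step protocol for the procedure shown in this video.",
    "procedure_prediction": "Based on the steps shown so far, predict the next experimental steps.",
    "image_analysis":       "Describe the laboratory setup, equipment, and reagents visible in this image.",
    "error_detection":      "Identify any procedural errors or deviations from standard technique.",
    "safety_assessment":    "Assess this setup for laboratory safety compliance (PPE, containment, labeling).",
}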

🚀 Quick Start

Installation

pip install torch transformers accelerate pillow opencv-python

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    trust_remote_code=True
)

# Analyze laboratory image
image = Image.open("lab_experiment.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the biological experiment shown in this image and predict the next steps:"},
        {"type": "image", "image": image}
    ]
}]

# Process and generate
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the echoed prompt
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)

Using STELLA-VLM Tools

The model includes specialized tools for laboratory analysis:

from stella_vlm_tool import (
    extract_protocol_from_video,
    analyze_lab_image,
    detect_experimental_errors,
    generate_safety_assessment
)

# Extract protocol from laboratory video
protocol = extract_protocol_from_video("experiment.mp4")

# Detect errors in experimental procedure
errors = detect_experimental_errors("lab_procedure.mp4")

# Generate safety assessment
safety_report = generate_safety_assessment("lab_setup.jpg")
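
If the stella_vlm_tool helpers are not available in your environment, a similar workflow can be approximated with plain transformers and OpenCV. The sketch below is an unofficial approximation that reuses model, processor, and torch from the Basic Usage section; the frame count, sampling strategy, and prompt wording are our own choices:

import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample num_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB for PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("experiment.mp4")
messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f} for f in frames]
             + [{"type": "text", "text": "Extract a step-by-step protocol from these video frames."}],
}]
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=frames, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024)
protocol = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(protocol)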

📊 Training Details

Two-Stage Fine-tuning

Stage 1: JoVE Training

  • Dataset: Journal of Visualized Experiments videos
  • Focus: Laboratory protocol extraction
  • Task: Converting visual demonstrations to written protocols

Stage 2: FineBio Training

  • Dataset: FineBio (95,690 samples from biological experiment videos)
  • Focus: Experiment prediction and analysis
  • Task: Predicting future experimental steps from video context

Training Configuration

  • Base Model: nvidia/Cosmos-Reason1-7B
  • Architecture: Qwen2.5-VL
  • Training Method: LoRA fine-tuning followed by merge
  • Hardware: 8x NVIDIA H100 GPUs
  • Precision: bfloat16
  • Framework: Built on STELLA multimodal foundation
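
For reference, merging a LoRA adapter back into the base model with PEFT typically looks like the sketch below; the adapter path is hypothetical, and the published checkpoint is already merged, so this is only illustrative:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# Load the base model, attach a LoRA adapter, fold the weights in, and save
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason1-7B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "path/to/finebio-lora-adapter")  # hypothetical adapter path
merged = model.merge_and_unload()  # merges LoRA deltas into the base weights
merged.save_pretrained("STELLA-VLM-FineBio-7B-merged")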

🔬 Example Outputs

Biological Protocol Extraction

Input: [Video of cell culture procedure]
Output:
Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Remove culture medium from the T75 flask
Step 3: Add 5 mL PBS to wash cells, swirl gently
Step 4: Aspirate PBS and add 2 mL of 0.25% trypsin-EDTA
Step 5: Incubate at 37°C for 3-5 minutes
Step 6: Add 8 mL complete medium to neutralize trypsin
...

Experiment Prediction

Input: [Video showing PCR setup]
Output:
Current step: Adding primers to PCR reaction mix
Next predicted steps:
1. Add template DNA (1-2 µL)
2. Add polymerase enzyme (0.5 µL)
3. Mix gently by pipetting
4. Quick spin in microcentrifuge
5. Place in thermal cycler with appropriate program

🎯 Use Cases

  1. Laboratory Training: Generate training protocols from demonstration videos
  2. Protocol Documentation: Automatically document experimental procedures
  3. Quality Control: Verify correct experimental procedures
  4. Safety Compliance: Assess laboratory safety practices
  5. Education: Create educational content from research videos
  6. Research Acceleration: Speed up protocol development and optimization

⚡ Performance

  • Inference Speed: ~2-3 seconds per image/frame on A100
  • Memory Requirements: ~16GB VRAM for inference
  • Supported Formats: Images (JPG, PNG) and Videos (MP4, AVI)
  • Batch Processing: Supported for multiple frames/images
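
A minimal sketch of batched image analysis, reusing model and processor from the Basic Usage section (file names and the prompt are illustrative, and we assume the processor pads batched inputs):

# Analyze several images in a single forward pass
image_paths = ["frame1.jpg", "frame2.jpg"]  # illustrative file names
images = [Image.open(p) for p in image_paths]
prompt = "Describe the experimental step shown in this image."

texts = []
for img in images:
    msgs = [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image", "image": img},
    ]}]
    texts.append(processor.apply_chat_template(msgs, add_generation_prompt=True, tokenize=False))

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
responses = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)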

📝 Limitations

  • Optimized for biological/laboratory content; accuracy may degrade on out-of-domain imagery
  • Best performance with clear, well-lit laboratory videos
  • English language only
  • May require domain expertise to validate outputs

🙏 Acknowledgments

  • NVIDIA for the Cosmos-Reason base model
  • JoVE for visualized experiment protocols
  • FineBio dataset contributors
  • STELLA framework developers for multimodal foundation
  • Open-source community for transformers library

📖 Citation

If you use this model in your research, please cite both STELLA and this work:

@article{shen2024stella,
  title={STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment},
  author={Shen, Jaewoo and Tian, Yuxuan and Liang, Heng and Jin, Zitian and Liu, Shiqi and Tegmark, Max and Jaakkola, Tommi S. and others},
  journal={arXiv preprint arXiv:2507.02004},
  year={2024},
  url={https://arxiv.org/abs/2507.02004}
}

@software{stella_vlm_finebio_2024,
  title={STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments},
  author={Zhang, Zaixi},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Zaixi/STELLA-VLM-FineBio-7B}
}

📄 License

MIT License - See LICENSE file for details
