🔬 STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments
🎯 Model Description
STELLA-VLM-FineBio-7B (Scientific Tool for Experiment Lab Learning and Analysis - Vision Language Model) is an advanced vision-language model specifically fine-tuned for biological and laboratory experiment analysis. Building upon the STELLA framework for multimodal molecular and materials science understanding, this model combines training from two specialized datasets:
- JoVE (Journal of Visualized Experiments): Laboratory protocol extraction from scientific videos
- FineBio: Biological experiment video protocol prediction and analysis (95,690 samples)
The model excels at understanding complex laboratory procedures, predicting experimental steps, and providing detailed protocol analysis from visual content, extending the STELLA capabilities into biological experimentation domains.
Key Capabilities
- 🔬 Protocol Extraction: Generate detailed step-by-step laboratory protocols from videos
- 🔮 Procedure Prediction: Predict next experimental steps based on video context
- 🧪 Biological Expertise: Specialized understanding of biological experiments
- 📸 Image Analysis: Comprehensive analysis of laboratory images and equipment
- ⚠️ Error Detection: Identify experimental errors and safety concerns
- 🛡️ Safety Assessment: Evaluate laboratory safety compliance
- 🧬 Molecular Understanding: Leverage STELLA's molecular science capabilities
🚀 Quick Start
Installation
pip install torch transformers accelerate pillow opencv-python
Basic Usage
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

# Load model (device_map="auto" requires the accelerate package)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Zaixi/STELLA-VLM-FineBio-7B",
    trust_remote_code=True
)

# Analyze laboratory image
image = Image.open("lab_experiment.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the biological experiment shown in this image and predict the next steps:"},
        {"type": "image", "image": image}
    ]
}]

# Process and generate
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Move the input tensors to the same device as the model
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Drop the prompt tokens so that only the newly generated text is decoded
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
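The Basic Usage example takes a single image. For video input, one possible approach is to sample evenly spaced frames with OpenCV and pass them to the processor as images. This is an assumption for illustration, not the documented pipeline of this model, and `sample_frame_indices` and `read_frames` are hypothetical helper names. A minimal sketch:

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Take the midpoint of each of num_samples equal-width segments.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def read_frames(video_path: str, num_samples: int = 8):
    """Read sampled frames as PIL images (needs opencv-python and pillow)."""
    import cv2
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_frame_indices(total, num_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame_bgr = cap.read()
        if not ok:
            continue  # skip frames that fail to decode
        # OpenCV decodes to BGR; PIL expects RGB.
        frames.append(Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```

Sampling segment midpoints avoids the first and last frames, which in lab recordings are often setup or teardown shots. The number of frames (8 here) is an arbitrary default; more frames cost more memory and time per request.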
Using STELLA-VLM Tools
The model includes specialized tools for laboratory analysis:
from stella_vlm_tool import (
    extract_protocol_from_video,
    analyze_lab_image,
    detect_experimental_errors,
    generate_safety_assessment
)
# Extract protocol from laboratory video
protocol = extract_protocol_from_video("experiment.mp4")
# Detect errors in experimental procedure
errors = detect_experimental_errors("lab_procedure.mp4")
# Generate safety assessment
safety_report = generate_safety_assessment("lab_setup.jpg")
📊 Training Details
Two-Stage Fine-tuning
Stage 1: JoVE Training
- Dataset: Journal of Visualized Experiments videos
- Focus: Laboratory protocol extraction
- Task: Converting visual demonstrations to written protocols
Stage 2: FineBio Training
- Dataset: FineBio biological experiment videos (95,690 samples)
- Focus: Experiment prediction and analysis
- Task: Predicting future experimental steps from video context
Training Configuration
- Base Model: nvidia/Cosmos-Reason1-7B
- Architecture: Qwen2.5-VL
- Training Method: LoRA fine-tuning followed by merge
- Hardware: 8x NVIDIA H100 GPUs
- Precision: bfloat16
- Framework: Built on STELLA multimodal foundation
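The training method is listed as LoRA fine-tuning followed by a merge. In the standard LoRA formulation, merging folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, so no adapter is needed at inference. The sketch below illustrates this with tiny matrices in plain Python; the rank, scaling factor, and values are illustrative only and are not the settings used to train this model.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Illustrative values only: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, rank)
A = [[0.5, -0.5]]    # shape (rank, 2)
merged = merge_lora(W, B, A, alpha=2.0, rank=1)
print(merged)
```

Applying `merged` to an input gives the same result as applying `W` and the scaled adapter path separately, which is consistent with the checkpoint loading as a plain Qwen2.5-VL model in Basic Usage.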
🔬 Example Outputs
Biological Protocol Extraction
Input: [Video of cell culture procedure]
Output:
Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Remove culture medium from the T75 flask
Step 3: Add 5 mL PBS to wash cells, swirl gently
Step 4: Aspirate PBS and add 2 mL of 0.25% trypsin-EDTA
Step 5: Incubate at 37°C for 3-5 minutes
Step 6: Add 8 mL complete medium to neutralize trypsin
...
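If the model output follows the "Step N: ..." layout shown above, it can be turned into structured data for downstream documentation or checks. The parser below is a minimal sketch under that assumption; `parse_protocol_steps` is a hypothetical helper, and real outputs may deviate from this format.

```python
import re
from typing import List, Tuple

# Matches lines such as "Step 3: Add 5 mL PBS to wash cells, swirl gently".
STEP_PATTERN = re.compile(r"^\s*Step\s+(\d+)\s*:\s*(.+?)\s*$", re.IGNORECASE)


def parse_protocol_steps(text: str) -> List[Tuple[int, str]]:
    """Extract (step_number, instruction) pairs from 'Step N: ...' lines."""
    steps = []
    for line in text.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            steps.append((int(match.group(1)), match.group(2)))
    return steps


sample = """Step 1: Prepare sterile PBS buffer at room temperature
Step 2: Remove culture medium from the T75 flask
..."""
print(parse_protocol_steps(sample))
```

Lines that do not match the pattern, such as the trailing "...", are ignored. Checking that the parsed step numbers are consecutive is a cheap way to flag truncated or malformed generations.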
Experiment Prediction
Input: [Video showing PCR setup]
Output:
Current step: Adding primers to PCR reaction mix
Next predicted steps:
1. Add template DNA (1-2 µL)
2. Add polymerase enzyme (0.5 µL)
3. Mix gently by pipetting
4. Quick spin in microcentrifuge
5. Place in thermal cycler with appropriate program
🎯 Use Cases
- Laboratory Training: Generate training protocols from demonstration videos
- Protocol Documentation: Automatically document experimental procedures
- Quality Control: Verify correct experimental procedures
- Safety Compliance: Assess laboratory safety practices
- Education: Create educational content from research videos
- Research Acceleration: Speed up protocol development and optimization
⚡ Performance
- Inference Speed: ~2-3 seconds per image/frame on A100
- Memory Requirements: ~16GB VRAM for inference
- Supported Formats: Images (JPG, PNG) and Videos (MP4, AVI)
- Batch Processing: Supported for multiple frames/images
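The ~16GB VRAM figure can be sanity-checked with simple arithmetic. The sketch below assumes a nominal 7 billion parameters stored in bfloat16 (2 bytes each); the true parameter count, including the vision encoder, may differ, and activations, the KV cache, and framework overhead come on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: a nominal 7e9 parameters in bfloat16 (2 bytes per parameter).
weights_gb = weight_memory_gb(7e9, 2)
print(f"{weights_gb:.1f} GB for weights")
```

Under these assumptions the weights take about 14 GB, which leaves roughly 2 GB of the quoted figure for everything else. Long prompts, many video frames, or large batches can push usage above that.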
📝 Limitations
- Optimized for biological/laboratory content
- Best performance with clear, well-lit laboratory videos
- English language only
- May require domain expertise to validate outputs
🙏 Acknowledgments
- NVIDIA for the Cosmos-Reason base model
- JoVE for visualized experiment protocols
- FineBio dataset contributors
- STELLA framework developers for multimodal foundation
- Open-source community for transformers library
📖 Citation
If you use this model in your research, please cite both STELLA and this work:
@article{shen2024stella,
title={STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment},
author={Shen, Jaewoo and Tian, Yuxuan and Liang, Heng and Jin, Zitian and Liu, Shiqi and Tegmark, Max and Jaakkola, Tommi S. and others},
journal={arXiv preprint arXiv:2507.02004},
year={2025},
url={https://arxiv.org/abs/2507.02004}
}
@software{stella_vlm_finebio_2024,
title = {STELLA-VLM-FineBio-7B: Specialized Vision-Language Model for Biological Experiments},
author = {Zaixi Zhang},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/Zaixi/STELLA-VLM-FineBio-7B}
}
📄 License
MIT License - See LICENSE file for details