Dripper(MinerU-HTML)
Dripper(MinerU-HTML) is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
Features
- 🚀 LLM-Powered Extraction: Uses state-of-the-art language models to intelligently identify main content
- 🎯 State Machine Guidance: Implements logits processing with state machines for structured JSON output
- 🔄 Fallback Mechanism: Automatically falls back to alternative extraction methods on errors
- 📊 Comprehensive Evaluation: Built-in evaluation framework with ROUGE and item-level metrics
- 🌐 REST API Server: FastAPI-based server for easy integration
- ⚡ Distributed Processing: Ray-based parallel processing for large-scale evaluation
- 🔧 Multiple Extractors: Supports various baseline extractors for comparison
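The state-machine-guidance idea can be sketched in a few lines: a finite-state machine tracks where generation currently is inside a JSON structure and rejects any continuation that would leave the grammar. The toy below is character-level and illustrative only (Dripper's actual logits processors operate on model token IDs, and its states/transitions are assumptions here):

```python
# Toy sketch of state-machine-guided generation: a tiny FSM over a
# flat JSON array of strings, e.g. ["x"]. Real systems (like Dripper's
# logits processors) apply the same idea to model token IDs.
ALLOWED = {
    "start":     {"[": "open"},
    "open":      {'"': "in_str", "]": "done"},
    "in_str":    {"x": "in_str", '"': "after_str"},  # 'x' stands in for any character
    "after_str": {",": "open", "]": "done"},
}

def legal_next(state):
    """Return the set of characters the FSM permits from `state`."""
    return set(ALLOWED.get(state, {}))

def step(state, ch):
    """Advance the FSM; raise if `ch` is not permitted (i.e. would be masked)."""
    if ch not in ALLOWED.get(state, {}):
        raise ValueError(f"{ch!r} not allowed in state {state!r}")
    return ALLOWED[state][ch]

# Walk a valid sequence: every character is checked before it is "emitted".
state = "start"
for ch in '["x"]':
    state = step(state, ch)
print(state)  # -> done
```

In a real logits processor, `legal_next` would be turned into a mask that sets the logits of all disallowed tokens to negative infinity before sampling.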
Installation
Prerequisites
- Python >= 3.10
- CUDA-capable GPU (recommended for LLM inference)
- Sufficient memory for model loading
Install from Source
The installation process automatically handles dependencies. The setup.py reads dependencies from requirements.txt and optionally from baselines.txt.
Basic Installation (Core Functionality)
For basic usage of Dripper, install with core dependencies only:
# Clone the repository
git clone https://github.com/opendatalab/MinerU-HTML
cd MinerU-HTML
# Install the package with core dependencies only
# Dependencies from requirements.txt are automatically installed
pip install .
Installation with Baseline Extractors (for Evaluation)
If you need to run baseline evaluations and comparisons, install with the baselines extra:
# Install with baseline extractor dependencies
pip install -e .[baselines]
This will install additional libraries required for baseline extractors:
- readabilipy, readability_lxml - Readability-based extractors
- resiliparse - Resilient HTML parsing
- justext - JustText extractor
- gne - General News Extractor
- goose3 - Goose3 article extractor
- boilerpy3 - Boilerplate removal
- crawl4ai - AI-powered web content extraction
Note: The baseline extractors are only needed for running comparative evaluations. For basic usage of Dripper, the core installation is sufficient.
Quick Start
1. Download the model
Visit our model at MinerU-HTML on Hugging Face and download it with the following command:
huggingface-cli download opendatalab/MinerU-HTML
2. Using the Python API
from dripper.api import Dripper
# Initialize Dripper with model configuration
dripper = Dripper(
config={
'model_path': '/path/to/your/model',
'tp': 1, # Tensor parallel size
'state_machine': None, # or 'v1' or 'v2'
'use_fall_back': True,
'raise_errors': False,
}
)
# Extract main content from HTML
html_content = "<html>...</html>"
result = dripper.process(html_content)
# Access results
main_html = result[0].main_html
3. Using the REST API Server
# Start the server
python -m dripper.server \
--model_path /path/to/your/model \
--state_machine v2 \
--port 7986
# Or use environment variables
export DRIPPER_MODEL_PATH=/path/to/your/model
export DRIPPER_STATE_MACHINE=v2
export DRIPPER_PORT=7986
python -m dripper.server
Then make requests to the API:
# Extract main content
curl -X POST "http://localhost:7986/extract" \
-H "Content-Type: application/json" \
-d '{"html": "<html>...</html>", "url": "https://example.com"}'
# Health check
curl http://localhost:7986/health
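The same request can be made from Python with the standard library. Only the `html` and `url` fields from the curl example above are assumed; the response schema and any additional fields are not:

```python
import json
import urllib.request

def build_extract_payload(html, url=None):
    # Mirror the fields used in the curl example; other fields are not assumed.
    payload = {"html": html}
    if url is not None:
        payload["url"] = url
    return json.dumps(payload).encode("utf-8")

def extract(html, url=None, endpoint="http://localhost:7986/extract"):
    """POST an extraction request to a running Dripper server."""
    req = urllib.request.Request(
        endpoint,
        data=build_extract_payload(html, url),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server:
# result = extract("<html>...</html>", "https://example.com")
```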
Configuration
Dripper Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | Required | Path to the LLM model directory |
| tp | int | 1 | Tensor parallel size for model inference |
| state_machine | str | None | State machine version: 'v1', 'v2', or None |
| use_fall_back | bool | True | Enable fallback to trafilatura on errors |
| raise_errors | bool | False | Raise exceptions on errors (vs. returning None) |
| debug | bool | False | Enable debug logging |
| early_load | bool | False | Load the model during initialization |
Environment Variables
- DRIPPER_MODEL_PATH: Path to the LLM model
- DRIPPER_STATE_MACHINE: State machine version (v1, v2, or empty)
- DRIPPER_PORT: Server port number (default: 7986)
- VLLM_USE_V1: Must be set to '0' when using a state machine
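Reading these variables with the documented defaults can be sketched as follows; the helper name is illustrative, not part of Dripper's API:

```python
import os

def load_dripper_env(environ=None):
    """Read Dripper's documented environment variables, applying defaults."""
    env = os.environ if environ is None else environ
    return {
        "model_path": env.get("DRIPPER_MODEL_PATH"),                 # required in practice
        "state_machine": env.get("DRIPPER_STATE_MACHINE") or None,   # empty string -> None
        "port": int(env.get("DRIPPER_PORT", "7986")),                # documented default
    }

cfg = load_dripper_env({"DRIPPER_MODEL_PATH": "/models/mineru-html"})
print(cfg["port"])  # -> 7986
```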
Usage Examples
Batch Processing
from dripper.api import Dripper
dripper = Dripper(config={'model_path': '/path/to/model'})
# Process multiple HTML strings
html_list = ["<html>...</html>", "<html>...</html>"]
results = dripper.process(html_list)
for result in results:
print(result.main_html)
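For large corpora it can help to feed `process` fixed-size batches rather than one huge list. A simple chunking helper (the batch size of 32 is an arbitrary choice, not a Dripper parameter):

```python
def batched(items, size):
    """Yield consecutive chunks of `items`, each with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage with the Dripper instance from above:
# for batch in batched(html_list, 32):
#     for result in dripper.process(batch):
#         print(result.main_html)
```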
Evaluation
Baseline Evaluation
python app/eval_baseline.py \
--bench /path/to/benchmark.jsonl \
--task_dir /path/to/output \
--extractor_name dripper-md \
--default_config gpu \
--model_path /path/to/model
Two-Step Evaluation
# If inferring without a state machine, set VLLM_USE_V1=1
export VLLM_USE_V1=1
# If using a state machine, set VLLM_USE_V1=0
# export VLLM_USE_V1=0
RESULT_PATH=/path/to/output
MODEL_NAME=MinerU-HTML
MODEL_PATH=/path/to/model
BENCH_DATA=/path/to/benchmark.jsonl
# Step 1: Prepare for evaluation
python app/eval_with_answer.py \
--bench $BENCH_DATA \
--task_dir $RESULT_PATH/$MODEL_NAME \
--step 1 --cpus 128 --force_update
# Step 2: Run inference
python app/run_inference.py \
--task_dir $RESULT_PATH/$MODEL_NAME \
--model_path $MODEL_PATH \
--output_path $RESULT_PATH/$MODEL_NAME/res.jsonl \
--no_logits
# Step 3: Process results
python app/process_res.py \
--response $RESULT_PATH/$MODEL_NAME/res.jsonl \
--answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
--error $RESULT_PATH/$MODEL_NAME/err.jsonl
# Step 4: Evaluate with answers
python app/eval_with_answer.py \
--bench $BENCH_DATA \
--task_dir $RESULT_PATH/$MODEL_NAME \
--answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
--step 2 --cpus 128 --force_update
MinerU Ecosystem & Cloud API (No GPU Required)
MinerU-HTML is part of the broader MinerU Ecosystem. If you don't have local GPU resources, or if you want to seamlessly integrate HTML/PDF/Document extraction into your existing workflows, you can use our official Cloud API, SDKs, and RAG integrations.
Command Line API
# Windows (PowerShell)
irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex
# macOS / Linux
curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh
# Precision extract — token required
mineru-open-api auth
mineru-open-api extract webpage.html -o ./output/ # local file
mineru-open-api crawl https://mineru.net/apiManage/docs -o ./output/ # crawl from URL
Python SDK
# pip install mineru-open-sdk
from mineru import MinerU
# Precision mode — tables, formulas, large files
client = MinerU("your-token") # https://mineru.net/apiManage/token
result = client.extract("webpage.html") # local file
result = client.crawl("https://mineru.net/apiManage/docs") # crawl from URL
print(result.markdown)
RAG — LangChain
# pip install langchain-mineru
from langchain_mineru import MinerULoader
# Precision mode — full RAG pipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
docs = MinerULoader(source="article.html", mode="precision", token="your-token",
formula=True, table=True).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vectorstore.similarity_search("key requirements", k=3)
RAG — LlamaIndex
llama-index-readers-mineru is an official LlamaIndex Reader supporting multi-format document extraction.
# pip install llama-index-readers-mineru
from llama_index.readers.mineru import MinerUReader
# Precision mode — OCR, formula, table
docs = MinerUReader(mode="precision", token="your-token",
ocr=True, formula=True, table=True).load_data("complex_article.html")
# Full RAG pipeline
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(docs)
response = index.as_query_engine().query("Summarize the key content of this page")
print(response)
MCP Server (Claude Desktop · Cursor · Windsurf)
mineru-open-mcp lets any MCP-compatible AI client parse web pages and documents as a native tool.
{
"mcpServers": {
"mineru": {
"command": "uvx",
"args": ["mineru-open-mcp"],
"env": { "MINERU_API_TOKEN": "your-token" }
}
}
}
Project Structure
Dripper/
├── dripper/ # Main package
│ ├── api.py # Dripper API class
│ ├── server.py # FastAPI server
│ ├── base.py # Core data structures
│ ├── exceptions.py # Custom exceptions
│ ├── inference/ # LLM inference modules
│ │ ├── inference.py # Generation functions
│ │ ├── prompt.py # Prompt generation
│ │ ├── logits.py # Response parsing
│ │ └── logtis_processor/ # State machine logits processors
│ ├── process/ # HTML processing
│ │ ├── simplify_html.py
│ │ ├── map_to_main.py
│ │ └── html_utils.py
│ ├── eval/ # Evaluation modules
│ │ ├── metric.py # ROUGE and item-level metrics
│ │ ├── eval.py # Evaluation functions
│ │ ├── process.py # Processing utilities
│ │ └── benckmark.py # Benchmark data structures
│ └── eval_baselines/ # Baseline extractors
│ ├── base.py # Evaluation framework
│ └── baselines/ # Extractor implementations
├── app/ # Application scripts
│ ├── eval_baseline.py # Baseline evaluation script
│ ├── eval_with_answer.py # Two-step evaluation
│ ├── run_inference.py # Inference script
│ └── process_res.py # Result processing
├── requirements.txt # Core Python dependencies (auto-installed)
├── baselines.txt # Optional dependencies for baseline extractors
├── LICENCE # Apache License 2.0
├── NOTICE # Copyright and attribution notices
└── setup.py # Package setup (handles dependency installation)
Supported Extractors
Dripper supports various baseline extractors for comparison:
- Dripper (dripper-md, dripper-html): The main LLM-based extractor
- Trafilatura: Fast and accurate content extraction
- Readability: Mozilla's readability algorithm
- BoilerPy3: Python port of Boilerpipe
- NewsPlease: News article extractor
- Goose3: Article extractor
- GNE: General News Extractor
- Crawl4ai: AI-powered web content extraction
- And more...
Evaluation Metrics
- ROUGE Scores: ROUGE-N precision, recall, and F1 scores
- Item-Level Metrics: Per-tag-type (main/other) precision, recall, F1, and accuracy
- HTML Output: Extracted main HTML for visual inspection
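For intuition, ROUGE-N precision, recall, and F1 reduce to clipped n-gram overlap counting. A minimal sketch over token lists (Dripper's own implementation in `dripper/eval/metric.py` is authoritative and may differ in details such as tokenization):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall, and F1 for two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge_n("the main content".split(), "the main page content".split())
print(round(p, 2), round(r, 2))  # -> 1.0 0.75
```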
Development
Running Tests
# Add test commands here when available
Code Style
The project uses pre-commit hooks for code quality. Install them:
pre-commit install
Troubleshooting
Common Issues
- VLLM_USE_V1 Error: When using the state machine, ensure VLLM_USE_V1=0 is set:
export VLLM_USE_V1=0
- Model Loading Errors: Verify the model path and ensure sufficient GPU memory is available.
- Import Errors: Ensure the package is properly installed:
# Reinstall the package (dependencies from requirements.txt are installed automatically)
pip install -e .
# If you need baseline extractors for evaluation:
pip install -e .[baselines]
License
This project is licensed under the Apache License, Version 2.0. See the LICENCE file for details.
Copyright Notice
This project contains code and model weights derived from Qwen3. Original Qwen3 Copyright 2024 Alibaba Cloud, licensed under Apache License 2.0. Modifications and additional training Copyright 2025 OpenDatalab Shanghai AILab, licensed under Apache License 2.0.
For more information, please see the NOTICE file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
- Built on top of vLLM for efficient LLM inference
- Uses Trafilatura for fallback extraction
- Finetuned on Qwen3
- Inspired by various HTML content extraction research