import streamlit as st
from transformers import pipeline, AutoTokenizer
import json
from onnxruntime import InferenceSession
import numpy as np
from huggingface_hub import hf_hub_download

# Configure the page
st.set_page_config(page_title="Chinese Spam Classifier", page_icon="🚫", layout="wide")

# Load both spam classifiers once and cache them across Streamlit reruns
@st.cache_resource
def load_classifiers():
    hf_classifier = pipeline("text-classification", model="app-x/chinese_spam_classifier")

    # Download the ONNX model from the Hugging Face Hub
    onnx_path = hf_hub_download("app-x/chinese_spam_classifier_onnx", "model_optimized.onnx")
    onnx_session = InferenceSession(onnx_path)

    tokenizer = AutoTokenizer.from_pretrained("app-x/chinese_spam_classifier_onnx")

    # Load the model config to get the id-to-label mapping
    config_path = hf_hub_download("app-x/chinese_spam_classifier_onnx", "config.json")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    id2label = config["id2label"]
    return hf_classifier, onnx_session, tokenizer, id2label

hf_classifier, onnx_session, tokenizer, id2label = load_classifiers()

st.title("🚫 Chinese Spam Classifier")
st.write("Classify Chinese text as spam or normal using two models.")

# Create a two-column layout
col1, col2 = st.columns([2, 1])

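def classify_text(text):
    # Classify with the Hugging Face pipeline
    hf_result = hf_classifier(text)[0]
    hf_label = hf_result["label"]
    hf_confidence = hf_result["score"]

    # Classify with the ONNX model, passing only the inputs the exported graph declares
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)
    input_names = {i.name for i in onnx_session.get_inputs()}
    onnx_inputs = {k: v for k, v in inputs.items() if k in input_names}
    onnx_logits = onnx_session.run(None, onnx_inputs)[0][0]

    # Assumes the first output holds raw logits (typical for exported
    # sequence-classification models); softmax turns them into probabilities
    exp = np.exp(onnx_logits - np.max(onnx_logits))
    onnx_probs = exp / exp.sum()
    # Index 1 is treated as the spam class, index 0 as normal
    onnx_label = "spam" if onnx_probs[1] > onnx_probs[0] else "normal"
    onnx_confidence = np.max(onnx_probs)

    return {
        "hf": {"label": hf_label, "confidence": hf_confidence},
        "onnx": {"label": onnx_label, "confidence": float(onnx_confidence)}
    }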

with col1:
    # Text input box
    text_input = st.text_area("Enter Chinese text:", height=200)

    if st.button("Analyze", key="classify_button"):
        if text_input.strip():
            with st.spinner("Analyzing..."):
                results = classify_text(text_input)

                for model, result in results.items():
                    st.subheader(f"{model.upper()} model result:")
                    if result["label"] == "spam":
                        st.error("⚠️ Spam")
                    else:
                        st.success("✅ Normal")
                    st.write(f"Confidence: {result['confidence']:.2f}")
                    st.write(f"Raw output: {json.dumps(result, ensure_ascii=False)}")
                    st.markdown("---")
        else:
            st.warning("Please enter some text before classifying.")

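with col2:
    st.subheader("How to use")
    st.write("""
    1. Enter the Chinese text you want to classify in the text box on the left.
    2. Click the button below the text box.
    3. The app analyzes the text with both models and shows the results.
    4. Each result includes the model's classification (spam or normal), its confidence, and the raw output as JSON.
    """)

    st.subheader("About the models")
    st.write("""
    This classifier uses two models:
    1. app-x/chinese_spam_classifier (Hugging Face model)
    2. app-x/chinese_spam_classifier_onnx (ONNX model)

    Both models were trained on a large-scale Chinese dataset to identify many types of spam.
    """)

    st.subheader("Disclaimer")
    st.info("""
    This classifier is only an aid; do not rely on its results alone.
    Stay alert and handle suspicious messages with care.
    """)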

# Footer
st.markdown("---")
st.markdown("Powered by Streamlit and Hugging Face | Author: [app-x]")