Language Identification
This is a language identification model trained with AllenNLP on the qgyd2021/language_identification dataset.
Accuracy on the valid (validation) split:
Language | Samples | Accuracy |
---|---|---|
af | 6221 | 0.8666 |
ar | 19808 | 0.9994 |
bg | 19913 | 0.9958 |
bn | 7396 | 0.9968 |
bs | 1653 | 0.8232 |
cs | 19122 | 0.9615 |
da | 19500 | 0.9727 |
de | 19702 | 0.996 |
el | 19455 | 0.9761 |
en | 39710 | 0.9942 |
eo | 18542 | 0.9944 |
es | 19924 | 0.9937 |
et | 19482 | 0.9727 |
fi | 19223 | 0.9554 |
fo | 4612 | 0.9697 |
fr | 19990 | 0.9957 |
ga | 19949 | 0.9973 |
gl | 508 | 0.822 |
hi | 19984 | 0.9965 |
hi_en | 1358 | 0.951 |
hr | 18840 | 0.9789 |
hu | 669 | 0.8873 |
hy | 124 | 0.9688 |
id | 4669 | 0.9968 |
is | 19795 | 0.9876 |
it | 19742 | 0.9941 |
ja | 20130 | 0.9996 |
ko | 20098 | 0.9998 |
lt | 19280 | 0.9721 |
lv | 19459 | 0.9931 |
mr | 10300 | 0.9961 |
mt | 19708 | 0.993 |
nl | 18452 | 0.9258 |
no | 19404 | 0.9714 |
pl | 19920 | 0.9973 |
pt | 19996 | 0.9946 |
ro | 19804 | 0.9944 |
ru | 20003 | 0.9954 |
sk | 19804 | 0.9861 |
sl | 19665 | 0.9926 |
sv | 18941 | 0.95 |
sw | 19768 | 0.9871 |
th | 19917 | 0.9991 |
tl | 19572 | 0.9991 |
tn | 19883 | 0.9933 |
tr | 19809 | 0.9939 |
ts | 19752 | 0.9854 |
uk | 17643 | 0.9994 |
ur | 19895 | 0.992 |
vi | 19836 | 0.9982 |
yo | 1936 | 0.9827 |
zh | 40108 | 0.9996 |
zu | 5406 | 0.9905 |
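
Accuracy varies considerably across languages: in the table above, four languages (af, bs, gl, hu) fall below 0.9. As a minimal sketch of how such rows could be flagged, the snippet below uses a handful of rows copied from the table. The 0.9 threshold and the helper names `low_accuracy` and `weighted_accuracy` are illustrative assumptions, not part of the original project.

```python
# A few (language, samples, accuracy) rows copied from the table above.
ROWS = [
    ("af", 6221, 0.8666),
    ("bs", 1653, 0.8232),
    ("en", 39710, 0.9942),
    ("gl", 508, 0.822),
    ("hu", 669, 0.8873),
    ("nl", 18452, 0.9258),
    ("zh", 40108, 0.9996),
]


def low_accuracy(rows, threshold=0.9):
    """Return the language codes whose accuracy is below `threshold`."""
    return [lang for lang, _, acc in rows if acc < threshold]


def weighted_accuracy(rows):
    """Sample-weighted mean accuracy over the given rows."""
    total = sum(n for _, n, _ in rows)
    return sum(n * acc for _, n, acc in rows) / total


if __name__ == "__main__":
    print(low_accuracy(ROWS))
    print(round(weighted_accuracy(ROWS), 4))
```

The weighted mean only covers the rows listed in `ROWS`; to get a figure for the whole validation split, all rows of the table would have to be included.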
Test code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import time

from allennlp.models.archival import load_archive
from allennlp.predictors.text_classifier import TextClassifierPredictor

from project_settings import project_path


def get_args():
    """
    Usage: python3 step_5_predict_by_archive.py
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        default="hello guy.",
        type=str
    )
    parser.add_argument(
        "--archive_file",
        default=(project_path / "trained_models/language_identification").as_posix(),
        type=str
    )
    args = parser.parse_args()
    return args


def main():
    args = get_args()

    # Load the trained model and its dataset reader from the archive.
    archive = load_archive(archive_file=args.archive_file)
    predictor = TextClassifierPredictor(
        model=archive.model,
        dataset_reader=archive.dataset_reader,
    )

    json_dict = {
        "sentence": args.text
    }

    # Time only the prediction call, not the model loading.
    begin_time = time.time()
    outputs = predictor.predict_json(
        json_dict
    )

    # Predicted language code and the probability assigned to it.
    label = outputs["label"]
    prob = round(max(outputs["probs"]), 4)
    print(label)
    print(prob)
    print('time cost: {}'.format(time.time() - begin_time))


if __name__ == '__main__':
    main()
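
The script above prints only the best label and its probability. For short or ambiguous inputs it may help to look at the runner-up languages as well. The following is a minimal sketch, not part of the original script: `top_k_labels` is a hypothetical helper, and it assumes you have a mapping from class index to language code. With AllenNLP such a mapping can usually be obtained via `archive.model.vocab.get_index_to_token_vocabulary("labels")`, but the `"labels"` namespace is an assumption about how this model was configured.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_labels(
    probs: Sequence[float],
    index_to_label: Dict[int, str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(index_to_label[i], round(p, 4)) for i, p in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical values standing in for outputs["probs"] and the label vocabulary.
    example_probs = [0.02, 0.90, 0.08]
    example_labels = {0: "de", 1: "en", 2: "nl"}
    print(top_k_labels(example_probs, example_labels, k=2))
```

In the script, `probs` would be `outputs["probs"]` from `predictor.predict_json`.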
requirements.txt
allennlp==2.10.1
allennlp-models==2.10.1
torch==1.12.1
overrides==1.9.0
pytorch_pretrained_bert==0.6.2
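
Because the versions above are pinned exactly, a mismatched environment is a likely source of loading errors. As a rough sketch (the helper names `parse_pins` and `find_mismatches` are hypothetical, and the script assumes Python 3.8+ for `importlib.metadata`), the installed versions can be compared against the pins like this:

```python
from importlib import metadata

# The pins listed in requirements.txt above.
PINS = """\
allennlp==2.10.1
allennlp-models==2.10.1
torch==1.12.1
overrides==1.9.0
pytorch_pretrained_bert==0.6.2
"""


def parse_pins(text):
    """Parse `name==version` lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, installed) in find_mismatches(parse_pins(PINS)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

The script prints nothing when every pinned package is installed at the listed version.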