---
license: apache-2.0
---
This is a DeBERTa-v2 model pre-trained on a large corpus of Korean patent data.
Pre-training was performed mainly on the abstract, claims, and description sections of patent documents.
It is a Korean language model that can be used for tasks such as computing patent document embeddings or classifying patent documents (see the classification sketch below).
## Patent Text Embedding Example
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Example input: a Korean patent abstract (a patent-search system offering synonym
# suggestions, automatic query translation, and class-based query combination for
# efficient prior-art search), followed by its keywords.
patent_abstract = '''본 발명은 특허 검색 시스템 및 검색 방법에 관한 것으로, 보다 상세하게는 입력한 검색어의 동의어를 제공, 검색어를 자동으로 번역하여 국가에 상관없이 검색이 가능하도록 하거나 대분류, 중분류, 소분류 등 분류한 검색어를 조합하여 검색을 행함으로써, 효율적인 선행기술을 검색할 수 있도록 하는 특허 검색 시스템 및 검색 방법에 관한 것이다.
특허 검색, 유사도, 키워드 추출, 검색식 '''
tokenizer = AutoTokenizer.from_pretrained("LDKSolutions/KR-patent-deberta-large")
encoded_inputs = tokenizer(patent_abstract, max_length=512, truncation=True, padding="max_length", return_tensors="pt")

model = AutoModel.from_pretrained("LDKSolutions/KR-patent-deberta-large")
model.eval()

with torch.no_grad():
    outputs = model(**encoded_inputs)[0][:, 0, :]  # CLS pooling: hidden state of the first ([CLS]) token
print(outputs.shape)  # torch.Size([1, 2048])
```
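
The CLS embeddings above can be compared directly for similarity-based tasks such as prior-art search. The following is a minimal sketch, not part of the official model card: the `embed` helper is illustrative, and the two short input texts are taken from the example abstract above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LDKSolutions/KR-patent-deberta-large")
model = AutoModel.from_pretrained("LDKSolutions/KR-patent-deberta-large")
model.eval()

def embed(text):
    """Return the CLS-pooled embedding for a single text (illustrative helper)."""
    enc = tokenizer(text, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return model(**enc)[0][:, 0, :]  # shape [1, hidden_size]

# Two short texts taken from the example abstract above.
emb_a = embed("특허 검색 시스템 및 검색 방법")
emb_b = embed("특허 검색, 유사도, 키워드 추출, 검색식")
print(F.cosine_similarity(emb_a, emb_b).item())  # cosine similarity score in [-1, 1]
```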
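
## Patent Classification Example (sketch)

As noted above, the model can also serve as a backbone for patent document classification. Below is a minimal sketch, assuming a standard fine-tuning setup with Hugging Face's `AutoModelForSequenceClassification`; `num_labels=8` and the sample text are assumptions for illustration only, and the newly initialized classification head must be fine-tuned on labeled patent data before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "LDKSolutions/KR-patent-deberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=8 is illustrative; the classification head is randomly initialized
# and needs fine-tuning (e.g., with transformers' Trainer) on labeled patents.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=8)
model.eval()

text = "특허 검색 시스템 및 검색 방법"  # sample text taken from the abstract above
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 8]
print(logits.argmax(dim=-1).item())  # predicted class index (meaningful only after fine-tuning)
```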