---
license: apache-2.0
---
This is a DeBERTa-v2 model pre-trained on Korean patent data.
Pre-training was performed primarily on the abstract, claims, and description text of patent documents.
It is a Korean language model that can be used for tasks such as computing patent document embeddings or classifying patent documents.
## Patent Text Embedding Calculation Example
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sample patent abstract (translated from the original Korean example text)
patent_abstract = '''The present invention relates to a patent search system and
search method, and more particularly to a patent search system and search method
that enables efficient prior-art search by providing synonyms for an entered
search term, automatically translating search terms so that searches can be run
regardless of country, or combining search terms classified into major, middle,
and minor categories.
Patent search, similarity, keyword extraction, search query'''

tokenizer = AutoTokenizer.from_pretrained("LDKSolutions/KR-patent-deberta-large")
encoded_inputs = tokenizer(
    patent_abstract,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

model = AutoModel.from_pretrained("LDKSolutions/KR-patent-deberta-large")
model.eval()

with torch.no_grad():
    # CLS pooling: take the hidden state of the first ([CLS]) token
    outputs = model(**encoded_inputs)[0][:, 0, :]

print(outputs.shape)  # [1, 2048]
```
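Embeddings obtained via CLS pooling can be compared with cosine similarity, e.g. for prior-art retrieval between two patent abstracts. A minimal sketch, using placeholder random tensors of shape `[1, 2048]` in place of actual model outputs:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for two CLS-pooled model outputs;
# in practice these would be the [1, 2048] tensors produced by the model.
torch.manual_seed(0)
emb_a = torch.randn(1, 2048)
emb_b = torch.randn(1, 2048)

# Cosine similarity along the hidden dimension -> tensor of shape [1]
sim = F.cosine_similarity(emb_a, emb_b, dim=1)
print(sim.shape)  # torch.Size([1])

# An embedding compared with itself has similarity 1.0
self_sim = F.cosine_similarity(emb_a, emb_a, dim=1)
print(round(self_sim.item(), 4))  # 1.0
```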