---
license: apache-2.0
---

This is a DeBERTa-v2 model pre-trained on a large corpus of Korean patent data.

Pre-training used text drawn mainly from the abstract, claims, and description sections of patent documents.

It is a Korean language model that can be used for tasks such as computing patent document embeddings or classifying patent documents.

## Patent Text Embedding Computation Example

```Python
import torch
from transformers import AutoModel, AutoTokenizer

# Sample Korean patent abstract (a patent search system and method), followed by its keywords
patent_abstract = '''본 발명은 특허 검색 시스템 및 검색 방법에 관한 것으로, 보다 상세하게는 입력한 검색어의 동의어를 제공, 검색어를 자동으로 번역하여 국가에 상관없이 검색을 가능토록 하거나 대분류, 중분류, 소분류 등 분류한 검색어를 조합하여 검색을 행함으로써, 효율적인 선행기술을 검색할 수 있도록 하는 특허 검색 시스템 및 검색 방법에 관한 것이다.

특허 검색, 유사도, 키워드 추출, 검색식 '''

tokenizer = AutoTokenizer.from_pretrained("LDKSolutions/KR-patent-deberta-large")

# Tokenize, padding or truncating the text to 512 tokens
encoded_inputs = tokenizer(patent_abstract, max_length=512, truncation=True, padding="max_length", return_tensors="pt")

model = AutoModel.from_pretrained("LDKSolutions/KR-patent-deberta-large")
model.eval()  # inference mode: disables dropout

with torch.no_grad():
    # CLS pooling: take the last hidden state of the first token
    outputs = model(**encoded_inputs)[0][:, 0, :]

print(outputs.shape)  # [1, 2048]
```
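
Once embeddings like the one above are available, a common next step is to compare two patent texts by the cosine similarity of their vectors. The following is a minimal sketch of that comparison in plain Python; the `cosine_similarity` helper and the toy 4-dimensional vectors are illustrative assumptions and are not part of the model or of the `transformers` API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption: illustrative values only)
query_vec = [1.0, 0.0, 2.0, 0.0]
doc_vec_a = [2.0, 0.0, 4.0, 0.0]  # points in the same direction as query_vec
doc_vec_b = [0.0, 3.0, 0.0, 1.0]  # orthogonal to query_vec

print(cosine_similarity(query_vec, doc_vec_a))  # close to 1: very similar
print(cosine_similarity(query_vec, doc_vec_b))  # 0: no overlap
```

With real model output, a single embedding can be turned into a list with `outputs[0].tolist()` before being passed to this helper; for batches of tensors, `torch.nn.functional.cosine_similarity` is likely the more practical choice.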