---
license: cc-by-nc-nd-4.0
---

A model built by attaching a Longformer encoder to KoBART and training it on AIHUB finance and call-center consultation dialogues, with the reference summaries of the training data generated via ChatGPT.

Example input and output:

input = """고객: μ•ˆλ…•ν•˜μ„Έμš”, μ œκ°€ μ—¬κΈ°μ„œ μ‚¬μš©ν•˜λŠ” μ‹ μš©μΉ΄λ“œμ— λŒ€ν•΄ κΆκΈˆν•œ 게 μžˆμ–΄μš”.

상담원: μ•ˆλ…•ν•˜μ„Έμš”! λ„€, μ–΄λ–€ λ¬Έμ˜κ°€ μžˆμœΌμ‹ κ°€μš”?

고객: μ œκ°€ 이번 달에 μΉ΄λ“œλ₯Ό μ‚¬μš©ν•˜λ©΄μ„œ λ¦¬μ›Œλ“œ 포인트λ₯Ό μ–Όλ§ˆλ‚˜ μŒ“μ•˜λŠ”μ§€ ν™•μΈν•˜κ³  μ‹Άμ–΄μš”.

상담원: λ„€, λ‹Ήμ‹ μ˜ λ¦¬μ›Œλ“œ 포인트 μž”μ•‘μ„ 확인해 λ“œλ¦΄ 수 μžˆμŠ΅λ‹ˆλ‹€. μ œκ°€ λ‹Ήμ‹ μ˜ μΉ΄λ“œ 번호λ₯Ό μž…λ ₯ν•˜κ³  ν™•μΈν•΄λ³Όκ²Œμš”. 번호λ₯Ό μ•Œλ €μ£Όμ‹€ 수 μžˆμ„κΉŒμš”?

고객: λ„€, 제 μΉ΄λ“œ λ²ˆν˜ΈλŠ” 1234-5678-9012-3456μž…λ‹ˆλ‹€.

상담원: κ°μ‚¬ν•©λ‹ˆλ‹€. μž μ‹œλ§Œ κΈ°λ‹€λ €μ£Όμ„Έμš”. 확인 μ€‘μ΄μ—μš”... λ„€, ν˜„μž¬ λ‹Ήμ‹ μ˜ λ¦¬μ›Œλ“œ 포인트 μž”μ•‘μ€ 3,250 ν¬μΈνŠΈμž…λ‹ˆλ‹€.

고객: μ•Œκ² μ–΄μš”, κ°μ‚¬ν•©λ‹ˆλ‹€! 그럼 좔가적인 이용 ν˜œνƒμ΄λ‚˜ 할인에 κ΄€ν•œ 정보도 얻을 수 μžˆμ„κΉŒμš”?

상담원: 물둠이죠! 저희 μΉ΄λ“œμ‚¬λŠ” λ‹€μ–‘ν•œ 이용 ν˜œνƒμ„ μ œκ³΅ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄, μ—¬ν–‰, μ‡Όν•‘, 식사 λ“± λ‹€μ–‘ν•œ λΆ„μ•Όμ—μ„œ 할인 ν˜œνƒμ„ 받을 수 μžˆκ±°λ‚˜, λ¦¬μ›Œλ“œ 포인트λ₯Ό μ‚¬μš©ν•˜μ—¬ μƒν’ˆμ΄λ‚˜ κΈ°ν”„νŠΈ μΉ΄λ“œλ‘œ κ΅ν™˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μ–΄λ–€ ν˜œνƒμ— 관심이 μžˆμœΌμ‹ κ°€μš”?

고객: μ €λŠ” μ—¬ν–‰ ν• μΈμ΄λ‚˜ λ§ˆμΌλ¦¬μ§€ 적립에 관심이 μžˆμ–΄μš”.

상담원: 그런 κ²½μš°μ—λŠ” λ‹Ήμ‹ μ—κ²Œ μ ν•©ν•œ μ—¬ν–‰ μΉ΄λ“œ ν˜œνƒμ„ μ œκ³΅ν•˜λŠ” μΉ΄λ“œλ₯Ό μΆ”μ²œν•΄ λ“œλ¦΄ 수 μžˆμŠ΅λ‹ˆλ‹€. μ—¬ν–‰ μΉ΄λ“œλŠ” 항곡사 λ§ˆμΌλ¦¬μ§€λ₯Ό μŒ“μ„ 수 있고, ν˜Έν…” 할인 ν˜œνƒμ„ 받을 μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€. μ œκ°€ λͺ‡ 가지 μ˜΅μ…˜μ„ μ œμ•ˆν•΄ λ³ΌκΉŒμš”?

고객: λ„€, 그러면 쒋을 것 κ°™μ•„μš”. κ°μ‚¬ν•©λ‹ˆλ‹€!
상담원: 말씀해 μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. 이제 μ œκ°€ λͺ‡ 가지 μΆ”μ²œμ„ λ“œλ¦¬λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€. μ–΄λ–€ 항곡사λ₯Ό 주둜 μ΄μš©ν•˜μ‹œλ‚˜μš”?"""
output ="""
- 고객이 μ‹ μš©μΉ΄λ“œμ— λŒ€ν•΄ κΆκΈˆν•œ 사항 상담
- λ¦¬μ›Œλ“œ 포인트 확인 μš”μ²­
- 상담원이 μΉ΄λ“œ λ²ˆν˜Έμ™€ μž”μ•‘ 확인 ν›„ μΆ”κ°€ 이용 ν˜œνƒ μ•ˆλ‚΄
- 고객이 μ—¬ν–‰ 할인, λ§ˆμΌλ¦¬μ§€, ν˜Έν…” 할인 λ“± λ‹€μ–‘ν•œ ν˜œνƒμ— 관심 ν‘œν˜„
"""

ν•΄λ‹Ή λͺ¨λΈμ„ ν™œμš©ν•˜κΈ° μœ„ν•΄μ„œ λ‹€μŒκ³Ό 같은 class ν•„μš”

```python
# Imports assume a recent transformers 4.x release.
import torch
import torch.nn as nn
from typing import List, Optional, Tuple

from transformers import BartConfig, BartForConditionalGeneration
from transformers.models.bart.modeling_bart import BartLearnedPositionalEmbedding
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention


class LongformerSelfAttentionForBart(nn.Module):
    def __init__(self, config, layer_id):
        super().__init__()
        self.embed_dim = config.d_model
        self.longformer_self_attn = LongformerSelfAttention(config, layer_id=layer_id)
        self.output = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,
        key_value_states: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        attention_mask: Optional[torch.Tensor] = None,
        layer_head_mask: Optional[torch.Tensor] = None,
        output_attentions: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:

        is_cross_attention = key_value_states is not None
        bsz, tgt_len, embed_dim = hidden_states.size()

        # collapse the expanded (bsz, seq_len, seq_len) attention mask to (bsz, seq_len)
        attention_mask = attention_mask.squeeze(dim=1)
        attention_mask = attention_mask[:, 0]

        is_index_masked = attention_mask < 0
        is_index_global_attn = attention_mask > 0
        is_global_attn = is_index_global_attn.flatten().any().item()

        outputs = self.longformer_self_attn(
            hidden_states,
            attention_mask=attention_mask,
            layer_head_mask=None,
            is_index_masked=is_index_masked,
            is_index_global_attn=is_index_global_attn,
            is_global_attn=is_global_attn,
            output_attentions=output_attentions,
        )

        attn_output = self.output(outputs[0])

        return (attn_output,) + outputs[1:] if len(outputs) == 2 else (attn_output, None, None)
```
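As a side note, the mask handling at the top of `forward` can be hard to follow. The standalone sketch below (not part of the model code) builds a BART-style additive mask of shape `(bsz, 1, tgt_len, src_len)` and collapses it the same way, showing that negative entries end up marking padded positions while positive entries would mark global-attention tokens.

```python
import torch

# Standalone illustration of the mask collapsing done in forward() above.
bsz, seq_len = 2, 8
expanded_mask = torch.zeros(bsz, 1, seq_len, seq_len)
# Pad (mask out) the last two positions of the second sample with a large negative value.
expanded_mask[1, :, :, 6:] = torch.finfo(torch.float32).min

collapsed = expanded_mask.squeeze(dim=1)[:, 0]  # shape (bsz, seq_len)
print(collapsed.shape)                          # torch.Size([2, 8])
print((collapsed < 0)[1])                       # last two entries True -> padded positions
print((collapsed > 0).any())                    # tensor(False) -> no global-attention tokens here
```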
```python
class LongformerEncoderDecoderForConditionalGeneration(BartForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        
        if config.attention_mode == 'n2':
            pass  # do nothing: keep the default BART self-attention
        else:

            self.model.encoder.embed_positions = BartLearnedPositionalEmbedding(
                config.max_encoder_position_embeddings, 
                config.d_model)

            self.model.decoder.embed_positions = BartLearnedPositionalEmbedding(
                config.max_decoder_position_embeddings, 
                config.d_model)

            for i, layer in enumerate(self.model.encoder.layers):
                layer.self_attn = LongformerSelfAttentionForBart(config, layer_id=i)
```
```python
class LongformerEncoderDecoderConfig(BartConfig):
    def __init__(self, attention_window: List[int] = None, attention_dilation: List[int] = None,
                 autoregressive: bool = False, attention_mode: str = 'sliding_chunks',
                 gradient_checkpointing: bool = False, **kwargs):
        """
        Args:
            attention_window: list of attention window sizes, one per layer.
                The window size is the number of attention locations on each side.
                For an effective window size of 512, use `attention_window=[256]*num_layers`,
                which is 256 on each side.
            attention_dilation: list of attention dilations, one per layer.
                An attention dilation of `1` means no dilation.
            autoregressive: use autoregressive (one-sided) attention instead of attending to both sides.
            attention_mode: 'n2' for regular n^2 self-attention, 'tvm' for the TVM implementation of Longformer
                self-attention, 'sliding_chunks' for another implementation of Longformer self-attention.
        """
        super().__init__(**kwargs)
        self.attention_window = attention_window
        self.attention_dilation = attention_dilation
        self.autoregressive = autoregressive
        self.attention_mode = attention_mode
        self.gradient_checkpointing = gradient_checkpointing
        assert self.attention_mode in ['tvm', 'sliding_chunks', 'n2']
```
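For reference, a config for these classes might be put together as below. This is only a sketch: the hyperparameter values (layer counts, window sizes, position-embedding lengths, dropout) are illustrative, not the released checkpoint's actual values, and it assumes a recent transformers 4.x release.

```python
num_layers = 6  # illustrative; not necessarily the checkpoint's depth

config = LongformerEncoderDecoderConfig(
    attention_window=[512] * num_layers,   # sliding window per encoder layer
    attention_dilation=[1] * num_layers,   # no dilation
    attention_mode='sliding_chunks',
    autoregressive=False,
    d_model=768,
    encoder_layers=num_layers,
    decoder_layers=num_layers,
    # extra fields consumed by the classes above
    max_encoder_position_embeddings=4096,  # long encoder input
    max_decoder_position_embeddings=1024,  # regular decoder length
    attention_probs_dropout_prob=0.1,      # read by LongformerSelfAttention
)

# Builds a randomly initialised long-input BART; the pretrained weights are
# loaded with from_pretrained / load_state_dict as shown below.
model = LongformerEncoderDecoderForConditionalGeneration(config)
```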

λͺ¨λΈ 였브젝트 λ‘œλ“œ ν›„ weightνŒŒμΌμ„ λ³„λ„λ‘œ λ‹€μš΄λ°›μ•„μ„œ load_state_dict둜 μ›¨μ΄νŠΈλ₯Ό λΆˆλŸ¬μ•Ό ν•©λ‹ˆλ‹€.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cocoirun/longforemr-kobart-summary-v1")
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained("cocoirun/longforemr-kobart-summary-v1")

device = torch.device('cuda')
model.load_state_dict(torch.load("summary weight.ckpt"))
model.to(device)
```
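The card does not say where `summary weight.ckpt` is published. If the file sits in the model repository itself (an assumption, not something stated above), it could be fetched with `huggingface_hub` before calling `load_state_dict`:

```python
from huggingface_hub import hf_hub_download

# Assumes the checkpoint file is hosted in the same repo under this exact
# filename; adjust repo_id/filename to wherever the weights are actually stored.
ckpt_path = hf_hub_download(
    repo_id="cocoirun/longforemr-kobart-summary-v1",
    filename="summary weight.ckpt",
)
model.load_state_dict(torch.load(ckpt_path, map_location=device))
```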

λͺ¨λΈ μš”μ•½ ν•¨μˆ˜

```python
def summarize(text, max_len):
    max_seq_len = 4096

    # tokenize and add BOS/EOS tokens
    context_tokens = ['<s>'] + tokenizer.tokenize(text) + ['</s>']
    input_ids = tokenizer.convert_tokens_to_ids(context_tokens)

    # pad up to max_seq_len, or truncate and keep a final EOS token
    if len(input_ids) < max_seq_len:
        input_ids += [tokenizer.pad_token_id] * (max_seq_len - len(input_ids))
    else:
        input_ids = input_ids[:max_seq_len - 1] + [tokenizer.eos_token_id]

    res_ids = model.generate(torch.tensor([input_ids]).to(device),
                             max_length=max_len,
                             num_beams=5,
                             no_repeat_ngram_size=3,
                             eos_token_id=tokenizer.eos_token_id,
                             bad_words_ids=[[tokenizer.unk_token_id]])

    res = tokenizer.batch_decode(res_ids.tolist(), skip_special_tokens=True)[0]
    return res.replace("\n\n", "\n")
```
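Putting it together, the example at the top of the card can be reproduced roughly as follows; `max_len=100` is an arbitrary illustrative value, not a setting taken from the card.

```python
# `input` is the example consultation dialogue defined at the top of this card.
summary = summarize(input, max_len=100)
print(summary)
```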