Chapter 4 Tokenization¶

Korean Tokenization with KoNLPy

1 Byte Pair Encoding¶

Words are composed of smaller subword units that carry meaning.

In [1]:

# Split each word into meaning-bearing units (morphemes)
text = "자연어 처리는 인공지능의 한 줄기 입니다"

from konlpy.tag import Mecab
Mecab().morphs(text)

Out[1]:

['자연어', '처리', '는', '인공지능', '의', '한', '줄기', '입니다']
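The cell above uses MeCab morphological analysis rather than byte pair encoding itself. As a rough illustration of the BPE idea, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy vocabulary. The corpus counts, the `</w>` end-of-word marker, and the helper names `count_pairs` and `merge_pair` are assumptions made for this example and are not part of the original notebook.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
toy_vocab = {
    tuple("low") + ("</w>",): 5,
    tuple("lower") + ("</w>",): 2,
    tuple("newest") + ("</w>",): 6,
    tuple("widest") + ("</w>",): 3,
}

merges = []
for _ in range(3):
    pairs = count_pairs(toy_vocab)
    best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
    toy_vocab = merge_pair(best, toy_vocab)
    merges.append(best)

print(merges)
```

Each merge adds one new subword symbol, so the number of merges controls the final vocabulary size.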

Chapter 5 Similarity and Ambiguity¶

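1 Identifying Polysemes and Synonyms¶

Use a Word2Vec model trained on Wikipedia: word2vec PreTrained Model
Use a WordNet dictionary: Pusan National University KorLex
Use an RNN model

2 Identifying Polysemes and Synonyms with TF-IDF¶

Reference: TF-IDF material from the 참신러닝 Git blog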

In [1]:

# Load the text data
import json
with open("../backup/torch_data.json", "r", encoding="utf-8") as f:
    docs = json.load(f)
doc1 = docs['doc1']
doc2 = docs['doc2']
doc3 = docs['doc3']

In [2]:

# Count the frequency of each token in a document
import pandas as pd
def get_term_frequency(document, word_dict=None):
    if word_dict is None:
        word_dict = {}
    words = document.split()

    for w in words:
        # word_dict may be a dict or a Series returned by an earlier call
        word_dict[w] = word_dict.get(w, 0) + 1
    return pd.Series(word_dict).sort_values(ascending=False)

get_term_frequency(doc1)

Out[2]:

.     16
는     15
들     14
,     10
하     10
      ..
정상     1
면      1
이타     1
데      1
지능     1
Length: 186, dtype: int64

In [4]:

# Count how many documents each token appears in (document frequency)
def get_document_frequency(documents):
    dicts, df = [], {}
    vocab = set([])
    for d in documents:
        tf    = get_term_frequency(d)
        dicts += [tf]
        vocab = vocab | set(tf.keys())
    
    for v in list(vocab):
        df[v] = 0
        for dict_d in dicts:
            if dict_d.get(v) is not None:
                df[v] += 1
    return pd.Series(df).sort_values(ascending=False)

get_document_frequency([doc1, doc2])

Out[4]:

해서    2
한     2
기     2
겠     2
죠     2
     ..
데     1
평균    1
건     1
자신    1
풀     1
Length: 311, dtype: int64

In [5]:

# Compute the tf-idf of every token across all documents
import numpy as np

def get_tfidf(docs):
    vocab, tfs = {}, []
    for d in docs:
        vocab =  get_term_frequency(d, vocab)
        tfs  += [get_term_frequency(d)]
    df = get_document_frequency(docs)

    stats = []
    for word, freq in vocab.items():
        tfidfs = []
        for idx in range(len(docs)):
            if tfs[idx].get(word) is not None:
                tfidfs += [tfs[idx][word] * np.log(len(docs) / df[word])]
            else: tfidfs += [0]
        stats.append((word, freq, *tfidfs, max(tfidfs)))
    
    _col = ('word','frequency','doc1','doc2','doc3','max')
    return pd.DataFrame(stats, columns=_col
            ).sort_values('max', ascending=False).reset_index(drop=True)

get_tfidf([doc1, doc2, doc3]).head()

Out[5]:

	word	frequency	doc1	doc2	doc3	max
0	남자	9	9.887511	0.000000	0.000000	9.887511
1	요인	6	0.000000	6.591674	0.000000	6.591674
2	심리학	5	5.493061	0.000000	0.000000	5.493061
3	었	4	0.000000	0.000000	4.394449	4.394449
4	제	4	0.000000	0.000000	4.394449	4.394449
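The weighting in `get_tfidf` is the raw term count multiplied by ln(N/df), where N is the number of documents and df is the number of documents containing the token. For example, 남자 occurs 9 times in only one of three documents, giving 9 × ln 3 ≈ 9.8875 as in the table above. A minimal standalone sketch on three hypothetical toy documents (the documents and the name `tfidf` are made up for illustration) shows how a token that occurs in every document scores zero:

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents
toy_docs = ["the cat sat", "the dog sat", "the dog barked"]
tfs = [Counter(d.split()) for d in toy_docs]

# Document frequency: number of documents that contain each token
df = Counter(tok for tf in tfs for tok in tf)

def tfidf(token, doc_index):
    """Raw count times ln(N / df), the same weighting as get_tfidf."""
    return tfs[doc_index][token] * math.log(len(toy_docs) / df[token])

print(tfidf("the", 0))   # token that appears in all documents
print(tfidf("cat", 0))   # token that appears in one document
```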

In [6]:

# Build tf vectors for all documents (useful for comparison with tf-idf)
def get_tf(docs):
    vocab, tfs = {}, []
    for d in docs:
        vocab = get_term_frequency(d, vocab)
        tfs  += [get_term_frequency(d)]

    stats = []
    for word, freq in vocab.items():
        tf_v = []
        for idx in range(len(docs)):
            if tfs[idx].get(word) is not None:
                tf_v += [tfs[idx][word]]
            else: tf_v += [0]
        stats.append((word, freq, *tf_v))
    _col = ('word','frequency','doc1','doc2','doc3')
    return pd.DataFrame(stats, columns=_col).sort_values('frequency', ascending=False)

get_tf([doc1, doc2, doc3]).head()

Out[6]:

	word	frequency	doc1	doc2	doc3
0	는	47	15	14	18
1	을	39	8	10	21
2	.	36	16	10	10
3	하	33	10	9	14
4	이	32	8	8	16

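3 Using Co-occurrence Information from a Context Window¶

Download the materials from https://github.com/juneoh/fastcampus-pytorch-nlp

This approach uses words that occur together in a document, similar to collecting PMI over n-grams.
Windowing slides a fixed-size window over the text and collects information about the units inside it.
Context window: the window used to collect this information; counting the neighboring words inside it yields a co-occurrence frequency matrix.
The window size is a hyperparameter that the user must choose.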

In [3]:

%%time
from collections import defaultdict
import pandas as pd
with open('../backup/ted.aligned.ko.refined.tok.txt') as f:
    lines = [l.strip() for l in f.read().splitlines() if l.strip()]

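# Build the context-window co-occurrence counts
def get_context_counts(lines, w_size=2):
    co_dict   = defaultdict(int)
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            # Caveat: for i < w_size the start index is negative and wraps
            # around, and the slice excludes the token at i + w_size.
            for c in words[i - w_size:i + w_size]:
                if w != c: co_dict[(w, c)] += 1
    return pd.Series(co_dict)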

co_dict = get_context_counts(lines)
co_dict

CPU times: user 19.3 s, sys: 714 ms, total: 20 s
Wall time: 20 s

Out[3]:

웹    현재        1
     TED       2
     사이트     350
사이트  TED       3
     웹       365
            ... 
산책   바퀴벌레      1
말벌   시키        1
대해   말벌        1
기생충  안         1
생각   기생충       1
Length: 2706519, dtype: int64
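In `get_context_counts`, the slice `words[i - w_size:i + w_size]` starts at a negative index for the first `w_size` tokens, which Python interprets as counting from the end of the list, and it stops one token short on the right. A hedged alternative, assuming a symmetric window of `w_size` tokens on each side is what is wanted, clamps the start index and extends the end by one. The function name `get_symmetric_context_counts` and the toy sentence are assumptions for illustration; the outputs shown in this notebook were produced by the original function.

```python
from collections import defaultdict

def get_symmetric_context_counts(lines, w_size=2):
    """Count (word, context) pairs within w_size tokens on each side."""
    counts = defaultdict(int)
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            start = max(0, i - w_size)               # clamp: no wrap-around
            end = min(i + w_size + 1, len(words))    # include the right edge
            for j in range(start, end):
                if j != i and words[j] != w:
                    counts[(w, words[j])] += 1
    return dict(counts)

toy = get_symmetric_context_counts(["a b c d e"], w_size=1)
print(sorted(toy))
```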

In [8]:

# Term frequencies over the whole corpus
tfs = get_term_frequency(' '.join(lines))
tfs

Out[8]:

.      337685
는      234692
이      224422
을      193583
은      137351
        ...  
711         1
967         1
선거제         1
아쉬웠         1
끼친다         1
Length: 62960, dtype: int64

In [9]:

# Drop the most frequent tokens (count >= 100,000), such as particles and punctuation
tfs = tfs[tfs < 100000]
tfs.head()

Out[9]:

고    98136
것    78869
한    63132
죠    56568
그    53895
dtype: int64

In [10]:

%%time
# Build a co-occurrence matrix from the counts collected above
def co_occurrence(co_dict, vocab):
    data = []
    for word1 in vocab:
        row = []
        for word2 in vocab:
            try: count = co_dict[(word1, word2)]
            except KeyError: count = 0
            row.append(count)   
        data.append(row)
    return pd.DataFrame(data, index=vocab, columns=vocab)
      
import torch
co = co_occurrence(co_dict, tfs.index[:1000])
torch.save(co, 'co.pth')  # save the result to co.pth
co.head()

CPU times: user 1min 8s, sys: 91.9 ms, total: 1min 8s
Wall time: 1min 8s

Out[10]:

	고	것	한	죠	그	입니다	우리	수	에서	었	...	시기	신호	운영	다루	갑자기	딸	신체	미국인	만일	생명체
고	0	1034	165	7	1788	8	839	1288	924	3204	...	6	8	127	103	17	10	5	11	1	5
것	364	0	5473	13	282	15019	36	4	563	1109	...	5	3	6	37	11	2	1	0	2	3
한	1670	5202	0	104	529	57	267	54	1252	41	...	75	27	5	2	20	4	20	16	8	32
죠	1307	2636	363	0	153	2	107	1391	96	5316	...	20	12	7	12	1	0	0	2	3	3
그	2939	568	219	827	0	578	1045	46	628	517	...	34	19	1	0	28	6	5	5	28	7

5 rows × 1000 columns

4 Measuring Similarity¶

$$ \text{d}_{\text{L1}}(w,v)=\sum_{i=1}^d{|w_i-v_i|},\text{ where }w,v\in\mathbb{R}^d. $$

In [12]:

import torch
def get_nearest(query, dataframe, metric, top_k, ascending=True):
    # Compare the query row against every row and print the top_k matches
    vector = torch.FloatTensor(dataframe.loc[query].values)
    distances = dataframe.apply(lambda x: metric(vector, torch.FloatTensor(x.values)), axis=1)
    top_distances = distances.sort_values(ascending=ascending)[:top_k]
    print(', '.join([f'{k} ({v:.1f})' for k, v in top_distances.items()]))

# L1 (Manhattan) distance, used below to rank words against "우리"
def get_l1_distance(x1, x2):
    return ((x1 - x2).abs()).sum()

print('L1 distance:')
get_nearest('우리', co, get_l1_distance, 30)

L1 distance:
우리 (0.0), 저 (24246.0), 제 (26486.0), 여러분 (30102.0), 그 (33590.0), 그것 (34206.0), 이런 (34273.0), 이것 (34569.0), 그리고 (34732.0), 어떤 (35048.0), 나 (35285.0), ' (35615.0), -- (35832.0), 요 (35935.0), 그런 (36570.0), 당신 (36757.0), 바로 (36803.0), 여기 (36804.0), 하지만 (36811.0), 그래서 (36900.0), 어떻게 (36977.0), 다 (37061.0), 저희 (37277.0), 모든 (37366.0), 살 (37483.0), 미국 (37631.0), 새로운 (37684.0), 다른 (37693.0), 사실 (37719.0), 까지 (37754.0)

$$ \text{d}_{\text{L2}}(w,v)=\sqrt{\sum_{i=1}^d{(w_i-v_i)^2}},\text{ where }w,v\in\mathbb{R}^d. $$

In [13]:

# L2 (Euclidean) distance
def get_l2_distance(x1, x2):
    return ((x1 - x2)**2).sum()**.5

print('L2 distance:')
get_nearest('우리', co, get_l2_distance, 30)

$$ d_{\infty}(w,v)=\max(|w_1-v_1|,|w_2-v_2|,\cdots,|w_d-v_d|),\text{ where }w,v\in\mathbb{R}^d $$

In [14]:

def get_infinity_distance(x1, x2):
    return ((x1 - x2).abs()).max()

print('\nInfinity distance:')
get_nearest('우리', co, get_infinity_distance, 30)

Infinity distance:
우리 (0.0), 저 (852.0), 여러분 (1031.0), 자신 (1309.0), 모두 (1316.0), 나 (1376.0), 물 (1396.0), 당신 (1396.0), 영향 (1424.0), 스스로 (1434.0), 그녀 (1447.0), 필요 (1456.0), 어떤 (1459.0), 이런 (1465.0), 서로 (1481.0), 아이 (1483.0), 그 (1485.0), 이렇게 (1487.0), 만 (1487.0), 가장 (1499.0), 누구 (1501.0), 저희 (1501.0), 이야기 (1506.0), 로 (1506.0), 도움 (1511.0), 큰 (1512.0), 매우 (1517.0), 일 (1518.0), 질문 (1519.0), 보여 (1521.0)

$$ \begin{aligned} \text{sim}_{\text{cos}}(w,v)&=\overbrace{\frac{w\cdot v}{|w||v|}}^{\text{dot product}} =\overbrace{\frac{w}{|w|}}^{\text{unit vector}}\cdot\frac{v}{|v|} \\ &=\frac{\sum_{i=1}^{d}{w_iv_i}}{\sqrt{\sum_{i=1}^d{w_i^2}}\sqrt{\sum_{i=1}^d{v_i^2}}} \\ \text{where }&w,v\in\mathbb{R}^d \end{aligned} $$

In [15]:

def get_cosine_similarity(x1, x2):
    return (x1 * x2).sum() / ((x1**2).sum()**.5 * (x2**2).sum()**.5)

print('\nCosine similarity:')
get_nearest('우리', co, get_cosine_similarity, 30, ascending=False)

Cosine similarity:
우리 (1.0), 저희 (0.9), 저 (0.8), 그녀 (0.8), 그 (0.8), 여러분 (0.7), 제 (0.7), 당신 (0.7), 누군가 (0.7), 그것 (0.7), 우린 (0.7), 새 (0.7), 일종 (0.7), 이것 (0.7), 환자 (0.7), 하루 (0.7), 그냥 (0.6), 마치 (0.6), 전화 (0.6), 각각 (0.6), 인터넷 (0.6), 컴퓨터 (0.6), 수많 (0.6), 뭔가 (0.6), 빛 (0.6), 누가 (0.6), 로봇 (0.6), 약간 (0.6), 물론 (0.6), 나무 (0.6)

$$ \begin{aligned} \text{sim}_{\text{jaccard}}(w,v)&=\frac{|w \cap v|}{|w \cup v|} \\ &=\frac{|w \cap v|}{|w|+|v|-|w \cap v|} \\ &\approx\frac{\sum_{i=1}^d{\min(w_i,v_i)}}{\sum_{i=1}^d{\max(w_i,v_i)}} \\ \text{where }&w,v\in\mathbb{R}^d. \end{aligned} $$

In [16]:

def get_jaccard_similarity(x1, x2):
    return torch.stack([x1, x2]).min(dim=0)[0].sum() / torch.stack([x1, x2]).max(dim=0)[0].sum()

print('\nJaccard similarity:')
get_nearest('우리', co, get_jaccard_similarity, 30, ascending=False)

Jaccard similarity:
우리 (1.0), 그 (0.5), 저 (0.5), 제 (0.4), 여러분 (0.3), 말 (0.3), " (0.3), 그리고 (0.3), 사람 (0.3), 일 (0.3), 도 (0.2), 다른 (0.2), 나 (0.2), 이런 (0.2), 한 (0.2), 만 (0.2), 모든 (0.2), 어떤 (0.2), 더 (0.2), 와 (0.2), 에서 (0.2), 과 (0.2), 로 (0.2), 보 (0.2), 요 (0.2), 된 (0.2), ' (0.2), 그것 (0.2), 적 (0.2), 되 (0.2)
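To make the five measures concrete, here is a small worked example in plain Python on two hypothetical count vectors (`w` and `v` are made-up values, not rows of `co`). Smaller distances mean more similar vectors, while larger similarity scores mean more similar vectors, which is why `get_nearest` is called with `ascending=False` for cosine and Jaccard.

```python
import math

w = [1.0, 2.0, 0.0, 4.0]   # hypothetical co-occurrence counts
v = [2.0, 2.0, 1.0, 0.0]

l1 = sum(abs(a - b) for a, b in zip(w, v))
l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))
linf = max(abs(a - b) for a, b in zip(w, v))
cosine = sum(a * b for a, b in zip(w, v)) / (
    math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in v)))
jaccard = (sum(min(a, b) for a, b in zip(w, v))
           / sum(max(a, b) for a, b in zip(w, v)))

print(l1, linf)   # 6.0 4.0
print(round(l2, 3), round(cosine, 3), round(jaccard, 3))
```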

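4 Lesk Algorithm: Thesaurus (WordNet) Based Word Sense Disambiguation (WSD)¶

Word sense ambiguity can be resolved algorithmically; the task is known as word sense disambiguation (WSD).

The Lesk algorithm measures the word overlap between each sense's WordNet definition and the sentence being analyzed.
The sense whose definition is most closely related to the sentence is selected.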

In [17]:

# Requires the WordNet corpus (nltk.download('wordnet'))
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range

In [18]:

from nltk.wsd import lesk
def lesk_token(sentence, word):
    # lesk() expects the context as a list of tokens
    tokens = sentence.split()
    best_synset = lesk(tokens, word)
    return best_synset, best_synset.definition()

sentence1 = 'I went fishing last weekend and I got a bass and cooked it'
sentence2 = 'I love the music from the speaker which has strong beat and bass'
sentence3 = 'I think the bass is more important than guitar'

word = 'bass'
for sentence in [sentence1, sentence2, sentence3]:
    print(lesk_token(sentence, word))

(Synset('sea_bass.n.01'), 'the lean flesh of a saltwater fish of the family Serranidae')
(Synset('bass.n.02'), 'the lowest part in polyphonic music')
(Synset('sea_bass.n.01'), 'the lean flesh of a saltwater fish of the family Serranidae')
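The core of the Lesk idea is gloss overlap: pick the sense whose definition shares the most words with the sentence. The sketch below is a simplified version of that idea, not NLTK's implementation. It uses two of the WordNet definitions printed above as a hypothetical mini-dictionary, and the names `toy_glosses` and `simple_lesk` are invented for this example.

```python
toy_glosses = {
    "bass.n.01": "the lowest part of the musical range",
    "bass.n.08": "nontechnical name for any of numerous edible "
                 "marine and freshwater spiny-finned fishes",
}

def simple_lesk(context_sentence, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(context_sentence.lower().split())
    overlaps = {sense: len(context & set(gloss.split()))
                for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps

sense, overlaps = simple_lesk(
    "the bass line sits in the lowest part of the song", toy_glosses)
print(sense, overlaps)
```

Because the score is a plain word overlap, short contexts with few informative words can select an unexpected sense, as with the third sentence above, where a musical sense would be expected.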

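5 Philip Resnik's Selectional Preference: Word Sense Disambiguation (WSD)¶

The meaning of a word in a sentence is determined by the words around it.
When the distribution of the surrounding words (their classes) shifts sharply for a given predicate, that predicate has a strong selectional preference.
Formalizing this idea defines the algorithm.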
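Resnik formalizes selectional preference strength as the KL divergence between the distribution of headword classes given a predicate, P(C|w), and the overall class distribution P(C). A minimal sketch with hypothetical probabilities follows; the class names, the example predicates, and all numbers are invented for illustration.

```python
import math

def preference_strength(p_class_given_pred, p_class):
    """KL(P(C|w) || P(C)): how far a predicate shifts the class distribution."""
    return sum(p * math.log(p / p_class[c])
               for c, p in p_class_given_pred.items() if p > 0)

# Hypothetical prior over headword classes
p_class = {"food": 0.25, "person": 0.25, "place": 0.25, "idea": 0.25}

# A selective predicate (e.g. "eat") and an unselective one (e.g. "see")
p_eat = {"food": 0.85, "person": 0.05, "place": 0.05, "idea": 0.05}
p_see = {"food": 0.25, "person": 0.25, "place": 0.25, "idea": 0.25}

print(preference_strength(p_eat, p_class))  # large: strong preference
print(preference_strength(p_see, p_class))  # zero: same as the prior
```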

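6 Katrin Erk's Word-to-Word Similarity Measure¶

Similarity Based Method [Erk et al. 2007]

Similarity between words is measured from data, without a separate thesaurus.
The input is a tuple of a predicate $w$, a headword $h$, and the relation $R$ between the two words.
A problem with this computation is that the sparse vectors contain many zero values.
Embedding techniques address this sparsity by producing dense vectors.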

$$ (w,h,R),\text{ where }R\text{ is a relationship, such as verb-object}. $$

$$ A_R(w,h_0)=\sum_{h\in\text{Seen}_R(w)}{\text{sim}(h_0,h)\cdot \phi_R(w,h)} $$
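Read as code, $A_R(w,h_0)$ sums the similarity between a candidate headword $h_0$ and every headword $h$ previously seen with predicate $w$, weighted by $\phi_R(w,h)$. The sketch below uses hypothetical 2-dimensional word vectors and takes $\phi_R$ to be the seen count; the names `vectors`, `seen`, and `association` and all values are assumptions for illustration.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x))
                  * math.sqrt(sum(b * b for b in y)))

# Hypothetical word vectors
vectors = {
    "coffee": [1.0, 0.0], "tea": [1.0, 0.0],
    "car": [0.0, 1.0], "bus": [0.0, 1.0],
}

# Hypothetical Seen_R(w): headwords seen with each predicate, with counts
# that play the role of phi_R(w, h)
seen = {"drink": {"coffee": 3, "car": 1}}

def association(predicate, candidate):
    """A_R(w, h0) = sum over seen h of sim(h0, h) * phi_R(w, h)."""
    return sum(cosine(vectors[candidate], vectors[h]) * phi
               for h, phi in seen[predicate].items())

print(association("drink", "tea"))
print(association("drink", "bus"))
```

The candidate with the higher score ("tea" here) is the more plausible headword for the predicate.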

In [19]:

%%time
# Load the data
with open('../backup/ted.aligned.ko.refined.tok.txt') as f:
    lines = [l.strip() for l in f.read().splitlines() if l.strip()]

# Compute Seen(w): find the predicate ("VV") and headword ("NNG") in each sentence
from konlpy.tag import Mecab
def count_seen_headwords(lines, predicate='VV', headword='NNG'):
    tagger, seen_dict  = Mecab(), {}    
    for line in lines:
        pos_result     = tagger.pos(line)
        word_h, word_p = None, None
        for word, pos in pos_result:
            # Stop at the first predicate; compound tags starting with 'VV+' also count
            if pos == predicate or pos.startswith(predicate + '+'):
                word_p = word; break
            if pos == headword:
                word_h = word  # keep the last noun seen before the predicate
        
        if word_h is not None and word_p is not None:
            seen_dict[word_p] = [word_h] +\
                ([] if seen_dict.get(word_p) is None else seen_dict[word_p])            
    return seen_dict

seen_headwords = count_seen_headwords(lines)
list(seen_headwords.keys())[:10] # inspect the keys
seen_headwords[list(seen_headwords.keys())[0]][:10]

CPU times: user 47.3 s, sys: 128 ms, total: 47.4 s
Wall time: 47.4 s

Out[19]:

['사실', '촌충', '코', '시장', '소유자', '아내', '결국', '미래', '거론', '경우']

In [20]:

# Compute the selectional association score from the Seen() data
import torch
def get_selectional_association(predicate, headword, lines, dataframe, metric):
    v1    = torch.FloatTensor(dataframe.loc[headword].values)
    seens = seen_headwords[predicate]
    total = 0
    # Each occurrence in seens adds one term, so repeats act as the weight phi
    for seen in seens:
        try:
            v2     = torch.FloatTensor(dataframe.loc[seen].values)
            total += metric(v1, v2)
        except KeyError: pass  # skip headwords outside the co-occurrence vocabulary
    return total

In [21]:

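co = torch.load('co.pth')  # on PyTorch >= 2.6, pass weights_only=False to unpickle a DataFrame

# wsd: score each candidate headword for the given predicate using the results above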
def get_cosine_similarity(x1, x2):
    return (x1 * x2).sum() / ((x1**2).sum()**.5 * (x2**2).sum()**.5)

def wsd(predicate, headwords):
    selectional_associations = []
    for h in headwords:
        selectional_associations += [get_selectional_association(
            predicate, h, lines, co, get_cosine_similarity)]
    print(selectional_associations)
    
wsd('가', ['학교', '사람', '질문'])
# wsd('피우', ['담배', '맥주', '사과'])

[tensor(3265.9070), tensor(3522.6216), tensor(3642.2668)]