A quick and simple look at the statistical grammar approach
# %%time
# text = '여배우 박민영은 높은 싱크로를 보여줬다'
# from konlpy.tag import Twitter
# twitter = Twitter()
# words = twitter.pos(text, stem=True)
# print(words)
# from nltk import RegexpParser
# grammar = """
# NP: {<N.*>*<Suffix>?}  # define noun phrases
# VP: {<V.*>*}           # define verb phrases
# AP: {<A.*>*}           # define adjective phrases
# """
# parser = RegexpParser(grammar)
# parser
# chunks = parser.parse(words)
# chunks
# text_tree = [list(txt) for txt in chunks.subtrees()]
# from pprint import pprint
# pprint(text_tree[1:])
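Since the KoNLPy cell above is commented out (it needs a Java runtime and the Twitter tagger), here is a minimal self-contained chunking sketch on pre-tagged English tokens; the Penn Treebank tags and the sample sentence are illustrative assumptions, not part of the original example.
from nltk import RegexpParser
# a hand-tagged sentence, standing in for the tagger output
tagged = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('loudly', 'RB')]
grammar = """
NP: {<DT>?<JJ>*<NN.*>+}  # noun phrase: optional determiner, adjectives, nouns
VP: {<VB.*><RB>?}        # verb phrase: verb plus optional adverb
"""
parser = RegexpParser(grammar)
print(parser.parse(tagged))
# (S (NP the/DT little/JJ dog/NN) (VP barked/VBD loudly/RB))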
from nltk.grammar import toy_pcfg2
grammar = toy_pcfg2
print(grammar)
Grammar with 23 productions (start state = S)
    S -> NP VP [1.0]
    VP -> V NP [0.59]
    VP -> V [0.4]
    VP -> VP PP [0.01]
    NP -> Det N [0.41]
    NP -> Name [0.28]
    NP -> NP PP [0.31]
    PP -> P NP [1.0]
    V -> 'saw' [0.21]
    V -> 'ate' [0.51]
    V -> 'ran' [0.28]
    N -> 'boy' [0.11]
    N -> 'cookie' [0.12]
    N -> 'table' [0.13]
    N -> 'telescope' [0.14]
    N -> 'hill' [0.5]
    Name -> 'Jack' [0.52]
    Name -> 'Bob' [0.48]
    P -> 'with' [0.61]
    P -> 'under' [0.39]
    Det -> 'the' [0.41]
    Det -> 'a' [0.31]
    Det -> 'my' [0.28]
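The production weights become useful once we ask for the most likely parse; a minimal sketch with NLTK's ViterbiParser (the sentence is made up from words in this grammar's lexicon):
from nltk.parse import ViterbiParser
viterbi = ViterbiParser(grammar)
# every token must appear in the grammar's lexicon, or parsing fails
for tree in viterbi.parse(['Jack', 'saw', 'Bob', 'with', 'my', 'telescope']):
    print(tree)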
# # A taste of Earley chart parsing
# import nltk
# nltk.parse.featurechart.demo(print_times=False,
#                              print_grammar=True,
#                              parser=nltk.parse.featurechart.FeatureChartParser,
#                              sent='I saw a dog')
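A runnable counterpart to the demo above: load one of the feature-based grammars bundled with NLTK's data (assumes nltk.download('book_grammars') has been run) and chart-parse a short sentence it covers.
import nltk
# load_parser builds a FeatureChartParser for .fcfg grammars by default
cp = nltk.load_parser('grammars/book_grammars/feat0.fcfg')
for tree in cp.parse('Kim likes children'.split()):
    print(tree)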
from nltk.corpus import wordnet as wn
wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
wn.synset('frump.n.01').examples()
['she got a reputation as a frump', "she's a real dog"]
wn.synset('frump.n.01').definition()
'a dull unattractive unpleasant girl or woman'
# Uses the WordNet DB bundled with NLTK's core modules
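WordNet also encodes relations between synsets, not just glosses; a few more lookups that work with the same wn import:
dog = wn.synset('dog.n.01')
dog.hypernyms()       # more general synsets, e.g. canine.n.02
dog.lemma_names()     # the synonym lemmas grouped under this synset
dog.path_similarity(wn.synset('cat.n.01'))  # path-based similarity in (0, 1]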
! pip3 install pywsd
Requirement already satisfied: pywsd in /home/markbaum/Python/nltk/lib/python3.6/site-packages (1.1.7)
Requirement already satisfied: pandas in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (0.23.3)
Requirement already satisfied: nltk in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (3.3)
Requirement already satisfied: numpy in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (1.14.5)
Requirement already satisfied: pytz>=2011k in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2018.5)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2.7.3)
Requirement already satisfied: six in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from nltk->pywsd) (1.11.0)
sent = 'He act like a real dog'
ambiguous = 'dog'
from pywsd.lesk import simple_lesk
answer = simple_lesk(sent, ambiguous)
answer
Warming up PyWSD (takes ~10 secs)... took 2.9580390453338623 secs.
Synset('frump.n.01')
answer.definition()
'a dull unattractive unpleasant girl or woman'
sent = 'He looks like dirty dog'
ambiguous = 'dog'
answer = simple_lesk(sent, ambiguous)
answer
Synset('cad.n.01')
answer.definition()
'someone who is morally reprehensible'
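NLTK itself ships a basic Lesk implementation, so the same disambiguation can be tried without pywsd; the chosen synset may differ from simple_lesk, since the two use different overlap heuristics.
from nltk import word_tokenize
from nltk.wsd import lesk
answer = lesk(word_tokenize('He looks like dirty dog'), 'dog')
answer.definition()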
Create a token list, then use it to build an nltk.Text object
# # Samsung Electronics sustainability report
# skipword = ['갤러시', '가치창출']
# from txtutil import txtnoun
# from nltk.tokenize import word_tokenize
# texts = txtnoun("../data/kr-Report_2018.txt", skip=skipword)
# tokens = word_tokenize(texts)
# tokens[:5]
# # The nltk.Text object provides a variety of methods over the tokens
# import nltk
# ss_nltk = nltk.Text(tokens, name='2018지속성장')
# ss_nltk
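The kr-Report_2018.txt file and the txtutil helper are local to the author's machine, so here is a self-contained sketch of the same step using the Gutenberg corpus bundled with NLTK (assumes nltk.download('gutenberg')); the name emma is illustrative.
import nltk
from nltk.corpus import gutenberg
emma = nltk.Text(gutenberg.words('austen-emma.txt'), name='Emma')
emma
<Text: Emma>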
Using the object's built-in methods
# # Print the object's name
# ss_nltk.name
# # Words that form collocations with the tokens
# ss_nltk.collocations(num=30, window_size=2)
# # Words that appear in the contexts around a token
# ss_nltk.common_contexts(['책임경영'])
# # Print tokens that appear in adjacent positions
# ss_nltk.concordance('책임경영', lines=2)
# ss_nltk.concordance_list('책임경영')[1]
# # Print a token's frequency
# ss_nltk.count('책임경영')
# %matplotlib inline
# from matplotlib import rc
# rc('font', family=['NanumGothic','Malgun Gothic'])
# # Compare where and how often the given words occur
# ss_nltk.dispersion_plot(['책임경영', '경영진', '갤럭시', '갤러시', '업사이클링'])
# # Plot the token frequencies as a Matplotlib line chart
# ss_nltk.plot(10)
# ss_nltk.similar('삼성전자', num=3)
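The same methods work on any nltk.Text; a self-contained sketch trying a few of them on the Gutenberg text from the sketch above:
import nltk
from nltk.corpus import gutenberg
emma = nltk.Text(gutenberg.words('austen-emma.txt'), name='Emma')
emma.collocations(num=5)           # frequent collocation pairs
emma.concordance('Emma', lines=2)  # keyword-in-context lines
emma.count('Emma')                 # raw token frequency
emma.similar('Emma', num=3)        # words appearing in similar contexts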
Working with token objects
# # Print the tokens with the highest occurrence frequency
# ss_nltk.vocab().most_common(10)
# list(ss_nltk.vocab().keys())[:5]
# list(ss_nltk.vocab().values())[:5]
# ss_nltk.vocab().freq('삼성전자')
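vocab() returns an nltk.FreqDist, so every FreqDist method is available; a tiny self-contained sketch of the calls used above:
from nltk import FreqDist
fd = FreqDist(['dog', 'cat', 'dog', 'bird', 'dog'])
fd.most_common(2)   # [('dog', 3), ('cat', 1)]
fd['dog']           # absolute count: 3
fd.freq('dog')      # relative frequency: 3 / 5 = 0.6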