Working Process¶

데이터 수집하기
수집데이터 전처리 NDBMS
Token 의 정리 (wordCloud)
N-Gram 의 정리 (메뉴명 PMI)
Clustering (군집화 묶기)
Token 간 관계 분석하기
데이터 정리 및 모델링

0 Matplotlib 폰트 연결하기¶

https://programmers.co.kr/learn/courses/21/lessons/950

matplotlib 설정파일 찾기
시스템 설치된 폰트내용 찾기
시스템 폰트를 matplotlib 와 연결하기

In [1]:

# Python 에 설치된 matplotlib 설정파일 찾기
import matplotlib as mpl
mpl.matplotlib_fname()

Out[1]:

'/home/momukji/Python/python/lib/python3.6/site-packages/matplotlib/mpl-data/matplotlibrc'

In [2]:

import matplotlib.font_manager as fm
[(f.name, f.fname) for f in fm.fontManager.ttflist if 'ubuntu' in f.name.lower()]

Out[2]:

[('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-BI.ttf'),
 ('Ubuntu Mono', '/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-MI.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-R.ttf'),
 ('Ubuntu Mono', '/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf'),
 ('Ubuntu Mono derivative Powerline',
  '/home/momukji/.local/share/fonts/Ubuntu Mono derivative Powerline.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-LI.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-M.ttf'),
 ('Ubuntu Mono derivative Powerline',
  '/home/momukji/.local/share/fonts/Ubuntu Mono derivative Powerline Italic.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-B.ttf'),
 ('Ubuntu Mono', '/usr/share/fonts/truetype/ubuntu/UbuntuMono-RI.ttf'),
 ('Ubuntu Mono derivative Powerline',
  '/home/momukji/.local/share/fonts/Ubuntu Mono derivative Powerline Bold.ttf'),
 ('Ubuntu Mono derivative Powerline',
  '/home/momukji/.local/share/fonts/Ubuntu Mono derivative Powerline Bold Italic.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-L.ttf'),
 ('Ubuntu', '/usr/share/fonts/truetype/ubuntu/Ubuntu-RI.ttf'),
 ('Ubuntu', '/home/momukji/.fonts/Ubuntu.ttf'),
 ('Ubuntu Mono', '/usr/share/fonts/truetype/ubuntu/UbuntuMono-BI.ttf'),
 ('Ubuntu Condensed', '/usr/share/fonts/truetype/ubuntu/Ubuntu-C.ttf')]

In [3]:

# 시스템 설치된 폰트파일 찾는 함수
font_list = fm.findSystemFonts(fontpaths=None, fontext='ttf')
# font_list_mac = fm.OSXInstalledFonts()  # Mac 에서 폰트파일 찾기
# print(len(font_list))

# 폰트파일에서 원하는 이름 필터링
def font_name_check(check_name, font_list=None):
    font_titles = [_.split('/')[-1].split(".")[0]  for _ in font_list]
    for _ in font_titles:
        if _.lower().find(check_name.lower()) != -1: print(_)

font_name_check("ubuntu", font_list)

Ubuntu-M
UbuntuMono-BI
Ubuntu Mono derivative Powerline Bold Italic
Ubuntu Mono derivative Powerline
Ubuntu-RI
Ubuntu-B
Ubuntu-LI
UbuntuMono-B
Ubuntu-C
Ubuntu
Ubuntu-BI
Ubuntu Mono derivative Powerline Italic
Ubuntu-R
Ubuntu-MI
Ubuntu-L
Ubuntu Mono derivative Powerline Bold
UbuntuMono-R
UbuntuMono-RI

1 sheet Names¶

엑셀내 시트내용 살펴보기

In [1]:

import xlrd, re
xls_file = r'backup/menuData_muyong.xlsx'
xls = xlrd.open_workbook(xls_file, on_demand=True)
sheetName = xls.sheet_names()
print(sheetName[:5])

from tqdm import tqdm
shtList = []
for _ in tqdm(sheetName):
    chk = re.findall(r"(\d+).(\d+)[-.](\d+)",_)
    if chk:
        if len("".join(chk[0])) < 5: print(_)
        else: shtList.append(_)
    else: print(_)
        
len(shtList)

100%|██████████| 253/253 [00:00<00:00, 237821.36it/s]

['업체별단가 및 사모님이 사오시는 품목', '사용량', '18년1.16-24', '18년2.20-', '18.03.20']
업체별단가 및 사모님이 사오시는 품목
사용량

Out[1]:

In [2]:

# merge cell 때문에 전체가 보이질 않는다 ㅠㅠ..
import pandas as pd
pd.read_excel(xls_file, sheetName=shtList[5]).head(2)

Out[2]:

	품명	신용축산	비앤피	사모님이 사오시는품목
0	NaN	031 426 8833	010 3805 4938	NaN
1	목전지(제육)	4800	4700	굴소스

2 lambda 수식의 활용¶

큰 데이터를 처리시, for 문에 비해 훨씬 효과가 빠르다

In [3]:

# lambda (Parametor : Return) 계산
add = lambda x, y: x + y
add(3, 5)

Out[3]:

In [4]:

# sort() 에서 lambda 정렬
a = [(1, 2), (4, 1), (9, 10), (13, -3)]
a.sort(key=lambda x: x[1])
a

Out[4]:

[(13, -3), (4, 1), (1, 2), (9, 10)]

In [5]:

list1, list2 = (1,2,3,4,5), (10,20,30,40,50)
data = zip(list1, list2)
data

Out[5]:

<zip at 0x7f2debcd8548>

In [6]:

# data.sort()  # .sort는 오류를 발생한다
# data
list1, list2 = map(lambda t: list(t), zip(*data))
print("list1 : {}\nlist2 : {}".format(list1, list2))

list1 : [1, 2, 3, 4, 5]
list2 : [10, 20, 30, 40, 50]

In [7]:

# itertools 을 활용
list_1 = [1,5,4]
list_2 = [2,3,4]

# using list comprehensions
import itertools
comparisons = [a == b for (a, b) in itertools.product(list_1, list_2)]
sums        = [a + b  for (a, b) in itertools.product(list_1, list_2)]
print("값 비교 : {}\n값의 합 : {}".format(comparisons,sums))

값 비교 : [False, False, False, False, False, False, False, False, True]
값의 합 : [3, 4, 5, 7, 8, 9, 6, 7, 8]

In [8]:

lst1 = [1,5,4]
lst2 = [2,3,4]
lst3 = list(map(lambda x: x[0] == x[1] , itertools.product(lst1,lst2)))
print('lst3 Row data : {}\nCounting "True" data : {}'.format(lst3, sum(lst3)))

lst3 Row data : [False, False, False, False, False, False, False, False, True]
Counting "True" data : 1

Process Goal¶

여러가지 작업방법을 사용하여 원하는 사전을 구축하는 것!!!
결국은 사전을 구축하며 내용을 확인 가능해야 Huristic 개념에서 Algorithm 으로 넘어갈 수 있다
음식과 메뉴 관련된 단어들만 추출가능하도록 모듈 구성하기
Token 들 중 음식과 관련된 단어들만 찾기
유효한 값 2000개 중 음식과 관련 없는 것들만 제거하기

1 NLP 전처리¶

문장과 단어의 유사도 및 성격 분석및 측정하기

In [9]:

text1 = "자연 언어에 대한 연구는 오래전부터 이어져 오고 있음에도 2018년까지도 사람처럼 이해하지는 못한다.".split()
text2 = "자연 언어에 대한 연구는 오래전부터 이어져 들어서도 아직 컴퓨터가 사람처럼 이해하지는 못한다.".split()
text3 = "자연 아직 컴퓨터가 언어에 들어서도 못한다 이어져 사람처럼 이해하지는 대한 연구는 오래전부터.".split()
len(text1), len(text2), len(text3)

Out[9]:

(12, 12, 12)

In [10]:

from nltk.metrics import edit_distance
print(edit_distance('파이썬 알고리즘', '파파미 알탕'))

print('생략된 단어가 다를 때 : {} \n어휘 순서를 바꿨을 때 : {}'.format(
    edit_distance(text1, text2), 
    edit_distance(text2, text3)))

5
생략된 단어가 다를 때 : 3 
어휘 순서를 바꿨을 때 : 10

In [11]:

# 02 accuracy 정확도 측정
from nltk.metrics import accuracy
accuracy('파이썬', '파이프')

Out[11]:

0.6666666666666666

In [12]:

print('생략된 단어가 다를 때 {:.4} \n어휘 순서를 바꿨을 때 {:.4}'.format(
    accuracy(text1, text2), 
    accuracy(text2, text3)))

생략된 단어가 다를 때 0.75 
어휘 순서를 바꿨을 때 0.08333

In [13]:

text1 = set(text1)
text2 = set(text2)
text3 = set(text3)
print(len(text1), len(text2), len(text3))
from nltk.metrics import precision
precision({'파이썬'}, {'파르썬'})

12 12 12

Out[13]:

0.0

In [14]:

print('생략된 단어가 다를 때 {:.4} \n어휘 순서를 바꿨을 때 {:.4}'.format(
    precision(set(text1), set(text2)), 
    precision(set(text2), set(text3))))

생략된 단어가 다를 때 0.75 
어휘 순서를 바꿨을 때 0.8333

In [15]:

from nltk.metrics import recall
print('생략된 단어가 다를 때 {:.4} \n어휘 순서를 바꿨을 때 {:.4}'.format(
    recall(text1, text2), 
    recall(text2, text3)))

생략된 단어가 다를 때 0.75 
어휘 순서를 바꿨을 때 0.8333

2 데이터 불러오기¶

음식데이터는 Mecab() 보다 Okt() 가 더 효과적인 단어목록을 포함

제목 단어들을 묶어서 word Piece Model 로 유효단어 추출하기
유사한 그룹은 편집거리를 활용하여 묶기

In [16]:

import pandas as pd
df = pd.read_csv('data/menu_1000recipe.csv', sep="|")
df = df.fillna("") # NaN 데이터 전처리를 해야 다음 과정들이 진행
titles = df.Menu.tolist()
titles = " ".join(titles)

# with open('menuTitles.txt', 'w') as f:
#     f.write(menuTitle)
titles[:200]

Out[16]:

'호불호 없는 토마토파스타는 역시 ~! 토마토냉파스타 ★ 달콤 촉촉 달걀 푸딩 만들기, 만드는 법 ( 부드러운 달걀 요리 기본 원리 ) 손이가요 손이가~~자꾸 손이 가는 [가지전] 여름 무로 깔끔아삭 무채무침 흰 강낭콩 바나나 쉐이크 다이어트&아침 대용 깻잎절임 : 입맛 되살리는 초간단 반찬 소고기 장조림 만들기 간단반찬 으로 최고! 차돌박이 부추무침, 간단'

In [17]:

import re
titles = re.findall(r"[가-힣]+", titles)
titles = " ".join(titles) #.lower()
titles_raw = titles.split(" ")

from collections import Counter
Counter(titles_raw).most_common(10)

Out[17]:

[('만들기', 17314),
 ('만드는', 5224),
 ('법', 3834),
 ('회', 3741),
 ('초간단', 3565),
 ('레시피', 3394),
 ('만드는법', 2753),
 ('맛있는', 2655),
 ('좋은', 2387),
 ('맛있게', 2212)]

3 Nltk 객체를 활용한 가치있는 단어들 추출¶

# nltk 에서 분석가능한 객체를 자동으로 생성
# 식품메뉴명에는 Okt() 가 사전이 더 좋음
titles = Okt().pos(titles, stem=True)

In [18]:

%%time
from nltk import Text
from konlpy.tag import Mecab, Okt
titles_noun = Mecab().nouns(titles)
title_obj   = Text(titles_noun)

# .B() : Unique 한 Token 갯수
# .N() : 전체 Token 의 갯수
title_obj.vocab().N(),  title_obj.vocab().B(),\
int(title_obj.vocab().N()/title_obj.vocab().B())

CPU times: user 3.18 s, sys: 774 ms, total: 3.95 s
Wall time: 4.33 s

Out[18]:

(502842, 12354, 40)

In [19]:

from nltk import collocations
finder   = collocations.BigramCollocationFinder.from_words(titles_noun)
measures = collocations.BigramAssocMeasures()
finder.nbest(measures.pmi, 10)

Out[19]:

[('감전', '호구전'),
 ('갯벌', '환경'),
 ('거성', '그룹'),
 ('검사', '외조'),
 ('계량스푼', '계량컵'),
 ('고준희', '대사'),
 ('곤봉', '약봉'),
 ('구내', '염아'),
 ('국제', '캠퍼스'),
 ('굵기', '일정')]

In [51]:

%matplotlib inline
from matplotlib import rc, rcParams
rc('font', family=['UbuntuMono-R', 'D2Coding','NanumGothic','Malgun Gothic']) # 한글의 표시
rcParams['axes.unicode_minus'] = False             # '-' 표시의 처리

import matplotlib.pyplot as plt
plt.figure(figsize=(15, 3))  # 파레트 설정
title_obj.plot(70)

In [14]:

maxLimits = 1950
title_obj.vocab().most_common(maxLimits)[(maxLimits-10):]

Out[14]:

[('월병', 26),
 ('벨기에', 26),
 ('뒷다리', 26),
 ('웰치', 26),
 ('필라프', 25),
 ('술빵', 25),
 ('여기', 25),
 ('콩물', 25),
 ('청양', 25),
 ('널', 25)]

In [15]:

" ".join([_[0] for _ in title_obj.vocab().most_common(100)])

Out[15]:

'만들기 볶음 법 요리 레시피 밥 간단 초 조림 감자 김치 맛 회 반찬 치즈 샐러드 두부 구이 매콤 백종원 전 간식 소고기 새우 오징어 볶음밥 샌드위치 황금 닭 고구마 나물 김밥 참치 간장 어묵 가지 집 불고기 계란 돼지고기 야채 소스 찜 콩나물 고추장 아이 파스타 찌개 피자 오이 살 고추 떡볶이 밑반찬 튀김 마늘 집밥 백선생 국 딸 쿠키 스테이크 멸치 음식 국물 탕 버섯 양념 부추 무 크림 치킨 토마토 떡 만두 토스트 잡채 향 냉장고 달걀 버터 카레 삼겹살 가슴 오븐 라면 말 장조림 베이컨 된장 도시락 영양 빵 그릇 케이크 식빵 방법 봄 여름 알'

4 N Gram 을 활용한 PMI 분석¶

nltk tools

음식 이외의 정보들도 포함 되어 있어서 결과해석이 용이하지 못함
음식을 직접 언급한 문장 만으로 분석하는 등의 별도 과정이 필요

In [20]:

from nltk import collocations
finder   = collocations.BigramCollocationFinder.from_words(titles_raw)
measures = collocations.BigramAssocMeasures()
finder.nbest(measures.pmi, 4)

Out[20]:

[('가까이', '왔네요'), ('가끔씩은', '초식도'), ('가나슈로', '아이싱한'), ('가나슈마카롱', '씨앗이')]

In [21]:

finder = collocations.BigramCollocationFinder.from_words(titles_noun)
measures = collocations.BigramAssocMeasures()
finder.nbest(measures.pmi, 4)

Out[21]:

[('가까이', '쑥굴국'), ('감탄사', '연발'), ('갑농', '산이'), ('개두', '릎')]

In [22]:

finder   = collocations.TrigramCollocationFinder.from_words(titles_raw)
measures = collocations.TrigramAssocMeasures()
finder.nbest(measures.pmi, 4)

Out[22]:

[('가까이', '왔네요', '쑥굴국'),
 ('가스레인지추천', '국내최초', '라면자동요리'),
 ('가스불이', '안켜져', '난감한데'),
 ('가슴속에', '차오르는', '연')]

In [23]:

finder   = collocations.TrigramCollocationFinder.from_words(titles_noun)
measures = collocations.TrigramAssocMeasures()
finder.nbest(measures.pmi, 4)

Out[23]:

[('갯벌', '환경', '지표'),
 ('거성', '그룹', '구일'),
 ('계정', '바오지', '딩'),
 ('궁보', '계정', '바오지')]