Practicing Regular Expressions¶

1 잡아라 텍스트 마이닝 with 파이썬 (Text Mining with Python)¶

Regex Tutorials

In [1]:

# Remove the parenthesized e-mail addresses from the text
string_data = """그러면서 (taken@kookmin.com) "문재인 대통령은 조국 장관 임명으로 민심을 분열시킨 책임을 엄중히 느끼고 
제대로 된 개혁 작업을 추진하기를 바란다"며 "특히 검찰개혁이 사법개혁의 전체인양 호도하지 말고 근본적인 
사법개혁을 고민하기를 바란다"고 강조했다. (bright@newsis.com)"""
token_str   = r"\([A-Za-z0-9._+]*@[A-Za-z]+\.(com|org|net|edu|co\.kr)\)"

import re
re.sub(token_str, "", string_data)

Out[1]:

'그러면서  "문재인 대통령은 조국 장관 임명으로 민심을 분열시킨 책임을 엄중히 느끼고 \n제대로 된 개혁 작업을 추진하기를 바란다"며 "특히 검찰개혁이 사법개혁의 전체인양 호도하지 말고 근본적인 \n사법개혁을 고민하기를 바란다"고 강조했다. '

In [2]:

# .findall does not return the whole match here:
# the pattern contains a capturing group, so only the group's text is returned
re.findall(token_str, string_data)

Out[2]:

['com', 'com']
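The ['com', 'com'] result comes from the capturing group (com|org|...): when a pattern contains a single group, re.findall returns that group's text rather than the whole match. A minimal sketch of one way around this, assuming the group is needed only for alternation, is to make it non-capturing with (?:...). The sample text and the name pattern_nc are illustrative and not part of the original cells.

```python
import re

text = "contact (taken@kookmin.com) or (bright@newsis.com)"

# (?:...) groups the alternatives without capturing them,
# so findall returns the whole match instead of the group.
pattern_nc = r"\([A-Za-z0-9._+]*@[A-Za-z]+\.(?:com|org|net|edu|co\.kr)\)"
full_matches = re.findall(pattern_nc, text)
print(full_matches)
```

If the group must stay capturing, iterating with re.finditer and reading match.group(0) gives the same whole-match strings.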

In [3]:

# .search returns only the first match, but as the whole matched text
re.search(token_str, string_data).group()

Out[3]:

'(taken@kookmin.com)'

In [4]:

# .finditer yields a match object for every match
# https://stackoverflow.com/questions/8110059/python-regex-search-and-findall
for match in re.finditer(token_str, string_data):
    print (match.group(0))

(taken@kookmin.com)
(bright@newsis.com)

In [5]:

# + requires at least one occurrence, * allows zero occurrences
tokenizer = re.compile("a+b*")
tokenizer.findall("aaaa, ccc, bbbb, aabbbb")

Out[5]:

['aaaa', 'aabbbb']

In [6]:

# Anchor the match at the start of the string
tokenizer = re.compile("^a..")
tokenizer.findall("abc, cba")

Out[6]:

['abc']

In [7]:

# Specify how many times a character repeats with {m,n}
tokenizer = re.compile("a{2,3}b{2,3}")
tokenizer.findall("aabb, aaabb, ab, aab")

Out[7]:

['aabb', 'aaabb']

2 김기현의 딥러닝 자연어 분석 (Kim Ki-hyun's Deep Learning Natural Language Analysis)¶

Regex Tutorials

In [8]:

# Substituting with captured groups
import re
p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<phone> \g<name>", "park 010-1234-1234"))  # refer to groups by name

p = re.compile(r"(?P<name>\w+)\s+(?P<phone>(\d+)[-]\d+[-]\d+)")
print(p.sub(r"\g<2> \g<1>", "park 010-1234-1234"))         # refer to groups by number

dot_to_nun = re.compile(r"(?P<num>(\d+))\.(?P<dot>(\d+))")   # \. matches a literal dot
dot_to_nun.sub(r"\g<num>=\g<dot>", "1232.3124")

010-1234-1234 park
010-1234-1234 park

Out[8]:

'1232=3124'

In [9]:

import re
re.findall('.+', "sales.xls, test.xls")

Out[9]:

['sales.xls, test.xls']

In [10]:

re.findall('(<br>|<br/>)', "<br> 텍스트 <br/>")

Out[10]:

['<br>', '<br/>']

In [12]:

text = "$23.24  의 가격은 $69.23 까지 상승하였습니다"
p = re.compile(r"\$[0-9.]+")
p.findall(text)

Out[12]:

['$23.24', '$69.23']

In [13]:

p = re.compile(r"(?<=\$)[0-9.]+")
p.findall(text)

Out[13]:

['23.24', '69.23']

Using Regular Expressions (Regex)¶

1 Basic Usage of the re Module¶

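re.findall() : returns every non-overlapping match of the pattern
re.search() : returns a match object for the first match; captured () groups are read with .group()
re.sub() : replaces each match with a replacement string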

In [1]:

import re
def basicregex():
    contactInfo = 'Doe, John: 1111-1212'
    line = "This is test sentence and test sentence is also a sentence."
    findallobj = re.findall(r'sentence', line)                    
    print (">>> re.findall() \n{}\n".format(findallobj))
    
    groupwiseobj = re.search(r'(\w+), (\w+): (\S+)', contactInfo) 
    print (">>> re.groups \n1st group: {}\n2nd group: {}\n3rd group: {}".format(
        groupwiseobj.group(1), groupwiseobj.group(2), groupwiseobj.group(3)))
    
    phone = "1111-2222-3333 # This is Phone Number"
    num = re.sub(r'#.*$', "", phone)
    print ("\n>>> re.sub()\nPhone Num : {}".format(num))

    contactInforevised = re.sub(r'John', "Peter", contactInfo)
    print ("\n>>> Revised contactINFO : ", contactInforevised)
    
basicregex()

>>> re.findall() 
['sentence', 'sentence', 'sentence']

>>> re.groups 
1st group: Doe
2nd group: John
3rd group: 1111-1212

>>> re.sub()
Phone Num : 1111-2222-3333 

>>> Revised contactINFO :  Doe, Peter: 1111-1212

2 .match(), .search()¶

.match() : checks for a match only at the beginning of the string
.search() : scans the string and returns the first match found anywhere

In [2]:

import re
def searchvsmatch():
    line     = "I love animals."
    matchObj = re.match(r'animals', line, re.M | re.I)
    if matchObj:
        print ("match: ", matchObj.group())
    else:
        print ("No 're.match'!!")

    searchObj = re.search(r'animals', line, re.M | re.I)
    if searchObj:
        print ("re.search: ", searchObj.group())
    else:
        print ("Nothing found!!")

searchvsmatch()

No 're.match'!!
re.search:  animals
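The difference shown above can be restated with anchors. The sketch below assumes the same sample sentence; the variable names are illustrative. It shows that re.match behaves like re.search with a \A anchor, and adds re.fullmatch, which requires the entire string to match.

```python
import re

line = "I love animals."

# re.match only tries position 0, so it behaves like re.search with \A.
at_start = re.match(r"animals", line)
anchored = re.search(r"\Aanimals", line)

# re.search scans forward until it finds a match.
anywhere = re.search(r"animals", line)

# re.fullmatch succeeds only if the pattern covers the whole string.
whole = re.fullmatch(r"I love \w+\.", line)

print(at_start, anchored)
print(anywhere.span())
print(whole.group())
```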

3 re Flags and Their Usage¶

Flags : optional settings that change how a pattern is matched
Examples of using re flags

In [3]:

import re
m1 = 'about = 12345 \n b = 444 \t c=100'
p1 = r'([a-zA-Z]\w*)\s*=\s*(\d+)'
re.findall(p1, m1)

Out[3]:

[('about', '12345'), ('b', '444'), ('c', '100')]

In [4]:

cp = re.compile(p1)
m  = re.match(cp,m1)
m.groups() # m.group(1), m.group(2)

Out[4]:

('about', '12345')

In [5]:

# re.I : ignore letter case
m1 = 'tTh cat was hungry, they were scare because of The cat'
p1 = re.compile('the')
p2 = re.compile('the', re.I)

r1 = re.findall(p1, m1)
r2 = re.findall(p2, m1)
r1, r2

Out[5]:

(['the'], ['the', 'The'])

In [6]:

# re.M : ^ and $ match at the start and end of every line
m1 = '''tTh cat was hungry,
they were scare because of The cat'''
p1 = re.compile(r'^[a-zA-Z]\w+', re.I|re.M)
r1 = re.findall(p1, m1)
r1

Out[6]:

['tTh', 'they']

In [7]:

# re.S : '.' also matches newline characters such as \n
m1 = '''tTh cat was hungry,
 they were scare because of The cat'''
p1 = re.compile('hungry,.*they', re.I)
p2 = re.compile('hungry,.*they', re.I|re.S)

r1 = re.findall(p1, m1)
r2 = re.findall(p2, m1)
r1, r2

Out[7]:

([], ['hungry,\n they'])

In [8]:

# re.X : verbose mode; whitespace and # comments inside the pattern are ignored
s = '''<html>\n<head>\n<title>title</title>\n
<body>THis is body<a href='spam.html'>spam</a>
</body>\n</head>\n</html>
'''
p = r'''\s*.*\s*.*'''
r = re.match(p, s, re.X)
r.group()

Out[8]:

'<html>\n<head>'
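The re.X flag is mainly useful for documenting a pattern inline. As a minimal sketch, the assignment pattern from the earlier flag example can be spread over several lines with a comment per piece; the name assign is illustrative.

```python
import re

# With re.X, whitespace and "#" comments in the pattern are ignored,
# so each piece of the pattern can be explained where it appears.
assign = re.compile(r"""
    ([a-zA-Z]\w*)   # variable name
    \s*=\s*         # equals sign with optional surrounding spaces
    (\d+)           # integer value
""", re.X)

pairs = assign.findall("about = 12345 \n b = 444 \t c=100")
print(pairs)
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written as `\ ` or `[ ]`.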

4 Advanced Regular Expressions¶

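(?=pattern) Positive lookahead : matches only where pattern follows; the lookahead text is not part of the match
(?<=pattern) Positive lookbehind : matches only where pattern comes immediately before
(?!pattern) Negative lookahead : matches only where pattern does not follow
(?<!pattern) Negative lookbehind : matches only where pattern does not come immediately before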

In [9]:

# Extract 'play' only where it is followed by 'ground'
re.findall(r'play(?=ground)', "on the playground")

Out[9]:

['play']

In [10]:

re.search(r'play(?=ground)', "on the playground").group()

Out[10]:

'play'

In [11]:

# The advanced regex examples collected into one function
def advanceregex(text):
    print('::"', text, '"::\n')
    positivelookaheadobjpattern = re.findall(r'play(?=ground)',text,re.M | re.I)
    print ("Positive lookahead: " + str(positivelookaheadobjpattern))
    positivelookaheadobj = re.search(r'play(?=ground)',text,re.M | re.I)
    print ("'play(?=ground)' character index: "+ str(positivelookaheadobj.span()))

    possitivelookbehindobjpattern = re.findall(r'(?<=play)ground',text,re.M | re.I)
    print ("\nPositive lookbehind: " + str(possitivelookbehindobjpattern))
    possitivelookbehindobj = re.search(r'(?<=play)ground',text,re.M | re.I)
    print ("'(?<=play)ground' character index: " + str(possitivelookbehindobj.span()))

    negativelookaheadobjpattern = re.findall(r'play(?!ground)', text, re.M | re.I)
    print ("\nNegative lookahead: " + str(negativelookaheadobjpattern))
    negativelookaheadobj = re.search(r'play(?!ground)', text, re.M | re.I)
    print ("'play(?!ground)' character index: " + str(negativelookaheadobj.span()))

    negativelookbehindobjpattern = re.findall(r'(?<!play)ground', text, re.M | re.I)
    print ("\nnegative lookbehind: " + str(negativelookbehindobjpattern))
    negativelookbehindobj = re.search(r'(?<!play)ground', text, re.M | re.I)
    print ("'(?<!play)ground' character index: " + str(negativelookbehindobj.span()))

text = "I play on playground. It is the bestground."
advanceregex(text)

::" I play on playground. It is the bestground. "::

Positive lookahead: ['play']
'play(?=ground)' character index: (10, 14)

Positive lookbehind: ['ground']
'(?<=play)ground' character index: (14, 20)

Negative lookahead: ['play']
'play(?!ground)' character index: (2, 6)

negative lookbehind: ['ground']
'(?<!play)ground' character index: (36, 42)

5 Spelling Correction¶

Applying the minimum edit distance algorithm
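As a reference point for the term, here is a minimal sketch of the classic dynamic-programming edit distance (Levenshtein distance). The function name edit_distance is hypothetical and is not used by the cells that follow, which instead generate candidate words by applying edits to the input. Unlike those cells, this sketch counts a transposition of two adjacent letters as two edits rather than one.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("aple", "able"))
print(edit_distance("correcton", "correction"))
```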

In [12]:

import re
from collections import Counter
def words(text):
    return re.findall(r'\w+', text.lower())

# Corpus : big.txt (Sherlock Holmes novels)
WORDS = Counter(words(open('./data/big.txt').read()))
WORDS.most_common(5), WORDS["able"], sum(WORDS.values())

Out[12]:

([('the', 79809), ('of', 40024), ('and', 38312), ('to', 28765), ('in', 22023)],
 201,
 1115585)

In [13]:

# Relative frequency of the token `word` in the corpus
def P(word, N=sum(WORDS.values())): 
    return WORDS[word] / N  

# Subset of `words` found in WORDS. A string is iterated character by character, so known('aple') => {'a','e','l','p'}
def known(words):
    return set(w for w in words if w in WORDS)

known('aple'), P('able')

Out[13]:

({'a', 'e', 'l', 'p'}, 0.0001801745272659636)

In [14]:

# All strings that are one edit away from `word`
def edits1(word):
    a_z        = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i],word[i:]) for i in range(len(word)+1)]     # every way to split the word in two
    deletes    = [L+R[1:]             for L,R in splits  if R]         # 1) delete one character
    transposes = [L+R[1]+R[0]+R[2:]   for L,R in splits  if len(R)>1]  # 2) swap two adjacent characters
    replaces   = [L+c+R[1:]           for L,R in splits  if R for c in a_z] # 3) replace one character with a-z
    inserts    = [L+c+R               for L,R in splits  for c in a_z] # 4) insert one character from a-z
    return set(deletes+transposes+replaces+inserts) # union of 1) through 4)

list(edits1('aple'))[:7]

Out[14]:

['waple', 'aplre', 'apje', 'apkle', 'uple', 'baple', 'ahple']

In [15]:

# "All edits that are two edits away from `word`."
def edits2(word): 
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
edits2('aple')

Out[15]:

<generator object edits2.<locals>.<genexpr> at 0x7f9b0967d518>

In [16]:

# "Generate possible spelling corrections for word."
def candidates(word): 
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

# "Most probable spelling correction for word."
def correction(word): 
    return max(candidates(word), key=P)

print (correction('aple'))
print (correction('correcton'))
print (correction('statament'))
print (correction('tutpore'))
print (correction('datactive')) # accuracy is high when the length is the same and only the letters differ

able
correction
statement
tutor
detective

6 Word Piece Model: Korean Exercise¶

Build word tokens with a defaultdict,
then find word tokens based on bigram frequency

In [17]:

# Split each word token into syllables
data = [" ".join(list(_)+["</w>"])  
        for _ in ["아버지께", "아버지를", "아버지에게", "아버지와"]]
data

Out[17]:

['아 버 지 께 </w>', '아 버 지 를 </w>', '아 버 지 에 게 </w>', '아 버 지 와 </w>']

In [18]:

data = {
    "아 버 지 께 </w>":5,
    "아 버 지 를 </w>":2,
    "아 버 지 에 게 </w>":6,
    "아 버 지 와 </w>":3,
}
data.values()

Out[18]:

dict_values([5, 2, 6, 3])

In [19]:

from collections import defaultdict
# Count adjacent token pairs (bigrams), weighted by word frequency
def findNagram(data):
    pair = defaultdict(int)
    for k, v in data.items():
        tokens = k.split()
        for i in range(len(tokens)-1):
            pair[tuple(tokens[i:i+2])] += v
    return pair

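# Merge the pair maxKey into a single token in every key of data
def mergeNgram(maxKey, data):
    newData = dict()
    for k, v in data.items():
        newKey = re.sub(re.escape(" ".join(maxKey)),  # escape regex metacharacters in the tokens
                        "".join(maxKey), k)
        newData[newKey] = v
    return newData

# maxValue limits the iterations: merging stops once no pair is more frequent than this
maxValue = max(data.values())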
maxValue

Out[19]:

6

In [20]:

# Repeat the merge step to find the distinct word units
for _ in range(1000):
    pairList = findNagram(data)
    maxKey   = max(pairList, key=pairList.get)
    #print(maxKey, pairList[maxKey])
    
    if pairList[maxKey] > maxValue:
        data = mergeNgram(maxKey, data)
    else: 
        break
print(data)
pairList

{'아버지 께 </w>': 5, '아버지 를 </w>': 2, '아버지 에 게 </w>': 6, '아버지 와 </w>': 3}

Out[20]:

defaultdict(int,
            {('아버지', '께'): 5,
             ('께', '</w>'): 5,
             ('아버지', '를'): 2,
             ('를', '</w>'): 2,
             ('아버지', '에'): 6,
             ('에', '게'): 6,
             ('게', '</w>'): 6,
             ('아버지', '와'): 3,
             ('와', '</w>'): 3})
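The merge loop above can be sketched in a self-contained form. This is an illustrative variant, not the notebook's own code: the names count_pairs and merge_pair are hypothetical, the pair is escaped with re.escape, and lookarounds ensure the pair is merged only when both symbols are whole space-delimited tokens. The loop runs two merge steps, a number chosen only for illustration.

```python
import re
from collections import defaultdict

def count_pairs(vocab):
    # Frequency-weighted counts of adjacent symbol pairs.
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Match the pair only as whole symbols: no non-space character
    # may touch it on either side.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    # A function replacement avoids backslash processing of `merged`.
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

vocab = {"아 버 지 께 </w>": 5, "아 버 지 를 </w>": 2,
         "아 버 지 에 게 </w>": 6, "아 버 지 와 </w>": 3}

for _ in range(2):
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)

print(vocab)
```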

손에 잡히는 정규표현식 (Regular Expressions Within Reach)¶

1 Working with Characters¶

In [21]:

sample = """저의 이름은 벤 입니다. 
홈페이지 주소는 https://www.forta.com 입니다
<a href='https://www.forta.com'></a> """

import re
re.findall('홈페이지', sample)

Out[21]:

['홈페이지']

In [22]:

# '.' : matches any single character (letters, digits, punctuation, spaces) except a newline
re.findall('.', sample)[:15]

Out[22]:

['저', '의', ' ', '이', '름', '은', ' ', '벤', ' ', '입', '니', '다', '.', ' ', '홈']

In [23]:

# '..' : consumes the text two characters at a time, spaces included
re.findall('..', sample)[:9]

Out[23]:

['저의', ' 이', '름은', ' 벤', ' 입', '니다', '. ', '홈페', '이지']

In [24]:

# Match a literal "." rather than the metacharacter
re.findall(r'\.', sample)

Out[24]:

['.', '.', '.', '.', '.']

In [25]:

re.findall('\n', sample)

Out[25]:

['\n', '\n']

In [26]:

# * also matches the empty string, so empty matches appear between the words
re.findall('[A-Za-z가-힣]*', sample)[:10]

Out[26]:

['저의', '', '이름은', '', '벤', '', '입니다', '', '', '']

In [27]:

# Letters plus a literal '.' ('.' needs no escape inside [])
re.findall('[A-Za-z.]+', sample)

Out[27]:

['.', 'https', 'www.forta.com', 'a', 'href', 'https', 'www.forta.com', 'a']
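A note on letter ranges in character classes: [A-z] is not the same as [A-Za-z]. The range runs over character codes 65 to 122, which also covers the six punctuation characters that sit between Z and a in ASCII. The probe string below is an illustrative assumption built to show the difference.

```python
import re

probe = "AZ[\\]^_`az"   # letters plus the six characters between 'Z' and 'a'

# [A-z] spans code points 65..122, so it also accepts [ \ ] ^ _ `
loose = re.findall(r"[A-z]", probe)
# [A-Za-z] lists the two letter ranges explicitly.
strict = re.findall(r"[A-Za-z]", probe)

print(loose)
print(strict)
```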

2 Matching with Character Sets¶

Grouping

In [28]:

sample = """문장에서 '정규식의 활용'을 
파이썬 이외에 다양한 언어에서도 활용 가능합니다"""

sample = """이번 주문은 처리가 가능합니다.
단 거주지가 <b>서울</b> 과 <b>경기도</b> 에 제한됩니다"""

# .* : greedy quantifier (matches the longest span that satisfies the pattern)
re.findall('<b>.*</b>', sample)[0]

Out[28]:

'<b>서울</b> 과 <b>경기도</b>'

In [29]:

# .*? : lazy quantifier (matches the shortest spans that satisfy the pattern)
re.findall('<b>.*?</b>', sample)

Out[29]:

['<b>서울</b>', '<b>경기도</b>']
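Another common way to stop at the first closing tag, sketched here under the assumption that the tag body contains no nested markup, is a negated character class instead of a lazy quantifier. The sample string and variable names are illustrative.

```python
import re

html = "Delivery is limited to <b>Seoul</b> and <b>Gyeonggi</b> only"

# [^<]* cannot cross a "<", so each match ends at the nearest closing tag.
bold_tags = re.findall(r"<b>[^<]*</b>", html)
# A capturing group returns only the text between the tags.
bold_text = re.findall(r"<b>([^<]*)</b>", html)

print(bold_tags)
print(bold_text)
```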

In [30]:

sample = """고양이 는 강이지와 함께 고양이 새끼들을 키운다"""
# sample = """The cat scattered his food all over the room."""
# \b is a word boundary only outside a character class; inside [] it means backspace
re.findall(r'\b고양이\b', sample)
re.findall('고양이', sample)

Out[30]:

['고양이', '고양이']

In [31]:

re.findall(r'\bcat\b', sample)

Out[31]:

[]
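The behavior of \b is easier to see on the English sentence that appears commented out in the earlier cell. The sketch below compares a word-boundary search, a plain substring search, and a pattern that puts \b inside a character class, where it stands for the backspace character instead. The variable names are illustrative.

```python
import re

sentence = "The cat scattered his food all over the room."

# Outside a character class, \b is a zero-width word boundary.
whole_word = re.findall(r"\bcat\b", sentence)
# Without boundaries, the "cat" inside "scattered" also matches.
substring = re.findall(r"cat", sentence)
# Inside [], \b means backspace, so this matches single c, a, or t characters.
in_class = re.findall(r"[\bcat\b]", sentence)

print(whole_word)
print(substring)
print(in_class[:5])
```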

In [32]:

sample = """ http://www.daun.net 또는 https://www.naver.com"""
re.findall('https?://www.*?', sample)

Out[32]:

['http://www', 'https://www']
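The matches stop at www because a lazy .*? at the very end of a pattern is satisfied by matching nothing. A minimal sketch for capturing each full address, assuming URLs are delimited by whitespace, is to consume every non-space character with \S+. The sample text is adapted from the cell above.

```python
import re

text = " http://www.daun.net or https://www.naver.com"

# \S+ takes every non-space character after the scheme,
# so each match runs to the end of the URL.
urls = re.findall(r"https?://\S+", text)
print(urls)
```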

In [33]:

sample = """ 1999-12-11 년도의 작업이 2019-01-11 작업을 진행합니다 """
re.findall('(19|20)[0-9]+', sample)

Out[33]:

['19', '20']
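Only '19' and '20' come back because findall returns the text of the capturing group. When a pattern has several groups, findall returns one tuple per match instead. The sketch below is an illustrative variation with an English sample sentence: it captures year, month, and day separately and keeps the century alternation non-capturing.

```python
import re

text = " 1999-12-11 was the first run and 2019-01-11 is the latest "

# Three capturing groups: findall yields (year, month, day) tuples.
# (?:19|20) groups the alternatives without adding a fourth element.
dates = re.findall(r"((?:19|20)\d{2})-(\d{2})-(\d{2})", text)
print(dates)
```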

In [34]:

sample = """  010-9999-3333  010-5555-3333  010-5554-2351"""
re.sub(r'(\d{3})(-)(\d{3,4})(-)(\d{3,4})', r'\g<0>', sample)  # \g<0> is the whole match, so the text is unchanged

Out[34]:

'  010-9999-3333  010-5555-3333  010-5554-2351'

In [35]:

# \g<index> : reuse a captured group in the replacement by its number
re.subn(r'(\d{3})(-)(\d{3,4})(-)(\d{3,4})', r'\g<1>:\g<3>:\g<5>', sample)

Out[35]:

('  010:9999:3333  010:5555:3333  010:5554:2351', 3)

In [36]:

# Reuse captured groups in the replacement by their group names
re.subn(r'(?P<head>\d{3})(-)(?P<mid>\d{3,4})(-)(?P<tail>\d{3,4})', r'\g<head>:\g<mid>:\g<tail>', sample)

Out[36]:

('  010:9999:3333  010:5555:3333  010:5554:2351', 3)

In [37]:

sample = """  <tiTle>제목을 입력 합니다. </TitlE> """
re.findall('<[titleTITLE]+>.*<[/titleTITLE]+>', sample)

Out[37]:

['<tiTle>제목을 입력 합니다. </TitlE>']
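The character class [titleTITLE]+ accepts any mix of those letters, so it would also match a tag such as <tilt>. A minimal sketch of a stricter alternative is to write the tag name literally and pass re.I; the English sample text is an illustrative stand-in for the one above.

```python
import re

html = "  <tiTle>Enter the title here. </TitlE> "

# re.I makes the literal tag name match in any letter case.
title = re.search(r"<title>(.*?)</title>", html, re.I)
print(title.group(1))

# The character-class version is looser than it looks.
loose = re.findall(r"<[titleTITLE]+>", "<tilt>")
print(loose)
```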

Getting Started with Practical Blog Work¶

Ingredient list (use the original data as-is)
Food list (Korean menu list data)

In [38]:

sample = """ 
    재료: 돼지고기, 파, 양파, 호박, 양배추, 전분, 면
    양념: 춘장, 식용유, 간장, 설탕
"""
re.findall(r'된장국*?\w+', sample)

Out[38]:

[]

In [39]:

# Filter file names
sample = "na1.xls na2.xls sa2.xls sam.xls sandwich.xls sax.xls"

import re
re.findall(r'[ns]a[^0-9]\.xls', sample)

Out[39]:

['sam.xls', 'sax.xls']