1 XPath Syntax

01 Path Operators

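Syntax	Description
/	Selects from the root node (the start of the document); matches one level at a time, without recursion.
//	Searches recursively from the given node, matching descendants at any depth.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects an attribute node (an HTML attribute).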
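The operators in the table can be tried out on a small document. The sketch below is illustrative only: the `snippet` markup and the variable names are hypothetical and do not come from the notebook.

```python
from lxml.html import fromstring

# Hypothetical snippet, used only to illustrate the operators in the table.
snippet = """<html>
  <body>
    <div id="menu">
      <ul>
        <li><a href="/home">Home</a></li>
        <li><a href="/docs">Docs</a></li>
      </ul>
    </div>
  </body>
</html>"""
root = fromstring(snippet)

# "/" walks one level at a time, starting from the document root.
links_abs = root.xpath("/html/body/div/ul/li/a")

# "//" searches every depth below the starting point.
links_any = root.xpath("//a")

# "." is the context node, so ".//a" searches below one specific <ul>.
ul = root.xpath("//ul")[0]
links_rel = ul.xpath(".//a")

# ".." steps up to the parent of the context node.
parent_tag = ul.xpath("..")[0].tag

# "@" selects an attribute; the result is a list of strings.
hrefs = root.xpath("//a/@href")

print(len(links_abs), len(links_any), len(links_rel), parent_tag, hrefs)
```

Note that a path starting with `/` or `//` is evaluated from the document root even when it is called on a nested element, which is why the relative form `.//a` is used on `ul`.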

In [1]:

xpath = '//*[@id="mArticle"]/div[2]/ul[2]/li[1]/div[2]/div/span/text()'
xpath

Out[1]:

'//*[@id="mArticle"]/div[2]/ul[2]/li[1]/div[2]/div/span/text()'

In [2]:

html_code = """
<html>
  <body>
    <h1 class="text-muted">내가 가장 선호하는 라이브러리 Favorite Python Librarires</h1>
    <ul class="nav nav-pills nav-stacked">
      <li role="presentation"><a href="http://www.numpy.org/">넘파이 Numpy</a></li>
      <li role="presentation"><a href="http://pandas.pydata.org/">판다스 Pandas</a></li>
      <li role="presentation"><a href="http://python-requests.org/">리퀘스트 requests</a></li>
    </ul>
    <h1 class="text-success">Favorite JS Librarires</h1>
    <ul class="nav nav-tabs">
      <li role="presentation"><a href="http://getbootstrap.com/">부트스트랩 Bootstrap</a></li>
      <li role="presentation"><a href="https://jquery.com/">제이쿼리 jQuery</a></li>
      <li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
    </ul>
  </body>
</html>"""

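# Turn any literal backslash-n sequences into real newlines (a no-op for the string above).
html_code = html_code.replace("\\n","\n")
# fromstring parses the HTML string into an element tree.
from lxml.html import tostring, fromstring, HTMLParser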
doc = fromstring(html_code)
doc

Out[2]:

<Element html at 0x7fbc0c2039a0>

In [3]:

# .text : the text content of the element
# .tag : the HTML tag of the element
# .attrib : the attributes of the element, as a dict-like mapping
print(doc.xpath("/html/body/h1/text()")[0])
title = doc.xpath("/html/body/h1")[0]
title, title.text, title.tag, title.attrib

내가 가장 선호하는 라이브러리 Favorite Python Librarires

Out[3]:

(<Element h1 at 0x7fbc0c1615e0>,
 '내가 가장 선호하는 라이브러리 Favorite Python Librarires',
 'h1',
 {'class': 'text-muted'})

In [4]:

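# Six objects match this XPath (three <li> elements in each of the two <ul> lists).
item_list = doc.xpath("/html/body/ul/li")
print(item_list)

# Extract the link text of the matched objects with `text()`.
# Each <li> holds only an <a> child, so the text lives under li/a, not directly under li.
# The abbreviated "//li/a/text()" returns the same result as the full path.
doc = fromstring(html_code)
item_list1 = doc.xpath("/html/body/ul/li/a/text()")
item_list2 = doc.xpath("//li/a/text()")
item_list1 == item_list2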

[<Element li at 0x7fbc0c161db0>, <Element li at 0x7fbc0c161e00>, <Element li at 0x7fbc0c161e50>, <Element li at 0x7fbc0c161ea0>, <Element li at 0x7fbc0c1645e0>, <Element li at 0x7fbc0c164630>]

Out[4]:

True

In [5]:

title = doc.xpath("/html/body/h1[@class='text-muted']/text()")[0]
title

Out[5]:

'내가 가장 선호하는 라이브러리 Favorite Python Librarires'

In [6]:

# Extract text
item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/text()")
item_list

Out[6]:

['넘파이 Numpy', '판다스 Pandas', '리퀘스트 requests']
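One caveat about the predicate used above: `contains(@class, 'nav-stacked')` is a substring test, so it would also match a class name such as `nav-stacked-wide`. A common XPath 1.0 workaround pads the normalized class value with spaces and searches for the padded token. The sketch below uses a hypothetical fragment, not the notebook's `html_code`, to show the difference.

```python
from lxml.html import fromstring

# Hypothetical fragment: the second class value only contains
# "nav-stacked" as a substring of a longer class name.
fragment = fromstring("""<div>
  <ul class="nav nav-stacked"><li>exact token</li></ul>
  <ul class="nav nav-stacked-wide"><li>substring only</li></ul>
</div>""")

# Substring test: matches both lists.
loose = fragment.xpath("//ul[contains(@class, 'nav-stacked')]/li/text()")

# Token test: pad the class value with spaces and look for " nav-stacked ".
strict = fragment.xpath(
    "//ul[contains(concat(' ', normalize-space(@class), ' '), ' nav-stacked ')]"
    "/li/text()"
)

print(loose)   # ['exact token', 'substring only']
print(strict)  # ['exact token']
```

For the notebook's own HTML both forms return the same three items, since no other class name there contains `nav-stacked`.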

In [9]:

# Extract attribute values
item_list1 = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/@href")
item_list2 = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a")
item_list2 = list(map(lambda x : x.get('href'), item_list2))
print(item_list1 == item_list2)
item_list2

True

Out[9]:

['http://www.numpy.org/',
 'http://pandas.pydata.org/',
 'http://python-requests.org/']

03 Axis Operators

axis::node-test[predicate]

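Syntax	Description
self	The current node
attribute	Attribute nodes of the current node
namespace	Namespace nodes of the current node
child	Children of the current node
descendant	Descendants of the current node (children, their children, and so on)
descendant-or-self	The current node and its descendants
following	Nodes that appear after the end of the current node (descendants, attribute nodes, and namespace nodes excluded)
following-sibling	Siblings that appear after the current node
parent	The parent of the current node
ancestor	Ancestors of the current node
ancestor-or-self	The current node and its ancestors
preceding	All nodes that appear before the current node, excluding ancestors, attribute nodes, and namespace nodes
preceding-sibling	Siblings that appear before the current node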
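A few of these axes can be compared side by side on a small tree. This is a minimal sketch: the `markup` string, the `id` values, and the variable names are all made up for illustration.

```python
from lxml.html import fromstring

# Hypothetical document for illustrating a few axes.
markup = """<html><body>
  <div id="box">
    <h2>Title</h2>
    <p id="first">one</p>
    <p id="second">two</p>
    <p id="third">three</p>
  </div>
</body></html>"""
page = fromstring(markup)

box = page.xpath("//div[@id='box']")[0]
second = page.xpath("//p[@id='second']")[0]

# child:: is the default axis, so "child::p" and "p" are equivalent.
children = box.xpath("child::p/@id")

# Sibling axes look sideways from the context node.
after = second.xpath("following-sibling::p/@id")
before = [el.tag for el in second.xpath("preceding-sibling::*")]

# parent:: and ancestor-or-self:: look upward.
parent_id = second.xpath("parent::div/@id")
chain = [el.tag for el in second.xpath("ancestor-or-self::*")]

print(children, after, before, parent_id, chain)
```

Here `after` holds only `third`, while `chain` runs from `html` down to the `p` element itself, because `ancestor-or-self` includes the context node.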

In [ ]:

# Select the first <table> sibling that follows <div id="ipo">.
xpath = './/div[@id="ipo"]/following-sibling::table[1]'
# response = requests.get(url).text
# response_lxml = fromstring(response)
# tables = response_lxml.xpath(xpath)[0]
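Since the page behind `url` is not shown, the same expression can be tried offline. The sketch below assumes a hypothetical `sample` page standing in for the downloaded HTML; only the `xpath` string is taken from the cell above.

```python
from lxml.html import fromstring

# Hypothetical markup standing in for the downloaded page.
sample = """<html><body>
  <table><tr><td>before</td></tr></table>
  <div id="ipo">IPO</div>
  <p>note</p>
  <table><tr><td>first after</td></tr></table>
  <table><tr><td>second after</td></tr></table>
</body></html>"""
tree = fromstring(sample)

# Same expression as in the cell above.
xpath = './/div[@id="ipo"]/following-sibling::table[1]'

# [1] picks the nearest following <table> sibling; the <p> in between
# is ignored because the node test only considers <table> elements.
table = tree.xpath(xpath)[0]
cells = table.xpath(".//td/text()")

print(cells)  # ['first after']
```

The table that precedes the `div` is never considered, because `following-sibling` only looks at siblings that come later in the document.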