Data Scraping:¶

An Introduction to Requests, Beautiful Soup, and XPath¶

王成军

wangchengjun@nju.edu.cn

Computational Communication (计算传播网) http://computational-communication.com

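How a Crawler Works¶

http://www.cnblogs.com/zhaof/p/6898138.html

Problems to Solve¶

Parsing pages
Extracting source data hidden by JavaScript
Turning pages automatically
Logging in automatically
Connecting to API endpoints

For most scraping tasks, requests combined with Beautiful Soup is enough.
This is especially true when the URL changes in a regular pattern from page to page: you only need to generate the patterned URLs.
A simple example is scraping the Tianya forum posts that match a keyword.
- On the Tianya forum, the first page of posts about smog (雾霾) is: http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=%E9%9B%BE%E9%9C%BE
- The second page is: http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=%E9%9B%BE%E9%9C%BE

Only the nextid parameter changes from one page to the next.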
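The paging pattern in these two URLs can be generated in a few lines. This is a minimal sketch, assuming that nextid is a zero-based page index and that the other query parameters stay fixed, as the two URLs suggest. The helper name tianya_list_url is hypothetical.

```python
from urllib.parse import quote

BASE_URL = 'http://bbs.tianya.cn/list.jsp'


def tianya_list_url(keyword, page):
    """Build the listing URL for one page (hypothetical helper).

    Assumes `page` is the zero-based value of the `nextid` parameter.
    """
    # quote() percent-encodes the keyword as UTF-8 bytes
    return f'{BASE_URL}?item=free&nextid={page}&order=8&k={quote(keyword)}'


# URLs for the first three pages of posts about smog
urls = [tianya_list_url('雾霾', page) for page in range(3)]
for u in urls:
    print(u)
```

Each generated URL can then be passed to requests.get in a loop, ideally with a short pause between requests.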

A First Crawler¶

Beautiful Soup Quick Start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://computational-class.github.io/bigdata/data/test.html

In [1]:

import requests
from bs4 import BeautifulSoup

In [53]:

help(requests.get) 

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

In [5]:

url = 'http://computational-class.github.io/bigdata/data/test.html'
content = requests.get(url)
help(content)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  close(self)
 |      Releases the connection back to the pool. Once this method has been
 |      called the underlying ``raw`` object must not be accessed again.
 |      
 |      *Note: Should not normally need to be called explicitly.*
 |  
 |  iter_content(self, chunk_size=1, decode_unicode=False)
 |      Iterates over the response data.  When stream=True is set on the
 |      request, this avoids reading the content at once into memory for
 |      large responses.  The chunk size is the number of bytes it should
 |      read into memory.  This is not necessarily the length of each item
 |      returned as decoding can take place.
 |      
 |      chunk_size must be of type int or None. A value of None will
 |      function differently depending on the value of `stream`.
 |      stream=True will read data as it arrives in whatever size the
 |      chunks are received. If stream=False, data is returned as
 |      a single chunk.
 |      
 |      If decode_unicode is True, content will be decoded using the best
 |      available encoding based on the response.
 |  
 |  iter_lines(self, chunk_size=512, decode_unicode=None, delimiter=None)
 |      Iterates over the response data, one line at a time.  When
 |      stream=True is set on the request, this avoids reading the
 |      content at once into memory for large responses.
 |      
 |      .. note:: This method is not reentrant safe.
 |  
 |  json(self, **kwargs)
 |      Returns the json-encoded content of a response, if any.
 |      
 |      :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
 |      :raises ValueError: If the response body does not contain valid json.
 |  
 |  raise_for_status(self)
 |      Raises stored :class:`HTTPError`, if one occurred.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  apparent_encoding
 |      The apparent encoding, provided by the chardet library
 |  
 |  content
 |      Content of the response, in bytes.
 |  
 |  is_permanent_redirect
 |      True if this Response one of the permanent versions of redirect
 |  
 |  is_redirect
 |      True if this Response is a well-formed HTTP redirect that could have
 |      been processed automatically (by :meth:`Session.resolve_redirects`).
 |  
 |  links
 |      Returns the parsed header links of the response, if any.
 |  
 |  ok
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  text
 |      Content of the response, in unicode.
 |      
 |      If Response.encoding is None, encoding will be guessed using
 |      ``chardet``.
 |      
 |      The encoding of the response content is determined based solely on HTTP
 |      headers, following RFC 2616 to the letter. If you can take advantage of
 |      non-HTTP knowledge to make a better guess at the encoding, you should
 |      set ``r.encoding`` appropriately before accessing this property.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __attrs__ = ['_content', 'status_code', 'headers', 'url', 'history', '...

In [6]:

print(content.text)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

In [7]:

content.encoding

Out[7]:

'utf-8'
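The calls above can be wrapped into a small download helper. This is a minimal sketch, not a definitive implementation: the name fetch_html and its encoding and timeout parameters are assumptions introduced for illustration. It relies only on requests.get, Response.raise_for_status, Response.encoding and Response.text, all of which appear in the help output above.

```python
import requests


def fetch_html(url, encoding=None, timeout=10):
    """Download a page and return its decoded text (hypothetical helper)."""
    # timeout is forwarded through **kwargs so a stalled server cannot hang the crawler
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes instead of parsing an error page
    response.raise_for_status()
    if encoding is not None:
        # Override the header-based guess before .text decodes the body
        response.encoding = encoding
    return response.text
```

For example, fetch_html('http://computational-class.github.io/bigdata/data/test.html') should return the same HTML that print(content.text) showed above. Pass encoding only when the decoded text looks garbled.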

Beautiful Soup¶

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; then you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib.

Install beautifulsoup4¶

open your terminal/cmd¶

~~$ pip install beautifulsoup4~~

html.parser¶

Beautiful Soup supports the html.parser included in Python's standard library.

lxml¶

Beautiful Soup also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

html5lib¶

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
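Because lxml and html5lib are optional installs, a script can fall back to the built-in parser when they are missing. The sketch below assumes that preferring lxml is what you want; the helper name make_soup is hypothetical. Beautiful Soup raises bs4.FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, else with html.parser (hypothetical helper)."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # Raised when the requested parser is not installed
        return BeautifulSoup(html, 'html.parser')


soup = make_soup("<html><head><title>The Dormouse's story</title></head></html>")
print(soup.title.string)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth sticking to one parser for a given project.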

In [2]:

url = 'http://computational-class.github.io/bigdata/data/test.html'
content = requests.get(url)
content = content.text
soup = BeautifulSoup(content, 'html.parser') 
soup

Out[2]:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>

In [10]:

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

html
- head
  - title
- body
  - p (class = 'title', 'story' )
    - a (class = 'sister')
      - href/id

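The select Method¶

Tag names are written with no prefix
Class names are prefixed with a dot (.)
id names are prefixed with #

These are the same rules that CSS selectors use, so we can filter elements with soup.select(), which returns a list.

Three Steps for Using select¶

Inspect
Copy
- Copy Selector

Select the title The Dormouse's story with the mouse, then right-click and choose Inspect
Move the mouse to the highlighted source code
Right-click and choose Copy --> Copy Selector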

body > p.title > b

In [4]:

soup.select('body > p.title > b')[0].text

Out[4]:

"The Dormouse's story"

The select Method: Finding by Tag Name¶

In [5]:

soup.select('title')

Out[5]:

[<title>The Dormouse's story</title>]

In [6]:

soup.select('a')

Out[6]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [7]:

soup.select('b')

Out[7]:

[<b>The Dormouse's story</b>]

The select Method: Finding by Class Name¶

In [8]:

soup.select('.title')

Out[8]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [26]:

soup.select('.sister')

Out[26]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [27]:

soup.select('.story')

Out[27]:

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

The select Method: Finding by id¶

In [9]:

soup.select('#link1')

Out[9]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [16]:

soup.select('#link1')[0]['href']

Out[16]:

'http://example.com/elsie'

The select Method: Combined Lookups¶

Tag names, class names, and id names can be combined.

For example, find the element whose id is link1 inside a p tag:

In [10]:

soup.select('p #link1')

Out[10]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

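The select Method: Finding by Attribute¶

Attributes can be added to a selector.

An attribute is written in square brackets directly after the tag name, for example a[id="link1"].
The attribute and the tag belong to the same node, so there must be no space between them.
The cells that follow use a different connector, the child combinator >, which matches only direct children.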
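As a minimal sketch of attribute selectors, the block below parses an inline copy of the three-sisters paragraph so that it runs without a network connection. The variable names exact, suffix and any_href are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact match on an attribute value; no space between the tag and the bracket
exact = soup.select('a[id="link2"]')
# Match on the end of an attribute value
suffix = soup.select('a[href$="tillie"]')
# Match every <a> that has an href attribute at all
any_href = soup.select('a[href]')

print([a.text for a in exact], [a.text for a in suffix], len(any_href))
```

Writing 'a [id="link2"]' with a space would instead look for a descendant of a that carries the id, which matches nothing here.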

In [17]:

soup.select("head > title")

Out[17]:

[<title>The Dormouse's story</title>]

In [72]:

soup.select("body > p")

Out[72]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

The find_all Method¶

In [30]:

soup('p')

Out[30]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [31]:

soup.find_all('p')

Out[31]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [32]:

[i.text for i in soup('p')]

Out[32]:

["The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.',
 '...']

In [34]:

for i in soup('p'):
    print(i.text)

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

In [35]:

for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p

In [36]:

soup('head') # or soup.head

Out[36]:

[<head><title>The Dormouse's story</title></head>]

In [37]:

soup('body') # or soup.body

Out[37]:

[<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body>]

In [38]:

soup('title')  # or  soup.title

Out[38]:

[<title>The Dormouse's story</title>]

In [39]:

soup('p')

Out[39]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [40]:

soup.p

Out[40]:

<p class="title"><b>The Dormouse's story</b></p>

In [41]:

soup.title.name

Out[41]:

'title'

In [42]:

soup.title.string

Out[42]:

"The Dormouse's story"

In [43]:

soup.title.text
# the .text attribute is recommended

Out[43]:

"The Dormouse's story"

In [44]:

soup.title.parent.name

Out[44]:

'head'

In [45]:

soup.p

Out[45]:

<p class="title"><b>The Dormouse's story</b></p>

In [46]:

soup.p['class']

Out[46]:

['title']

In [47]:

soup.find_all('p', {'class': 'title'})

Out[47]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [19]:

soup.find_all('p', class_='title')

Out[19]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [49]:

soup.find_all('p', {'class': 'story'})

Out[49]:

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

In [34]:

soup.find_all('p', {'class': 'story'})[0].find_all('a')

Out[34]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [51]:

soup.a

Out[51]:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [52]:

soup('a')

Out[52]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [53]:

soup.find(id="link3")

Out[53]:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [54]:

soup.find_all('a')

Out[54]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [55]:

soup.find_all('a', {'class': 'sister'}) # compare with soup.find_all('a')

Out[55]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [56]:

soup.find_all('a', {'class': 'sister'})[0]

Out[56]:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [57]:

soup.find_all('a', {'class': 'sister'})[0].text

Out[57]:

'Elsie'

In [58]:

soup.find_all('a', {'class': 'sister'})[0]['href']

Out[58]:

'http://example.com/elsie'

In [59]:

soup.find_all('a', {'class': 'sister'})[0]['id']

Out[59]:

'link1'

In [71]:

soup.find_all(["a", "b"])

Out[71]:

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [38]:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
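In practice the lookups shown in this section are usually combined to turn a page into structured records. The sketch below assumes the same three-sisters markup, inlined so it runs offline. The records list and its field names are illustrative choices, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
soup = BeautifulSoup(html, 'html.parser')

records = []
for a in soup.find_all('a', {'class': 'sister'}):
    records.append({
        'id': a.get('id'),                # .get() returns None instead of raising KeyError
        'name': a.get_text(strip=True),   # text with surrounding whitespace removed
        'href': a.get('href'),
    })

for row in records:
    print(row)
```

A list of dictionaries like this can be written to CSV or loaded into a data frame for analysis.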

Data Scraping:¶

Scraping the Content of a WeChat Official Account Article¶

In [11]:

from IPython.display import display_html, HTML
HTML(url = 'http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd')
# the webpage we would like to crawl

Out[11]:

南大新传 | 微议题：地震中民族自豪—“中国人先撤”

南大新传院微议题排行榜

点击上方“微议题排行榜”可以订阅哦！

导读

2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。

我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。

热词图现

本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。

微议题演化趋势

总文章数

总阅读数

从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。

议题分类

如图，我们将此议题分为6大类。

尼泊尔地震

这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。

国内地震报道

主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。

中国救援回应

地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。

发改委与地震

这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。

地震常识介绍

该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。

地震中的故事

讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。

国内外地震关注差异大

关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。

阅读与点赞分析

爱国新闻容易激起点赞狂潮

整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。

人的关注小于国民尊严的保护

另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。

数据附表

微文章排行TOP50：

公众号排行TOP20：

作者：晏雪菲

出品单位：南京大学计算传播学实验中心

技术支持：南京大学谷尼舆情监测分析实验室

题图鸣谢：谷尼舆情新微榜、图悦词云

View the Page Source (Inspect)¶

In [12]:

url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = requests.get(url).text  # fetch the page's HTML text
soup = BeautifulSoup(content, 'html.parser') 

In [17]:

title = soup.select("#activity-name") # #activity-name
title[0].text.strip()

Out[17]:

'南大新传 | 微议题：地震中民族自豪—“中国人先撤”'

In [18]:

soup.find('h2', {'class': 'rich_media_title'}).text.strip()

Out[18]:

'南大新传 | 微议题：地震中民族自豪—“中国人先撤”'

In [25]:

print(soup.find('div', {'class': 'rich_media_meta_list'}))

<div class="rich_media_meta_list" id="meta_content">
<span class="rich_media_meta rich_media_meta_text">
                                                南大新传院
                                            </span>
<span class="rich_media_meta rich_media_meta_nickname" id="profileBt">
<a href="javascript:void(0);" id="js_name">
                        微议题排行榜                      </a>
<div class="profile_container" id="js_profile_qrcode" style="display:none;">
<div class="profile_inner">
<strong class="profile_nickname">微议题排行榜</strong>
<img alt="" class="profile_avatar" id="js_profile_qrcode_img" src="">
<p class="profile_meta">
<label class="profile_meta_label">微信号</label>
<span class="profile_meta_value">IssuesRank</span>
</p>
<p class="profile_meta">
<label class="profile_meta_label">功能介绍</label>
<span class="profile_meta_value">感谢关注《微议题排行榜》。我们是南京大学新闻传播学院，计算传播学实验中心，致力于研究社会化媒体时代的公共议程，发布新媒体平台的议题排行榜。</span>
</p>
</img></div>
<span class="profile_arrow_wrp" id="js_profile_arrow_wrp">
<i class="profile_arrow arrow_out"></i>
<i class="profile_arrow arrow_in"></i>
</span>
</div>
</span>
<em class="rich_media_meta rich_media_meta_text" id="publish_time"></em>
</div>

In [26]:

soup.select('#publish_time')

Out[26]:

[<em class="rich_media_meta rich_media_meta_text" id="publish_time"></em>]

In [27]:

article = soup.find('div', {'class': 'rich_media_content'}).text
print(article)

点击上方“微议题排行榜”可以订阅哦！导读2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。  议题分类 如图，我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。 人的关注小于国民尊严的保护另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。  数据附表 微文章排行TOP50：公众号排行TOP20：作者：晏雪菲出品单位：南京大学计算传播学实验中心技术支持：南京大学谷尼舆情监测分析实验室题图鸣谢：谷尼舆情新微榜、图悦词云

In [30]:

rmml = soup.find('div', {'class': 'rich_media_meta_list'})
#date = rmml.find(id = 'post-date').text
rmc = soup.find('div', {'class': 'rich_media_content'})
content = rmc.get_text()
print(title[0].text.strip())
#print(date)
print(content) 

南大新传 | 微议题：地震中民族自豪—“中国人先撤”

点击上方“微议题排行榜”可以订阅哦！导读2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。  议题分类 如图，我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。 人的关注小于国民尊严的保护另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。  数据附表 微文章排行TOP50：公众号排行TOP20：作者：晏雪菲出品单位：南京大学计算传播学实验中心技术支持：南京大学谷尼舆情监测分析实验室题图鸣谢：谷尼舆情新微榜、图悦词云
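The extraction steps in this section can be gathered into one function that tolerates missing or empty elements. The publish_time element came back empty in the output above, and find returns None when nothing matches. This is a minimal sketch: the helper names text_or_none and parse_article are hypothetical, and the sample string is a tiny stand-in for the real page, not real data.

```python
from bs4 import BeautifulSoup


def text_or_none(node):
    """Return stripped text for a tag, or None when the tag is missing or empty."""
    if node is None:
        return None
    text = node.get_text(strip=True)
    return text or None


def parse_article(html):
    """Extract title, publish time and body from article HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': text_or_none(soup.select_one('#activity-name')),
        'publish_time': text_or_none(soup.select_one('#publish_time')),
        'content': text_or_none(soup.find('div', {'class': 'rich_media_content'})),
    }


# Stand-in page that mimics the ids and classes inspected above (not real data)
sample = '''
<h2 class="rich_media_title" id="activity-name"> Sample title </h2>
<em id="publish_time"></em>
<div class="rich_media_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
'''
print(parse_article(sample))
```

Returning None for absent fields keeps a long crawl from stopping with an AttributeError when one page has a different layout.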

wechatsogou¶

pip install wechatsogou --upgrade

https://github.com/Chyroc/WechatSogou

In [15]:

!pip install wechatsogou --upgrade

Collecting wechatsogou
  Downloading https://files.pythonhosted.org/packages/37/f0/b4699c0f04cd7bd0c51f8039bc7f2797ba92dd6fc1effeec230868b33ef4/wechatsogou-4.5.4-py2.py3-none-any.whl (45kB)
    100% |████████████████████████████████| 51kB 62kB/s ta 0:00:01
Requirement already satisfied, skipping upgrade: lxml in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (3.6.4)
Collecting bs4 (from wechatsogou)
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied, skipping upgrade: Werkzeug in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (0.11.11)
Requirement already satisfied, skipping upgrade: Pillow in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (5.0.0)
Requirement already satisfied, skipping upgrade: future in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (0.16.0)
Requirement already satisfied, skipping upgrade: six in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (1.11.0)
Requirement already satisfied, skipping upgrade: requests in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (2.14.2)
Requirement already satisfied, skipping upgrade: xlrd in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from wechatsogou) (1.0.0)
Requirement already satisfied, skipping upgrade: beautifulsoup4 in /Users/datalab/Applications/anaconda/lib/python3.5/site-packages (from bs4->wechatsogou) (4.5.1)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /Users/datalab/Library/Caches/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4, wechatsogou
Successfully installed bs4-0.0.1 wechatsogou-4.5.4
You are using pip version 19.0.3, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [16]:

import wechatsogou

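# Configurable parameters

# Direct connection
ws_api = wechatsogou.WechatSogouAPI()

# Number of retries after a wrong captcha entry; the default is 1
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)

# Any parameter accepted by the requests library can be used here
# e.g. proxies: the proxy list must include at least one HTTPS proxy, and the proxies must be reachable
ws_api = wechatsogou.WechatSogouAPI(proxies={
    "http": "127.0.0.1:8889",
    "https": "127.0.0.1:8889",
})

# e.g. a timeout
ws_api = wechatsogou.WechatSogouAPI(timeout=0.1)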

In [17]:

ws_api = wechatsogou.WechatSogouAPI()
ws_api.get_gzh_info('南航青年志愿者')

please input code: HK37T5

Out[17]:

{'authentication': '\n',
 'headimage': 'https://img01.sogoucdn.com/app/a/100520090/oIWsFt1tmWoG6vO6BcsS7St61bRE',
 'introduction': '南航大志愿活动的领跑者,为你提供校内外的志愿资源和精彩消息.',
 'open_id': 'oIWsFt1tmWoG6vO6BcsS7St61bRE',
 'post_perm': 26,
 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004347&ver=1&signature=OpcTZp20TUdKHjSqWh7m73RWBIzwYwINpib2ZktBkLEWS9*s61AB7DxViyL4XhpOGnibYEovkq8eELKID5futg==',
 'qrcode': 'http://mp.weixin.qq.com/rr?src=3&timestamp=1560004347&ver=1&signature=-DnFampQflbiOadckRJaTaDRzGSNfisIfECELSo-lN-GeEOH8-XTtM*ASdavl0xuGzG1yEhZNZAikkhyhU*C93uHbUVw8Ht7qn8MtInPfZ8=',
 'view_perm': 365,
 'wechat_id': 'nanhangqinggong',
 'wechat_name': '南航青年志愿者'}

In [19]:

articles = ws_api.search_article('南京航空航天大学')

In [20]:

for i in articles:
    print(i)

{'article': {'title': '南航PK北外,这一轮拉歌高科技又养眼,你选谁?', 'abstract': '\u200d今天,南京航空航天大学PK北京外国语大学 在重点实验室里,在航空航天馆中,在国产大飞机前,南航人齐声唱响《歌唱祖国...', 'time': 1556781419, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/oq1PymRl9D4mFNqBHicjHOYp2m7uRHJhN5oqeeKE0Msebpqqj1tg1K0gPEnLZnfROzwPWvMcdebVHB7ibgjg6SGw/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=1o9CmS1bOg4M7mZDqYRx-jW5Vb2YqDMV2k5rPJTIyThFPbsagSOjJSRCv2JsuA5k8c07o*P0ZN6SDlmWnhkzC*OXrO9AMuEaC1YUZ6rLGkkQ10il2u3vA3VmR4UYw319&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM7K1jDqNGenoK7DmzRhYy9KqAXmqMNS8c99Yfy1cfHw9Q/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=mFCwcLO9hTwe*Js7TGQ457olpvr1d85gJSnVLyFgtYlw2lpbi2onr1QxZrtxIjlO4jWp*U8JuvtQ-NwVRTEYWQ==', 'wechat_name': '央视新闻'}}
{'article': {'title': '别人家的学校!南航龙虾节来了!', 'abstract': '南航龙虾美食节南京航天航空大学明故宫校区食堂:第二食堂地址:秦淮区御道街29号交通:地铁2号线明故宫下南京航空航天大学将...', 'time': 1559044955, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/EteGOZsBKnk7kPDfHiaKM0qw5iaSErOUD7Mnz7wcy3WxzgBcTBJls6mtia0ughp0CHbNzIK4z90qQ2lesuZyYTia6Q/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=-FCYyno3TxYBbdlMHwbSrWOfKA115mKCZgVMcnN4JFirsFU*C1o07R*ApwW7cdGWdWCSkjv-ZS0vl4duX21wUOq0T80bgJaxSS8UPuQtoMPinRFfmg*EH-H60B-6iQT2&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM6NEkWdTqa9dLlyZommfLzcUNAYLPSVnU4XYNpxo4UkWw/0', 'isv': 0, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=Gh-T-0msiOTZRMh-i-F85rvuaFiXP5ht3mvmMOoFkbXSWuRoDhnuAW4Fk5WpqmfOSWmEfwNARz3PyG8p0Jyq4w==', 'wechat_name': '南京头条'}}
{'article': {'title': '南航失联女大学生自行返校!失联原因意外又扎心......', 'abstract': '南京航空航天大学发公告称,该女大学生因手机没电后失去联系.近日,有消息称,5月5日晚,南京航空航天大学江宁校区一名女大学...', 'time': 1557221370, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/qkQTRn2Z9NyEw9R043Q8x4PvpwI07N6YOwoicGfFFxYNYGTfWF6Hzl37zTo1eX7MfTXj5xEB8Qm5a2xJPWLbbpw/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=xam5xXD4ybtHin*kmZ5JTyTo5tIhuHI4M095L00Ee0uPcnLhFVQtyMY0ySk9n4RSDze3-IitxySOL82hBpmje6teJdXEaaJfy5fjmH8oFDNqamj3Ia58z-R8G1DL5mqk&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM4n67RZ0NUicz2ZRrCUicHHliayU7IAyib5jQxOSXPdYtL0NQ/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=h-HZ7WT0dciMDJVrQsaiFysam0gHO3SwtgvucE1trMBZxpW0YLCqiuSgyGQhz3A1rpAO0mim-NM8ot9g5*UAdQ==', 'wechat_name': '环球网'}}
{'article': {'title': '南航女生失联34小时终于找到了!失联原因公布…', 'abstract': '何某是因为毕业设计压力大而出门散心,目前身体精神一切安好.今天一大早,@南京航空航天大学也通过微博发布公告,称该学生已...', 'time': 1557207320, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/blibSaqibwQr4Uz4lEqLMkyqajUyooxECViagcDoL22ygrhEtXMF4e9bJUymAncYtSicvZamC5JgTiaXXmhbib7wMNIA/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=vtV0pu0ds9SMBCX5Ro1vDXvUxZ07NOCHTvCkZi4mgxCuK*N0ba3QDMO61xC4hzh6*R85IJoDAr73G-XZmk9thqnnTmGXAXdT3uixABkWFUX7z*dWx1mk7K*VNPWU12bB&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM5HYaLQWzdJFFbDh61a1gv4pSNlEkfEF5vSALe9uLZAQQ/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=plftUFIybop1WKZVes2ovECOXPjFzAF2*dpZtiSDiRGpoMoqZUjMl7GcbycChsUhmG1tqTWmeDKbeXjAlGkQhw==', 'wechat_name': '中青网教育'}}
{'article': {'title': '南航女大学生坐网约车后失联?已找到!失联原因最新曝光…', 'abstract': '5月5日晚九点,南京航空航天大学大四女生何清在QQ空间发完最后一条动态就没有消息了.那是几张照片,其中有她和男朋友的照片...', 'time': 1557196046, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/mxaa4wWaSsLoOVVXMJA8hz9bMBuRChO7xBZ76ibU9KibF4lzJic1cR16ybqSEkmk310KUNKRoJ5V7mef8w4odldwA/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=7U44bsf4p0fwuE7HXhCJrcesx*TLVdPn5L1aXJ3aF5AghqzC**89CtH7Ck0flhyacoxwtS9D6spDmi6-N*WYWuzTlfT0DNlJ1eYDU1ENvWBOIrDjm*A7h7xKwFytJM-7&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM6ILJLcmMuPhhBYnKPZiaGe1lfjMTpgvUzgHyA1wL6hGLw/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=Zc2XPtLR3Aqp4od0rFh-hX8vA9Xa0ktIpvs4BTQIhqsibnL5ctGuKoDd0rusbdANaY5gOgrhQXvPV*5VUy0dtw==', 'wechat_name': '新民晚报'}}
{'article': {'title': '好消息!南航天目湖校区招生啦!招生简章看这里!', 'abstract': '家门口的大学南航天目湖校区今年招生啦!请各位考生收下强心剂招生章程一份南京航空航天大学2019年招生章程第一章 总则第一条 ...', 'time': 1559285732, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/yXbx9vlibWYWfDDTqjTOX7982sOdgTAc9gkGXxDu5z45CdceZSOZc5kcK8mjCkV6P9721AZhr6fqpdOib0dwzedw/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=d1iVBbakcsS2VALOjychAevBWqajPtAWPuZw6UuRH0rV8quRa3-okHGSmEKXpOzou4ZaJ6VUTo8*toZGWU9*VEcmhLYBIVobioL-OUbFFLjcr3lODY4uuZvhSMM4wutE&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM7PdUj2NQP48wyqr5uiaGUegHeicicTysuiahLGYD7GKzZibjw/0', 'isv': 0, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=VtuHhhibvLa5SErWLq9Gwmw*Z2h--97BaixspBe6lsuOU3IZ0Zt1y3*IQFqiFwzY*jMd3vttE4a*FHKa9lktyQ==', 'wechat_name': '溧阳焦点'}}
{'article': {'title': '394人!南大、南航、南理工、南师大自招政策公布了', 'abstract': '联系方式咨询电话:400-1859680招生主页:http://bkzs.nju.edu.cn电子邮箱:bkzs@nju.edu.cn微信公众号:ndzsxlj南京航空航天大学...', 'time': 1554122680, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/YZf1n4qCriaQH3Z3jOPKtgPpw4ZfmBvZWH9h1iaLYeZUial4odF1L9v8UR4nhDmrrXrPPEqKr7Vm11LoNgbbib1Fsg/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=gwGi2e7LJ0jHeEbfH7*pVJ2FgzgVK*mWfJJsEaOVGgPgn-jfF8aHLt7T1ZpQo0ZAfbc-eCn8P8VFn128RPbPRi095qOor6yDswUkkijFzzPDgmgDyaqzrw4peUz0h3cw&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM4r1jN17mscTSRHHDoBK20yCGPxgtD05vok3jklugbAGw/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=2OqwJTEnZF0*iXxCE0E9tvNnSzAHHbbD8cVUZUe2C1Btpk92QOwVqTk7gnAbrtGfltJpmUnXp71tajS6QLaYVA==', 'wechat_name': '南京发布'}}
{'article': {'title': '南京航空航天大学“长空学者”计划国际青年科学家高端论坛诚邀海内外英才', 'abstract': '南京航空航天大学直属于工业和信息化部,是国家“211工程”,“985工程优势学科创新平台”重点建设高校之一,也是52所设立研究...', 'time': 1555409338, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/qIaMAUnyMvHTUEuRiciaNWemCSGVhBHfRTssWACdjaoAtrXib9fP4D3j4TNORzJpHNPjcPoDeTGoiaPVFLXUSUO1PQ/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=D7QUG3vNgKBT6P3mzIAgToaqiOOO*VjOiTXMD-NLTsfLvRjbOCc*BCnKrp3Hlzw5lJepPcA-VPoMAJ0mLmcG1DcwtVPKW90uAe-Fn7aNihMy4rjOf62uWPLH7NdhfI4X&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM5n85abQtyPORCrckdfrMLssCIsjU7eJTX8XdibHWA9KHw/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=S-7U131D3eQERC8yJGVAg2edySXn*qGVi5uE8QyQU038YWJBtJ7MdkHhoe*MmilFBPEQWnoq5nO5qvrbd5mD6A==', 'wechat_name': '南京航空航天大学'}}
{'article': {'title': '南京航空航天大学、南京农业大学2019年自主招生简章发布', 'abstract': '南京航天航空大学计划招生130人,4月10日截止报名.南京农业大学计划招生30人,4月11日截止报名.南京航空航天大学2019年自...', 'time': 1554199767, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/HWaP8yWMXVy5aT0bGIFxQsFOPBDIPjCDLk7N4Poia2LS3U5nmtzXxWlNB1W6BmBdxW7tFIuF4MTK2jDe1cY3tIg/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=*jvRlB4YNP3DnI0ilWxOzOrbbbXlei30CMxIRtPcJl59CZOo4iVzwqnbFiR7JV6zejM2A0XgjgrM12MHN5TLHW8Jm1jSIApMLJDLts9cP14GXl4E9a3F8aT8GUDVkwdP&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM7JQuh5tqrfibV37E87uXhM3XecSbx96Poo36yLUuuafoA/0', 'isv': 1, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=tRYZSnr5qzQhfaNJfYyU3rwYdpsyB1nPJb0O53qWIgPS6dsWjTkmWBhJ-JRd-rOTuDEiiZwjpJdxEqDWnBM1Uw==', 'wechat_name': '自主招生指南'}}
{'article': {'title': '南京航空航天大学历年所有专业考研真题', 'abstract': '1、2017年南京航空航天大学教育综合333初试真题2、2017年南京航空航天大学金融学综合431初试真题关注微信公众号,回复南京航...', 'time': 1559831403, 'imgs': ['https://img01.sogoucdn.com/net/a/04/link?appid=100520033&url=http://mmbiz.qpic.cn/mmbiz_jpg/fMhPAhFv0XiaAOlqextqjoFz0gMl8627eeSzd5A3Ky2gKzTXiblhDX3JPN1ZybbfrSSfR0T6aVWznuIoS2OpGoDQ/0?wx_fmt=jpeg'], 'url': 'http://mp.weixin.qq.com/s?src=11&timestamp=1560004408&ver=1656&signature=V*4ar7VODE-INFkge-wYJcwcHnbW3BlHB0GK2SK6ji3m7*ovhgdR*5shEoiY3x*hzYa-a4bfaM12CthrRJGQDRIgaBVFWQWraC7xLgIEf7XVr56noaN1-pUHyQ2uje5A&new=1'}, 'gzh': {'headimage': 'http://wx.qlogo.cn/mmhead/Q3auHgzwzM4MHwaY1n88gpedjUMhZZkiaHKJZMq4aWeZ1lq1UsP5bUw/0', 'isv': 0, 'profile_url': 'http://mp.weixin.qq.com/profile?src=3&timestamp=1560004408&ver=1&signature=jNI2pQzoeP4DLNP*YhjaS8ZItS4WJp7ei*lm45qjJbJlbFfpLwW2C3OpJIsp6GaYNhCTrwSa8i3NGk9jQUcIag==', 'wechat_name': '真题仓'}}

requests + XPath: Douban Movies as an Example¶

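XPath (XML Path Language) is a language for addressing parts of an XML document.

XPath is built on the tree structure of XML and provides a way to locate nodes within that tree. It was originally proposed as a common syntax model shared by XPointer and XSL, but developers quickly adopted it as a small query language in its own right.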

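To get an element's text, first obtain the element's XPath. This step is done by hand:

Locate the target element on the page
Right-click it and choose Inspect
In the developer tools, copy the element's XPath (Copy > Copy XPath)
Append '/text()' to the copied XPath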

Reference: https://mp.weixin.qq.com/s/zx3_eflBCrrfOqFEWjAUJw
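The manual steps can be tried without a browser session or network access. The following is a minimal sketch, assuming a small hand-written HTML snippet (sample_html is made up for illustration and only loosely resembles a real page); the XPath string is the kind of value that Copy XPath produces.

```python
from lxml import etree

# Hypothetical stand-in for a downloaded page (not real Douban markup).
sample_html = """
<html><body>
  <div id="content">
    <h1><span>Three Billboards</span> <span class="year">(2017)</span></h1>
  </div>
</body></html>
"""

tree = etree.HTML(sample_html)

# XPath as copied from the browser, pointing at the first <span> in the <h1>.
xpath_info = '//*[@id="content"]/h1/span[1]'

# Appending /text() selects the text nodes instead of the element itself.
# xpath() always returns a list, so guard against an empty result.
matches = tree.xpath(xpath_info + '/text()')
title = matches[0].strip() if matches else None
print(title)
```

Checking for an empty list before indexing avoids an IndexError when the page layout differs from what the copied XPath expects.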

In [31]:

import requests
from lxml import etree

url = 'https://movie.douban.com/subject/26611804/'
data = requests.get(url).text
s = etree.HTML(data)  

If the XPath of the movie title is xpath_info, the title is extracted with:

title = s.xpath('xpath_info/text()')

where xpath_info is:

//*[@id="content"]/h1/span[1]

In [33]:

title = s.xpath('//*[@id="content"]/h1/span[1]/text()')[0]
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
actors = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
type1 = s.xpath('//*[@id="info"]/span[5]/text()')
type2 = s.xpath('//*[@id="info"]/span[6]/text()')
type3 = s.xpath('//*[@id="info"]/span[7]/text()')
time = s.xpath('//*[@id="info"]/span[11]/text()')
length = s.xpath('//*[@id="info"]/span[13]/text()')
score = s.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]

In [34]:

print(title, director, actors, type1, type2, type3, time, length, score)

三块广告牌 Three Billboards Outside Ebbing, Missouri ['马丁·麦克唐纳'] ['弗兰西斯·麦克多蒙德', '伍迪·哈里森', '山姆·洛克威尔', '艾比·考尼什', '卢卡斯·赫奇斯', '彼特·丁拉基', '约翰·浩克斯', '卡赖伯·兰德里·琼斯', '凯瑟琳·纽顿', '凯瑞·康顿', '泽利科·伊万内克', '萨玛拉·维文', '克拉克·彼得斯', '尼克·西塞', '阿曼达·沃伦', '玛拉雅·瑞沃拉·德鲁 ', '布兰登·萨克斯顿', '迈克尔·艾伦·米利甘'] ['剧情'] ['犯罪'] ['官方网站:'] ['2018-03-02(中国大陆)'] ['2017-12-01(美国)'] 8.7
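In the output above, the values stored in type3, time, and length do not match their variable names (type3 holds a label and the other two hold release dates), which suggests that positional paths such as span[7] or span[13] shift when a page has more or fewer fields. One way to reduce that fragility is to select by attribute instead of position. The sketch below runs on a hand-written snippet; the property values (v:genre, v:initialReleaseDate, v:runtime) and the helper texts_by_property are assumptions for illustration and should be checked against the real markup in the browser inspector.

```python
from lxml import etree

# Hypothetical snippet; the property attributes are assumed, not taken from the source page.
sample_info = """
<div id="info">
  <span class="pl">Genre:</span>
  <span property="v:genre">Drama</span> / <span property="v:genre">Crime</span><br/>
  <span class="pl">Release date:</span>
  <span property="v:initialReleaseDate">2017-12-01</span><br/>
  <span class="pl">Runtime:</span>
  <span property="v:runtime">115 min</span>
</div>
"""

info = etree.HTML(sample_info)

def texts_by_property(tree, prop):
    """Return the text of every <span> under #info whose property equals prop."""
    # $p is an XPath variable; lxml binds it from the keyword argument.
    return tree.xpath('//*[@id="info"]/span[@property=$p]/text()', p=prop)

genres = texts_by_property(info, 'v:genre')
release_dates = texts_by_property(info, 'v:initialReleaseDate')
runtime = texts_by_property(info, 'v:runtime')
print(genres, release_dates, runtime)
```

Because each field is matched by its own attribute, a missing field yields an empty list rather than silently returning the value of a neighboring field.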

Douban API¶

https://developers.douban.com/wiki/?title=guide

https://github.com/computational-class/douban-api-docs

In [6]:

import requests
# https://movie.douban.com/subject/26611804/
url = 'https://api.douban.com/v2/movie/subject/26611804?apikey=0b2bdeda43b5688921839c8ecb20399b&start=0&count=20&client=&udid='
jsonm = requests.get(url).json()

In [11]:

jsonm.keys()

Out[11]:

dict_keys(['schedule_url', 'title', 'aka', 'photos_count', 'languages', 'year', 'tags', 'blooper_urls', 'images', 'trailers', 'popular_reviews', 'videos', 'summary', 'clip_urls', 'do_count', 'comments_count', 'has_ticket', 'ratings_count', 'countries', 'has_video', 'collect_count', 'wish_count', 'writers', 'directors', 'id', 'mainland_pubdate', 'popular_comments', 'episodes_count', 'website', 'clips', 'casts', 'genres', 'reviews_count', 'douban_site', 'alt', 'pubdate', 'trailer_urls', 'mobile_url', 'share_url', 'durations', 'seasons_count', 'photos', 'pubdates', 'subtype', 'current_season', 'has_schedule', 'bloopers', 'collection', 'rating', 'original_title'])

In [3]:

#jsonm.values()
jsonm['rating']

Out[3]:

(dict_keys(['schedule_url', 'title', 'aka', 'photos_count', 'languages', 'year', 'tags', 'blooper_urls', 'images', 'trailers', 'popular_reviews', 'videos', 'summary', 'clip_urls', 'do_count', 'comments_count', 'has_ticket', 'ratings_count', 'countries', 'has_video', 'collect_count', 'wish_count', 'writers', 'directors', 'id', 'mainland_pubdate', 'popular_comments', 'episodes_count', 'website', 'clips', 'casts', 'genres', 'reviews_count', 'douban_site', 'alt', 'pubdate', 'trailer_urls', 'mobile_url', 'share_url', 'durations', 'seasons_count', 'photos', 'pubdates', 'subtype', 'current_season', 'has_schedule', 'bloopers', 'collection', 'rating', 'original_title']),
 {'average': 7.5,
  'details': {'1': 206.0,
   '2': 1590.0,
   '3': 15843.0,
   '4': 21556.0,
   '5': 7558.0},
  'max': 10,
  'min': 0,
  'stars': '40'})

In [4]:

jsonm['alt']

Out[4]:

'https://movie.douban.com/subject/1764796/'

In [21]:

jsonm['casts'][0]

Out[21]:

{'alt': 'https://movie.douban.com/celebrity/1010548/',
 'avatars': {'large': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1436865941.42.jpg',
  'medium': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1436865941.42.jpg',
  'small': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1436865941.42.jpg'},
 'id': '1010548',
 'name': '弗兰西斯·麦克多蒙德',
 'name_en': 'Frances McDormand'}

In [10]:

jsonm['directors']

Out[10]:

[{'alt': 'https://movie.douban.com/celebrity/1000304/',
  'avatars': {'large': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1406649730.61.jpg',
   'medium': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1406649730.61.jpg',
   'small': 'https://img3.doubanio.com/view/celebrity/s_ratio_celebrity/public/p1406649730.61.jpg'},
  'id': '1000304',
  'name': '马丁·麦克唐纳',
  'name_en': 'Martin McDonagh'}]

In [13]:

jsonm['genres']

Out[13]:

['剧情', '犯罪']
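Because the response is plain JSON, the fields inspected above can be collected into one flat record. A minimal sketch, assuming a dict shaped like the outputs above: sample is a trimmed, hand-copied stand-in for jsonm, and summarize_movie is a hypothetical helper, not part of any API client.

```python
def summarize_movie(movie):
    """Flatten the nested API response into a single record."""
    rating = movie.get('rating') or {}
    return {
        'title': movie.get('title'),
        'rating': rating.get('average'),
        'directors': [d.get('name') for d in movie.get('directors', [])],
        'casts': [c.get('name') for c in movie.get('casts', [])],
        'genres': list(movie.get('genres', [])),
    }

# Trimmed stand-in for jsonm, using values shown in the outputs above.
sample = {
    'rating': {'average': 7.5, 'max': 10, 'min': 0, 'stars': '40'},
    'directors': [{'id': '1000304', 'name': '马丁·麦克唐纳', 'name_en': 'Martin McDonagh'}],
    'casts': [{'id': '1010548', 'name': '弗兰西斯·麦克多蒙德', 'name_en': 'Frances McDormand'}],
    'genres': ['剧情', '犯罪'],
}

record = summarize_movie(sample)
print(record)
```

Using .get() means a missing key yields None or an empty list instead of raising KeyError, which is convenient when looping over many movies whose responses may not all carry the same fields.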

Assignment: Scrape the Douban Movie Top 250¶

In [55]:

import requests
from bs4 import BeautifulSoup
from lxml import etree

url0 = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url0).text
s = etree.HTML(data)

In [56]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[56]:

'肖申克的救赎'

In [57]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[2]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[57]:

'霸王别姬'

In [227]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[3]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[227]:

'这个杀手不太冷'
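The three cells above differ only in the index li[1], li[2], li[3]. Instead of repeating the expression, every li can be selected once and a relative path evaluated inside each item. A minimal sketch on a hand-written list (sample_list is an assumption, simplified from the structure implied by the XPath above):

```python
from lxml import etree

# Hypothetical, simplified markup with three list items.
sample_list = """
<div id="content"><ol>
  <li><div class="info"><a href="/subject/1/"><span class="title">Movie A</span></a></div></li>
  <li><div class="info"><a href="/subject/2/"><span class="title">Movie B</span></a></div></li>
  <li><div class="info"><a href="/subject/3/"><span class="title">Movie C</span></a></div></li>
</ol></div>
"""

tree = etree.HTML(sample_list)

titles = []
for li in tree.xpath('//*[@id="content"]//ol/li'):
    # A path starting with "." is evaluated relative to the current <li>.
    found = li.xpath('.//a/span[1]/text()')
    if found:
        titles.append(found[0])
print(titles)
```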

In [58]:

import requests
from bs4 import BeautifulSoup

url0 = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url0).text
soup = BeautifulSoup(data, 'lxml')

In [59]:

movies = soup.find_all('div', {'class': 'info'})

In [60]:

len(movies)

Out[60]:

25

In [61]:

movies[0].a['href']

Out[61]:

'https://movie.douban.com/subject/1292052/'

In [62]:

movies[0].find('span', {'class': 'title'}).text

Out[62]:

'肖申克的救赎'

In [63]:

movies[0].find('div', {'class': 'star'})

Out[63]:

<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1444420人评价</span>
</div>

In [64]:

movies[0].find('span', {'class': 'rating_num'}).text

Out[64]:

'9.6'

In [65]:

people_num = movies[0].find('div', {'class': 'star'}).find_all('span')[-1]
people_num.text.split('人评价')[0]

Out[65]:

'1444420'

In [66]:

for i in movies:
    url = i.a['href']
    title = i.find('span', {'class': 'title'}).text
    des = i.find('div', {'class': 'star'})
    rating = des.find('span', {'class': 'rating_num'}).text
    rating_num = des.find_all('span')[-1].text.split('人评价')[0]
    print(url, title, rating, rating_num)

https://movie.douban.com/subject/1292052/ 肖申克的救赎 9.6 1444420
https://movie.douban.com/subject/1291546/ 霸王别姬 9.6 1070028
https://movie.douban.com/subject/1295644/ 这个杀手不太冷 9.4 1315767
https://movie.douban.com/subject/1292720/ 阿甘正传 9.4 1136609
https://movie.douban.com/subject/1292063/ 美丽人生 9.5 666021
https://movie.douban.com/subject/1292722/ 泰坦尼克号 9.3 1077710
https://movie.douban.com/subject/1291561/ 千与千寻 9.3 1063060
https://movie.douban.com/subject/1295124/ 辛德勒的名单 9.5 592924
https://movie.douban.com/subject/3541415/ 盗梦空间 9.3 1137434
https://movie.douban.com/subject/3011091/ 忠犬八公的故事 9.3 753683
https://movie.douban.com/subject/2131459/ 机器人总动员 9.3 753202
https://movie.douban.com/subject/3793023/ 三傻大闹宝莱坞 9.2 1023706
https://movie.douban.com/subject/1292001/ 海上钢琴师 9.2 837910
https://movie.douban.com/subject/1291549/ 放牛班的春天 9.3 710391
https://movie.douban.com/subject/1292064/ 楚门的世界 9.2 787025
https://movie.douban.com/subject/1292213/ 大话西游之大圣娶亲 9.2 793255
https://movie.douban.com/subject/1889243/ 星际穿越 9.2 814373
https://movie.douban.com/subject/1291560/ 龙猫 9.2 702137
https://movie.douban.com/subject/1291841/ 教父 9.3 513800
https://movie.douban.com/subject/5912992/ 熔炉 9.3 462360
https://movie.douban.com/subject/1307914/ 无间道 9.2 652219
https://movie.douban.com/subject/25662329/ 疯狂动物城 9.2 899529
https://movie.douban.com/subject/1849031/ 当幸福来敲门 9.0 829997
https://movie.douban.com/subject/3319755/ 怦然心动 9.0 918966
https://movie.douban.com/subject/6786002/ 触不可及 9.2 547463

In [67]:

for i in range(0, 250, 25):
    print('https://movie.douban.com/top250?start=%d&filter='% i)

https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
https://movie.douban.com/top250?start=75&filter=
https://movie.douban.com/top250?start=100&filter=
https://movie.douban.com/top250?start=125&filter=
https://movie.douban.com/top250?start=150&filter=
https://movie.douban.com/top250?start=175&filter=
https://movie.douban.com/top250?start=200&filter=
https://movie.douban.com/top250?start=225&filter=

In [68]:

import requests
from bs4 import BeautifulSoup
dat = []
for j in range(0, 250, 25):
    # each page lists 25 movies; start= is the offset of the first one
    urli = 'https://movie.douban.com/top250?start=%d&filter='% j
    data = requests.get(urli).text
    soup = BeautifulSoup(data, 'lxml')
    movies = soup.find_all('div', {'class': 'info'})
    for i in movies:
        url = i.a['href']
        title = i.find('span', {'class': 'title'}).text
        des = i.find('div', {'class': 'star'})
        rating = des.find('span', {'class': 'rating_num'}).text
        # the last <span> reads like '1444420人评价'; keep only the number
        rating_num = des.find_all('span')[-1].text.split('人评价')[0]
        listi = [url, title, rating, rating_num]
        dat.append(listi)
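The loop above sends ten requests back to back with no headers, timeout, or status check. A slightly more defensive variant is sketched below; the User-Agent string, the one-second pause, and the helper names page_urls and fetch_html are all assumptions for illustration, not requirements stated by the site.

```python
import time
import requests

BASE = 'https://movie.douban.com/top250?start=%d&filter='
HEADERS = {'User-Agent': 'Mozilla/5.0 (teaching example)'}  # assumed value

def page_urls(total=250, page_size=25):
    """Build the list of page URLs from the regular start= pattern."""
    return [BASE % start for start in range(0, total, page_size)]

def fetch_html(url, pause=1.0):
    """Download one page, fail loudly on HTTP errors, then wait briefly."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    time.sleep(pause)
    return response.text

urls = page_urls()
print(len(urls), urls[-1])
# pages = [fetch_html(u) for u in urls]   # uncomment to actually download
```

Keeping URL construction separate from downloading makes it possible to check the paging pattern without touching the network.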

In [69]:

import pandas as pd
df = pd.DataFrame(dat, columns = ['url', 'title', 'rating', 'rating_num'])
df['rating'] = df.rating.astype(float)
df['rating_num'] = df.rating_num.astype(int)
df.head()

Out[69]:

	url	title	rating	rating_num
0	https://movie.douban.com/subject/1292052/	肖申克的救赎	9.6	1444420
1	https://movie.douban.com/subject/1291546/	霸王别姬	9.6	1070028
2	https://movie.douban.com/subject/1295644/	这个杀手不太冷	9.4	1315767
3	https://movie.douban.com/subject/1292720/	阿甘正传	9.4	1136609
4	https://movie.douban.com/subject/1292063/	美丽人生	9.5	666021

In [3]:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df.rating_num)
plt.show()

In [19]:

plt.hist(df.rating)
plt.show()

In [11]:

# scatter plot: number of ratings (x) vs. rating (y), labeled with movie titles
fig = plt.figure(figsize=(16, 16),facecolor='white')
plt.plot(df.rating_num, df.rating, 'bo')
for i in df.index:
    plt.text(df.rating_num[i], df.rating[i], df.title[i], 
             fontsize = df.rating[i], 
             color = 'red', rotation = 45)
plt.show() 

In [123]:

df[df.rating > 9.4]

Out[123]:

	url	title	rating	rating_num
0	https://movie.douban.com/subject/1292052/	肖申克的救赎	9.6	1004428
1	https://movie.douban.com/subject/1291546/	霸王别姬	9.5	730274
4	https://movie.douban.com/subject/1292063/	美丽人生	9.5	469332
41	https://movie.douban.com/subject/1296141/	控方证人	9.6	108598

In [69]:

alist = []
for i in df.index:
    alist.append( [df.rating_num[i], df.rating[i], df.title[i] ])

blist =[[df.rating_num[i], df.rating[i], df.title[i] ] for i in df.index] 

alist

Out[69]:

[[1021383, 9.5999999999999996, '肖申克的救赎'],
 [742984, 9.5, '霸王别姬'],
 [957578, 9.4000000000000004, '这个杀手不太冷'],
 [814634, 9.4000000000000004, '阿甘正传'],
 [475813, 9.5, '美丽人生'],
 [762619, 9.3000000000000007, '千与千寻'],
 [754309, 9.3000000000000007, '泰坦尼克号'],
 [433191, 9.4000000000000004, '辛德勒的名单'],
 [853620, 9.3000000000000007, '盗梦空间'],
 [559729, 9.3000000000000007, '机器人总动员'],
 [657670, 9.1999999999999993, '海上钢琴师'],
 [767473, 9.1999999999999993, '三傻大闹宝莱坞'],
 [529473, 9.1999999999999993, '忠犬八公的故事'],
 [513071, 9.1999999999999993, '放牛班的春天'],
 [561091, 9.1999999999999993, '大话西游之大圣娶亲'],
 [533017, 9.0999999999999996, '楚门的世界'],
 [473631, 9.0999999999999996, '龙猫'],
 [385130, 9.1999999999999993, '教父'],
 [309138, 9.1999999999999993, '熔炉'],
 [560855, 9.1999999999999993, '星际穿越'],
 [299301, 9.1999999999999993, '乱世佳人'],
 [416073, 9.0999999999999996, '触不可及'],
 [458107, 9.0, '无间道'],
 [606767, 8.9000000000000004, '当幸福来敲门'],
 [337952, 9.0999999999999996, '天堂电影院'],
 [633995, 8.9000000000000004, '怦然心动'],
 [190977, 9.4000000000000004, '十二怒汉'],
 [434420, 9.0, '搏击俱乐部'],
 [640800, 9.0, '少年派的奇幻漂流'],
 [260089, 9.1999999999999993, '鬼子来了'],
 [367866, 9.0999999999999996, '蝙蝠侠：黑暗骑士'],
 [314885, 9.0999999999999996, '指环王3：王者无敌'],
 [306344, 9.0999999999999996, '活着'],
 [369956, 9.0, '天空之城'],
 [585740, 9.1999999999999993, '疯狂动物城'],
 [426150, 8.9000000000000004, '罗马假日'],
 [451703, 8.9000000000000004, '大话西游之月光宝盒'],
 [554642, 8.9000000000000004, '飞屋环游记'],
 [249586, 9.0999999999999996, '窃听风暴'],
 [296760, 9.0999999999999996, '两杆大烟枪'],
 [111737, 9.5999999999999996, '控方证人'],
 [301329, 9.0, '飞越疯人院'],
 [358755, 8.9000000000000004, '闻香识女人'],
 [393556, 8.9000000000000004, '哈尔的移动城堡'],
 [196094, 9.3000000000000007, '海豚湾'],
 [464601, 8.8000000000000007, 'V字仇杀队'],
 [237421, 9.0999999999999996, '辩护人'],
 [309071, 9.0, '死亡诗社'],
 [207619, 9.0999999999999996, '教父2'],
 [333942, 8.9000000000000004, '美丽心灵'],
 [296196, 9.0, '指环王2：双塔奇兵'],
 [331529, 8.9000000000000004, '指环王1：魔戒再现'],
 [411534, 8.8000000000000007, '情书'],
 [223469, 9.0999999999999996, '饮食男女'],
 [517803, 9.0999999999999996, '摔跤吧！爸爸'],
 [191667, 9.0999999999999996, '美国往事'],
 [309325, 8.9000000000000004, '狮子王'],
 [220420, 9.0, '钢琴家'],
 [520325, 8.6999999999999993, '天使爱美丽'],
 [205704, 9.0999999999999996, '素媛'],
 [469032, 8.6999999999999993, '七宗罪'],
 [153673, 9.1999999999999993, '小鞋子'],
 [320506, 8.9000000000000004, '被嫌弃的松子的一生'],
 [375951, 8.8000000000000007, '致命魔术'],
 [378652, 8.8000000000000007, '看不见的客人'],
 [251308, 8.9000000000000004, '音乐之声'],
 [315215, 8.8000000000000007, '勇敢的心'],
 [523686, 8.6999999999999993, '剪刀手爱德华'],
 [425844, 8.8000000000000007, '本杰明·巴顿奇事'],
 [365086, 8.8000000000000007, '低俗小说'],
 [385562, 8.6999999999999993, '西西里的美丽传说'],
 [307307, 8.8000000000000007, '黑客帝国'],
 [262404, 8.9000000000000004, '拯救大兵瑞恩'],
 [383825, 8.6999999999999993, '沉默的羔羊'],
 [338488, 8.8000000000000007, '入殓师'],
 [414361, 8.6999999999999993, '蝴蝶效应'],
 [677352, 8.6999999999999993, '让子弹飞'],
 [270494, 8.8000000000000007, '春光乍泄'],
 [244643, 8.9000000000000004, '玛丽和马克思'],
 [111733, 9.1999999999999993, '大闹天宫'],
 [295606, 8.8000000000000007, '心灵捕手'],
 [189568, 8.9000000000000004, '末代皇帝'],
 [292721, 8.8000000000000007, '阳光灿烂的日子'],
 [254400, 8.8000000000000007, '幽灵公主'],
 [252833, 8.8000000000000007, '第六感'],
 [359281, 8.6999999999999993, '重庆森林'],
 [389844, 8.6999999999999993, '禁闭岛'],
 [345885, 8.8000000000000007, '布达佩斯大饭店'],
 [271656, 8.6999999999999993, '大鱼'],
 [142601, 9.0, '狩猎'],
 [284871, 8.6999999999999993, '哈利·波特与魔法石'],
 [296911, 8.6999999999999993, '射雕英雄传之东成西就'],
 [344355, 8.5999999999999996, '致命ID'],
 [248165, 8.8000000000000007, '甜蜜蜜'],
 [344588, 8.5999999999999996, '断背山'],
 [251749, 8.6999999999999993, '猫鼠游戏'],
 [166973, 8.9000000000000004, '一一'],
 [367791, 8.6999999999999993, '告白'],
 [289385, 8.8000000000000007, '阳光姐妹淘'],
 [373118, 8.5999999999999996, '加勒比海盗'],
 [166903, 8.9000000000000004, '上帝之城'],
 [97659, 9.1999999999999993, '摩登时代'],
 [162190, 8.9000000000000004, '穿条纹睡衣的男孩'],
 [565530, 8.5999999999999996, '阿凡达'],
 [237864, 8.6999999999999993, '爱在黎明破晓前'],
 [385266, 8.6999999999999993, '消失的爱人'],
 [188690, 8.8000000000000007, '风之谷'],
 [212467, 8.6999999999999993, '爱在日落黄昏时'],
 [181917, 8.8000000000000007, '侧耳倾听'],
 [275127, 8.5999999999999996, '倩女幽魂'],
 [146507, 8.9000000000000004, '红辣椒'],
 [241887, 8.6999999999999993, '恐怖直播'],
 [185888, 8.8000000000000007, '超脱'],
 [217398, 8.6999999999999993, '萤火虫之墓'],
 [304866, 8.6999999999999993, '驯龙高手'],
 [239308, 8.5999999999999996, '幸福终点站'],
 [195650, 8.6999999999999993, '菊次郎的夏天'],
 [144405, 8.9000000000000004, '小森林 夏秋篇'],
 [341432, 8.5, '喜剧之王'],
 [323425, 8.5999999999999996, '岁月神偷'],
 [232077, 8.6999999999999993, '借东西的小人阿莉埃蒂'],
 [82623, 9.1999999999999993, '七武士'],
 [405200, 8.5, '神偷奶爸'],
 [222549, 8.6999999999999993, '杀人回忆'],
 [102681, 9.0, '海洋'],
 [332455, 8.5, '真爱至上'],
 [210611, 8.6999999999999993, '电锯惊魂'],
 [415291, 8.5, '贫民窟的百万富翁'],
 [191225, 8.6999999999999993, '谍影重重3'],
 [149579, 8.8000000000000007, '喜宴'],
 [266681, 8.5999999999999996, '东邪西毒'],
 [295660, 8.5, '记忆碎片'],
 [220414, 8.5999999999999996, '雨人'],
 [257769, 8.5999999999999996, '怪兽电力公司'],
 [440539, 8.5, '黑天鹅'],
 [391224, 8.6999999999999993, '疯狂原始人'],
 [179698, 8.6999999999999993, '英雄本色'],
 [154659, 8.6999999999999993, '燃情岁月'],
 [127219, 8.8000000000000007, '卢旺达饭店'],
 [112345, 8.9000000000000004, '虎口脱险'],
 [189074, 8.6999999999999993, '7号房的礼物'],
 [300454, 8.5, '恋恋笔记本'],
 [125724, 8.9000000000000004, '小森林 冬春篇'],
 [320997, 8.5, '傲慢与偏见'],
 [208380, 8.5999999999999996, '海边的曼彻斯特'],
 [290089, 8.6999999999999993, '哈利·波特与死亡圣器(下)'],
 [168987, 8.6999999999999993, '萤火之森'],
 [138798, 8.8000000000000007, '教父3'],
 [86319, 9.0, '完美的世界'],
 [156471, 8.6999999999999993, '纵横四海'],
 [151799, 8.8000000000000007, '荒蛮故事'],
 [105774, 8.8000000000000007, '二十二'],
 [135526, 8.8000000000000007, '魂断蓝桥'],
 [259388, 8.5, '猜火车'],
 [194663, 8.5999999999999996, '穿越时空的少女'],
 [201714, 8.8000000000000007, '玩具总动员3'],
 [260957, 8.5, '花样年华'],
 [97486, 9.0, '雨中曲'],
 [183786, 8.5999999999999996, '心迷宫'],
 [214531, 8.5999999999999996, '时空恋旅人'],
 [351836, 8.4000000000000004, '唐伯虎点秋香'],
 [392857, 8.5999999999999996, '超能陆战队'],
 [110358, 8.8000000000000007, '我是山姆'],
 [309924, 8.5999999999999996, '蝙蝠侠：黑暗骑士崛起'],
 [199924, 8.5999999999999996, '人工智能'],
 [139242, 8.6999999999999993, '浪潮'],
 [285601, 8.4000000000000004, '冰川时代'],
 [289504, 8.4000000000000004, '香水'],
 [288650, 8.5, '朗读者'],
 [132226, 8.6999999999999993, '罗生门'],
 [174301, 8.8000000000000007, '请以你的名字呼唤我'],
 [251364, 8.5999999999999996, '爆裂鼓手'],
 [85770, 8.9000000000000004, '追随'],
 [138571, 8.6999999999999993, '一次别离'],
 [104317, 8.8000000000000007, '未麻的部屋'],
 [181166, 8.5999999999999996, '撞车'],
 [334741, 8.6999999999999993, '血战钢锯岭'],
 [135259, 8.6999999999999993, '可可西里'],
 [182221, 8.5, '战争之王'],
 [343703, 8.3000000000000007, '恐怖游轮'],
 [89868, 8.8000000000000007, '地球上的星星'],
 [116667, 8.6999999999999993, '梦之安魂曲'],
 [176988, 8.6999999999999993, '达拉斯买家俱乐部'],
 [270993, 8.5999999999999996, '被解救的姜戈'],
 [192717, 8.5, '阿飞正传'],
 [112326, 8.6999999999999993, '牯岭街少年杀人事件'],
 [200329, 8.5, '谍影重重'],
 [166328, 8.5, '谍影重重2'],
 [204653, 8.5, '魔女宅急便'],
 [240090, 8.6999999999999993, '头脑特工队'],
 [164479, 8.8000000000000007, '房间'],
 [63374, 9.0, '忠犬八公物语'],
 [87474, 8.9000000000000004, '惊魂记'],
 [110499, 8.6999999999999993, '碧海蓝天'],
 [179269, 8.5, '再次出发之纽约遇见你'],
 [231647, 8.4000000000000004, '青蛇'],
 [157071, 8.5999999999999996, '小萝莉的猴神大叔'],
 [53476, 9.1999999999999993, '东京物语'],
 [312322, 8.3000000000000007, '秒速5厘米'],
 [84575, 8.9000000000000004, '哪吒闹海'],
 [109454, 8.6999999999999993, '末路狂花'],
 [169778, 8.5999999999999996, '海盗电台'],
 [111040, 8.6999999999999993, '绿里奇迹'],
 [147035, 8.5999999999999996, '终结者2：审判日'],
 [424177, 8.3000000000000007, '源代码'],
 [267159, 8.5999999999999996, '模仿游戏'],
 [192005, 8.5, '新龙门客栈'],
 [162903, 8.5, '黑客帝国3：矩阵革命'],
 [147043, 8.5, '勇闯夺命岛'],
 [189831, 8.5, '这个男人来自地球'],
 [125973, 8.6999999999999993, '一个叫欧维的男人决定去死'],
 [129304, 8.5999999999999996, '卡萨布兰卡'],
 [494602, 8.4000000000000004, '你的名字。'],
 [46323, 9.1999999999999993, '城市之光'],
 [221714, 8.4000000000000004, '变脸'],
 [132083, 8.5999999999999996, '荒野生存'],
 [53099, 9.0999999999999996, '迁徙的鸟'],
 [159426, 8.5, 'E.T. 外星人'],
 [192409, 8.4000000000000004, '发条橙'],
 [231469, 8.4000000000000004, '无耻混蛋'],
 [479894, 8.3000000000000007, '初恋这件小事'],
 [53709, 9.0999999999999996, '黄金三镖客'],
 [191992, 8.4000000000000004, '美国丽人'],
 [121427, 8.8000000000000007, '爱在午夜降临前'],
 [178607, 8.4000000000000004, '英国病人'],
 [60049, 9.0, '无人知晓'],
 [110300, 8.5999999999999996, '燕尾蝶'],
 [120585, 8.5999999999999996, '非常嫌疑犯'],
 [328162, 8.3000000000000007, '疯狂的石头'],
 [112286, 8.5999999999999996, '叫我第一名'],
 [90201, 8.9000000000000004, '勇士'],
 [242926, 8.3000000000000007, '穆赫兰道'],
 [190730, 8.5999999999999996, '无敌破坏王'],
 [352129, 8.3000000000000007, '国王的演讲'],
 [77399, 8.8000000000000007, '步履不停'],
 [137843, 8.5, '血钻'],
 [99101, 8.5999999999999996, '上帝也疯狂'],
 [186988, 8.4000000000000004, '彗星来的那一夜'],
 [103282, 8.5999999999999996, '枪火'],
 [278772, 8.3000000000000007, '蓝色大门'],
 [97025, 8.5999999999999996, '大卫·戈尔的一生'],
 [134046, 8.5, '遗愿清单'],
 [59825, 9.0, '我爱你'],
 [89377, 8.6999999999999993, '千钧一发'],
 [139223, 8.5, '荒岛余生'],
 [48744, 9.0, '爱·回家'],
 [119390, 8.5, '黑鹰坠落'],
 [131277, 8.8000000000000007, '聚焦'],
 [131618, 8.5, '麦兜故事'],
 [148685, 8.4000000000000004, '暖暖内含光']]
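The same nested list can be produced without an explicit loop or comprehension. A minimal sketch using a three-row stand-in for df (df_small and clist are illustrative names; the values are copied from the output above):

```python
import pandas as pd

# Three rows copied from the output above, standing in for the full df.
df_small = pd.DataFrame({
    'title': ['肖申克的救赎', '霸王别姬', '这个杀手不太冷'],
    'rating': [9.6, 9.5, 9.4],
    'rating_num': [1021383, 742984, 957578],
})

# Select the columns in the desired order, then convert the rows to plain lists.
clist = df_small[['rating_num', 'rating', 'title']].values.tolist()
print(clist)
```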

In [70]:

    
from IPython.display import display_html, HTML
HTML('<iframe src=http://nbviewer.jupyter.org/github/computational-class/bigdata/blob/gh-pages/vis/douban250bubble.html \
     width=1000 height=500></iframe>')

Out[70]:

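Assignments¶

Scrape the latest issue of the 复旦新媒体 (Fudan New Media) WeChat official account

Simulating a Douban login with requests.post (including fetching the CAPTCHA)¶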

https://blog.csdn.net/zhuzuwei/article/details/80875538

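Scraping Ten Years of Proposals from the Jiangsu Provincial CPPCC¶

Open http://www.jszx.gov.cn/zxta/2019ta/

The list on this page is filled in by JavaScript, so the data has to be found in the requests the page makes.

Inspecting those requests in the browser's Network panel reveals proposalList.jsp
- Its headers show the form_data sent with the request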

In [71]:

import requests
from bs4 import BeautifulSoup

In [72]:

# query parameters found in the request headers of proposalList.jsp
form_data = {'year':2019,
        'pagenum':1,
        'pagesize':20
}
url = 'http://www.jszx.gov.cn/wcm/zxweb/proposalList.jsp'
content = requests.get(url, params=form_data)
content.encoding = 'utf-8'
js = content.json()

In [74]:

js['data']['totalcount']

Out[74]:

'424'

In [75]:

dat = js['data']['list']
pagenum = js['data']['pagecount']

Collecting the Links to All Proposals¶

In [76]:

# page 1 is already in dat; request the remaining pages and append their rows
for i in range(2, pagenum+1):
    print(i)
    form_data['pagenum'] = i
    content = requests.get(url, params=form_data)
    content.encoding = 'utf-8'
    js = content.json()
    dat.extend(js['data']['list'])
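The paging logic above (read pagecount from the first response, then request pages 2 through pagecount) can be separated from the network call, which makes it easy to check offline. In this sketch, collect_all_pages and fake_fetch are hypothetical names, and fake_fetch imitates only the parts of the response used above (data.list, data.pagecount, data.totalcount).

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(pagenum) for every page and return the merged records."""
    first = fetch_page(1)['data']
    records = list(first['list'])
    # pagecount may arrive as a string or a number, so normalize it.
    for pagenum in range(2, int(first['pagecount']) + 1):
        records.extend(fetch_page(pagenum)['data']['list'])
    return records

def fake_fetch(pagenum, pagesize=2, total=5):
    """Offline stand-in for requests.get(url, params=...).json()."""
    rows = [{'rownum': n} for n in range(1, total + 1)]
    start = (pagenum - 1) * pagesize
    pagecount = -(-total // pagesize)  # ceiling division
    return {'data': {'list': rows[start:start + pagesize],
                     'pagecount': pagecount,
                     'totalcount': str(total)}}

all_rows = collect_all_pages(fake_fetch)
print(len(all_rows))
```

With the real endpoint, fetch_page could be something like lambda n: requests.get(url, params={**form_data, 'pagenum': n}).json(), reusing url and form_data from the cells above.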

In [77]:

len(dat)

Out[77]:

In [78]:

dat[0]

Out[78]:

{'personnel_name': '邹正',
 'pkid': '18b1b347f9e34badb8934c2acec80e9e',
 'proposal_number': '0001',
 'publish_time': '2019-01-12 16:04:48',
 'reason': '关于完善城市环卫公厕指引系统的建议',
 'rownum': 1,
 'type': '城乡建设',
 'year': '2019'}

In [79]:

import pandas as pd

df = pd.DataFrame(dat)
df.head()

Out[79]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year
0	邹正	18b1b347f9e34badb8934c2acec80e9e	0001	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	0002	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019
2	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	0004	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	3	工业商贸	2019
3	段绪强	ce60d71296764cfe997d62bb2c0990af	0005	2019-01-12 16:21:46	关于落实金融政策、促进民营企业高质量发展的建议	4	财税金融	2019
4	侯建军	8b5fb5a7d86547899835a12af398ffc7	0006	2019-01-12 15:42:19	关于主基地航空公司协同东部机场集团发展的建议	5	工业商贸	2019

In [158]:

df.groupby('type').size()

Out[158]:

type
农林水利     4
医卫体育    45
城乡建设    25
工业商贸    34
政治建设    18
教育事业    58
文化宣传    34
法制建设    24
社会事业    77
科学技术    25
经济发展    52
统战综合     4
财税金融    12
资源环境    24
dtype: int64

Scraping the Proposal Content¶

http://www.jszx.gov.cn/zxta/2019ta/index_61.html?pkid=18b1b347f9e34badb8934c2acec80e9e

http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=18b1b347f9e34badb8934c2acec80e9e

In [80]:

url_base = 'http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid='
urls = [url_base + i  for i in df['pkid']]

In [81]:

import sys
def flushPrint(www):
    sys.stdout.write('\r')
    sys.stdout.write('%s' % www)
    sys.stdout.flush()
    
text = []
for k, i in enumerate(urls):
    flushPrint(k)
    content = requests.get(i)
    content.encoding = 'utf-8'
    js = content.json()
    js = js['data']['binfo']['_content']
    soup = BeautifulSoup(js, 'html.parser') 
    text.append(soup.text)
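Each response nests the proposal body as an HTML string under data > binfo > _content, and the loop above stops at the first response that lacks that key. A more tolerant parsing step might look like the sketch below; extract_text is a hypothetical helper and sample_response is a made-up payload with the same nesting.

```python
from bs4 import BeautifulSoup

def extract_text(payload):
    """Return the plain text of data.binfo._content, or '' if it is missing."""
    html = ((payload.get('data') or {}).get('binfo') or {}).get('_content') or ''
    # A separator keeps words from adjacent tags apart; strip trims whitespace.
    return BeautifulSoup(html, 'html.parser').get_text(' ', strip=True)

# Made-up payload that mirrors the nesting used in the loop above.
sample_response = {'data': {'binfo': {'_content': '<p>Findings:</p><p>More signs are needed.</p>'}}}

print(extract_text(sample_response))
print(repr(extract_text({'data': {}})))
```

Returning an empty string for incomplete responses keeps the text list the same length as urls, so the later df['content'] = text assignment still lines up row by row.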

In [82]:

len(text)

Out[82]:

In [83]:

df['content'] = text

In [84]:

df.head()

Out[84]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year	content
0	邹正	18b1b347f9e34badb8934c2acec80e9e	0001	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019	调研情况： 2015 年 4 月 1 日，习近平总书记首次提出要坚持不懈地推进“厕所革...
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	0002	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019	调研情况：近期，省政协学习委员会组织部分委员对我省信息通信业发展情况进行考察调研，总的感到，...
2	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	0004	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	3	工业商贸	2019	调研情况：早在2012年，美国国会的一份报告就声称，华为和中兴通讯可能涉嫌从事威胁美国...
3	段绪强	ce60d71296764cfe997d62bb2c0990af	0005	2019-01-12 16:21:46	关于落实金融政策、促进民营企业高质量发展的建议	4	财税金融	2019	调研情况：2018年，国家支持民营企业融资所出台的政策众多、且力度空前。这在一定程度上提振了...
4	侯建军	8b5fb5a7d86547899835a12af398ffc7	0006	2019-01-12 15:42:19	关于主基地航空公司协同东部机场集团发展的建议	5	工业商贸	2019	调研情况：2018年初，在呈报的题为《关于大力发展江苏民航补齐综合交通运输体系短板的几点建议...

In [181]:

df.to_csv('../data/jszx2019.csv', index = False)

In [182]:

dd = pd.read_csv('../data/jszx2019.csv')
dd.head()

Out[182]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year	content
0	邹正	18b1b347f9e34badb8934c2acec80e9e	1	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019	调研情况： 2015 年 4 月 1 日，习近平总书记首次提出要坚持不懈地推进“厕所革...
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	2	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019	调研情况：近期，省政协学习委员会组织部分委员对我省信息通信业发展情况进行考察调研，总的感到，...
2	韩鸣明	9d9b03f2e78345faa265eb99ce49e97e	3	2019-01-12 16:24:23	关于加快建立省民营经济发展推进机制的建议	3	经济发展	2019	调研情况：习近平总书记在全国民营企业座谈会上指出，要把支持民营企业发展作为一项重要任...
3	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	4	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	4	工业商贸	2019	调研情况：早在2012年，美国国会的一份报告就声称，华为和中兴通讯可能涉嫌从事威胁美国...
4	段绪强	ce60d71296764cfe997d62bb2c0990af	5	2019-01-12 16:21:46	深化落实金融政策举措 ,促进民营企业高质量发展	5	财税金融	2019	调研情况：2018年，国家支持民营企业融资所出台的政策众多、且力度空前。这在一定程度上提振了...
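In the reloaded table, proposal_number appears as 1, 2, 3 rather than 0001, 0002, 0003: read_csv infers an integer column, so the leading zeros are lost. Passing dtype for that column keeps it as text. A minimal sketch with an in-memory CSV (csv_text is an abbreviated stand-in for the saved file, with two rows taken from the table above):

```python
import io
import pandas as pd

# Abbreviated stand-in for ../data/jszx2019.csv
csv_text = "personnel_name,proposal_number,year\n邹正,0001,2019\n省政协学习委员会,0002,2019\n"

# Default behavior: the column is parsed as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str preserves the original identifiers.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'proposal_number': str})

print(inferred['proposal_number'].tolist())
print(as_text['proposal_number'].tolist())
```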
