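Web Crawling: The Scrapy Framework in Detail
Introduction to Twisted
Twisted is an event-driven networking engine written in Python, and Scrapy depends on it.
It is an asynchronous, non-blocking network framework built around an event loop, which is what lets a crawler issue requests concurrently.
What Twisted is, and how it differs from requests:
- requests is a Python module that sends HTTP requests the way a browser would; it wraps the socket calls needed to send a request.
- Twisted is an asynchronous, non-blocking network framework built on an event loop. It also wraps the socket calls, but it can run many requests concurrently in a single thread.
The defining features of Twisted are:
- Non-blocking: it does not wait for an operation to finish
- Asynchronous: results are delivered through callbacks
- Event loop: it keeps looping to check the state of each connection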
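The three features above can be illustrated without Twisted itself. The following is a minimal sketch that uses only the standard library (selectors and socket.socketpair); it is a toy loop under those assumptions, not how Twisted's reactor is implemented, and the function name run_event_loop is made up for the illustration.

```python
import selectors
import socket


def run_event_loop(messages):
    """Send each message over its own socket pair and collect them via callbacks."""
    selector = selectors.DefaultSelector()
    received = []
    writers = []

    def on_readable(conn):
        # Callback: runs only after the loop has seen that data is ready.
        received.append(conn.recv(1024).decode("utf-8"))
        selector.unregister(conn)
        conn.close()

    for text in messages:
        reader, writer = socket.socketpair()
        reader.setblocking(False)  # non-blocking: recv never waits for data
        selector.register(reader, selectors.EVENT_READ, on_readable)
        writer.sendall(text.encode("utf-8"))
        writers.append(writer)

    # Event loop: keep checking which sockets are ready until none remain.
    while selector.get_map():
        for key, _events in selector.select(timeout=1):
            key.data(key.fileobj)

    for writer in writers:
        writer.close()
    selector.close()
    return received
```

A single thread serves every connection here: the loop asks which sockets are ready and then calls the callback registered for each one, so nothing ever blocks on one slow peer.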
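Scrapy's pipeline and items files
What these two files are for
Recall the example from the previous post:
In that example, the parse method in chouti.py alone was enough to crawl the news on Chouti and save it to a file,
but it has two problems:
1. While looping over the pages, the file is reopened and closed for every page, which hurts performance badly when there is a lot of data.
2. Parsing and persistence live in the same method of the same file, so responsibilities are not clearly separated.
Solving both problems calls for the pipeline and items files that Scrapy generates for us automatically.
How to use the two files
Using them to solve the problem takes four steps: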
a. Write the class in the pipelines file, in this form:

    class XXXPipeline(object):
        def process_item(self, item, spider):
            return item

b. Write the class in the items file, in this form:

    import scrapy

    class XXXItem(scrapy.Item):
        href = scrapy.Field()
        title = scrapy.Field()

c. Configure the settings file:

    ITEM_PIPELINES = {
        'xxx.pipelines.XXXPipeline': 300,
        # 'xxx.pipelines.XXXPipeline2': 600,  # the number is the priority: the larger the number, the lower the priority (the later it runs)
    }

d. Yield an Item object from the parse method:

    from xxx.items import XXXItem

    def parse(self, response):
        ...
        # Keyword names must match the fields declared on XXXItem (href, title)
        yield XXXItem(title=title, href=href)

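The execution flow is:
While the spider's parse method runs, every time it yields an XXXItem, Scrapy looks up the ITEM_PIPELINES setting,
finds the XXXPipeline class, and calls its process_item method, where we are free to do whatever processing we need.
process_item is not the only method a pipeline can define, though.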
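To make the flow concrete, here is a minimal sketch in plain Python of how an item might travel through the configured pipelines in priority order. It is a simplified simulation, not Scrapy's real implementation: DropItem is a local stand-in for scrapy.exceptions.DropItem, and UppercaseTitle, RequireHref and run_pipelines are hypothetical names invented for the illustration.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch needs no Scrapy install."""


class UppercaseTitle:
    def process_item(self, item, spider):
        item["title"] = item["title"].upper()
        return item


class RequireHref:
    def process_item(self, item, spider):
        if not item.get("href"):
            raise DropItem("missing href")
        return item


def run_pipelines(item, pipelines_by_priority, spider=None):
    """Pass item through the pipelines in ascending priority order.

    Returns the final item, or None if a pipeline dropped it.
    """
    ordered = sorted(pipelines_by_priority.items(), key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # later pipelines never see a dropped item
    return item


# Same shape as ITEM_PIPELINES: the smaller number runs first.
PIPELINES = {UppercaseTitle(): 600, RequireHref(): 300}
```

With these priorities RequireHref (300) runs before UppercaseTitle (600), so an item without an href is dropped before its title is ever touched.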
Pipeline methods in detail

    class FilePipeline(object):
        def __init__(self, path):
            self.f = None
            self.path = path

        @classmethod
        def from_crawler(cls, crawler):
            """
            Called at start-up to create the pipeline object
            :param crawler:
            :return:
            """
            # Read the configured output file path from the settings file
            path = crawler.settings.get('HREF_FILE_PATH')
            return cls(path)

        def open_spider(self, spider):
            """
            Called when the spider starts running
            :param spider:
            :return:
            """
            self.f = open(self.path, 'a+')

        def process_item(self, item, spider):
            # Persist the item here
            self.f.write(item['href'] + '\n')
            return item  # hand the item to the next pipeline's process_item
            # raise DropItem()  # raising this instead (from scrapy.exceptions) stops later pipelines' process_item from running

        def close_spider(self, spider):
            """
            Called when the spider closes
            :param spider:
            :return:
            """
            self.f.close()

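The order in which these hooks fire can be checked by driving a pipeline by hand. The sketch below repeats the FilePipeline shape so that it is self-contained and replaces Scrapy's crawler with a hypothetical stand-in object that only offers settings.get; run_lifecycle is a name made up for the illustration, and the call order follows the description above rather than Scrapy's source.

```python
import os
import tempfile
from types import SimpleNamespace


class FilePipeline:
    """Same shape as the pipeline above, repeated so the sketch is self-contained."""

    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("HREF_FILE_PATH"))

    def open_spider(self, spider):
        self.f = open(self.path, "a+", encoding="utf-8")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


def run_lifecycle(items):
    """Call the hooks in the order described above and return the file's lines."""
    path = os.path.join(tempfile.mkdtemp(), "hrefs.txt")
    # Hypothetical stand-in for Scrapy's crawler: only .settings.get is needed.
    fake_crawler = SimpleNamespace(settings={"HREF_FILE_PATH": path})

    pipeline = FilePipeline.from_crawler(fake_crawler)  # 1. build the object
    pipeline.open_spider(spider=None)                   # 2. open the file once
    for item in items:                                  # 3. once per yielded item
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)                  # 4. close the file once

    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

The file is opened and closed exactly once no matter how many items pass through, which is the fix for the first problem listed earlier.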
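Deduplication
Scrapy's built-in deduplication
The example in the previous post shows that Scrapy already deduplicates requests for us while it loops over the page numbers.
The home page shows the page numbers 1 through 10 together with their links, and when the spider reaches page 2
it sees the same ten page numbers and links again, yet it does not crawl page 1 a second time.
Internally, Scrapy stores every URL it has crawled in a set. Before crawling a new page it checks whether the URL is already in the set:
if it is, the page is skipped; if not, the page is crawled and then added to the set. The set does not hold the raw URL, though.
Each request is first passed through request_fingerprint(), which turns it into a fixed-length value similar to an MD5 digest, and this saves storage space.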
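The idea of a set of fingerprints can be sketched with the standard library alone. This is a simplified stand-in, not Scrapy's request_fingerprint(): the fingerprint function below hashes only the method and a lightly normalized URL with SHA-1, and the names fingerprint and SeenRequests are assumptions made for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Return a fixed-length hex digest for a request (simplified stand-in)."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(f"{method.upper()} {canonical}".encode("utf-8")).hexdigest()


class SeenRequests:
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        """Return True if an equivalent request was already recorded."""
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

Every digest is 40 hex characters regardless of how long the URL is, which is where the storage saving comes from.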
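Custom deduplication
Scrapy's built-in deduplication is sometimes not enough for our needs, in which case we have to write our own.
Custom deduplication takes two steps.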
1. Write the DupeFilter class

    # Older Scrapy releases exposed this module as scrapy.dupefilter
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint

    class XXXDupeFilter(BaseDupeFilter):
        def __init__(self):
            '''Create a set that holds the fingerprints of the URLs already crawled'''
            self.visited_fd = set()

        @classmethod
        def from_settings(cls, settings):
            '''
            If a custom DupeFilter class overrides this method of the parent class,
            Scrapy calls it first to obtain the DupeFilter object;
            if it is not defined, the object is created through __init__ instead
            '''
            return cls()

        def request_seen(self, request):
            '''Check the request against the set and record it if it is new'''
            # Convert the request (including its URL) into a fingerprint
            fd = request_fingerprint(request=request)
            # If the fingerprint is already in the set, return True and the spider will not crawl this URL again
            if fd in self.visited_fd:
                return True
            self.visited_fd.add(fd)
            return False

        def open(self):  # can return deferred
            '''Runs before crawling starts'''
            print('start')

        def close(self, reason):  # can return a deferred
            '''Runs after crawling ends'''
            print('end')

        def log(self, request, spider):  # log that a request has been filtered
            '''Logging can be done in this method'''
            print('log')

2. Configure the settings file

    # Override the default deduplication rule
    # DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    DUPEFILTER_CLASS = 'xxx.dupefilters.XXXDupeFilter'

Depth
Depth is the number of link levels the spider is allowed to follow.
Limiting it only takes one setting:

    # Limit the crawl depth
    DEPTH_LIMIT = 3

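What a depth limit does can be shown on a toy link graph. The sketch below is an assumption-laden illustration rather than Scrapy's code: the LINKS site map and the crawl function are hypothetical, the start page counts as depth 0, and each followed link adds one level. Note that in Scrapy a DEPTH_LIMIT of 0 means no limit at all, which this sketch does not model.

```python
from collections import deque

# Hypothetical site map: each page lists the pages it links to.
LINKS = {
    "/": ["/page/1", "/page/2"],
    "/page/1": ["/news/a"],
    "/page/2": ["/news/b"],
    "/news/a": ["/comments/a"],
    "/news/b": [],
    "/comments/a": [],
}


def crawl(start, depth_limit):
    """Return {url: depth} for every page reachable within depth_limit."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in LINKS.get(url, []):
            depth = depths[url] + 1
            # Skip links that are too deep or were already scheduled.
            if depth > depth_limit or link in depths:
                continue
            depths[link] = depth
            queue.append(link)
    return depths
```

With a limit of 2 the crawl reaches the news pages but never requests /comments/a, which sits three links away from the start page.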
Cookies
Getting the cookies returned by the previous request

    import scrapy
    from scrapy.http.cookies import CookieJar

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def parse(self, response):
            # Read the cookies from the response headers; they are stored in the cookie_jar object
            cookie_jar = CookieJar()
            cookie_jar.extract_cookies(response, response.request)
            # Flatten the jar's nested mapping into a plain dict of name -> value
            for k, v in cookie_jar._cookies.items():
                for i, j in v.items():
                    for m, n in j.items():
                        self.cookie_dict[m] = n.value

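The same "response headers to dict" step can be tried without Scrapy. The following sketch uses the standard library's http.cookies.SimpleCookie instead of Scrapy's CookieJar, so it is a different tool applied to the same idea; the function name cookies_to_dict and the cookie names in the usage note are made up for the example.

```python
from http.cookies import SimpleCookie


def cookies_to_dict(set_cookie_headers):
    """Collect name -> value pairs from a list of Set-Cookie header values."""
    cookie_dict = {}
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            cookie_dict[name] = morsel.value
    return cookie_dict
```

For instance, passing ["gpsd=abc123; Path=/; HttpOnly"] keeps only the name and value, which is all a later request needs to send back.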
Sending the cookies with the next request

    from scrapy import Request

    # Inside a spider callback such as parse:
    yield Request(
        url='https://dig.chouti.com/login',
        method='POST',
        body="phone=861300000000&password=12345678&oneMonth=1",
        cookies=self.cookie_dict,
        headers={
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        },
        callback=self.check_login
    )

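The request body above is written by hand, which breaks as soon as a value contains a character such as & or =. A small sketch of building the same body with urllib.parse.urlencode follows; the helper name build_login_body is an assumption, and the field names are simply the ones that appear in the body string above.

```python
from urllib.parse import urlencode


def build_login_body(phone, password, one_month=1):
    """Encode the login fields as an application/x-www-form-urlencoded body."""
    return urlencode({"phone": phone, "password": password, "oneMonth": one_month})
```

The returned string can be passed as the body argument unchanged, and reserved characters in the password are percent-encoded automatically.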
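Does that feel like a lot of work?
In fact there is a much simpler way:
just add meta={'cookiejar': True} to the Request arguments, and Scrapy's cookie handling stores and resends the cookies for you.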
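Web Crawling: Setting a Proxy in the Scrapy Framework
Background
A brief introduction to os.environ
os.environ gives access to the environment variables of the current process. Note that this means the current process only:
an environment variable set inside one program cannot be read by another, separately started program.
The environment variables are exposed as a dict-like mapping, so values can be read and set with the usual dictionary operations.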
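The "current process only" point is easy to verify. The sketch below starts a second Python interpreter that sets a variable in its own environment, then checks what the first process sees; the variable name DEMO_CRAWLER_PROXY, the address and the function name are all hypothetical and chosen only for the demonstration.

```python
import os
import subprocess
import sys

# Hypothetical variable name, chosen only for this demonstration.
VAR = "DEMO_CRAWLER_PROXY"


def set_in_child_process():
    """Set VAR inside a separate Python process, then report what each side sees."""
    os.environ.pop(VAR, None)  # dictionary-style removal in this process
    child_code = (
        f"import os; os.environ[{VAR!r}] = 'http://127.0.0.1:8888'; "
        f"print(os.environ[{VAR!r}])"
    )
    child = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True, text=True, check=True,
    )
    seen_by_child = child.stdout.strip()
    seen_by_parent = os.environ.get(VAR)  # the child's change never arrives here
    return seen_by_child, seen_by_parent
```

The child prints the value it set, while the parent still gets None from os.environ.get, because each process works on its own copy of the environment.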
Common os.environ keys
Windows:

    os.environ['HOMEPATH']: the current user's home directory.
    os.environ['TEMP']: the path of the temporary directory.
    os.environ['PATHEXT']: the file extensions treated as executable.
    os.environ['SYSTEMROOT']: the system root directory.
    os.environ['LOGONSERVER']: the machine name of the logon server.
    os.environ['PROMPT']: the command prompt setting.

Linux:

    os.environ['USER']: the current user.
    os.environ['LC_COLLATE']: the collation order used when sorting the results of pathname expansion.
    os.environ['SHELL']: the shell in use.
    os.environ['LANG']: the language in use.
    os.environ['SSH_AUTH_SOCK']: the path of the ssh-agent socket.

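The built-in way
How it works
Scrapy already has built-in proxy support. It works by reading the proxy settings from the environment variables and then using them,
so all we have to do is put the proxy into the environment variables as key-value pairs before the program runs.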
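How a proxy gets from an environment variable into a program can be seen with the standard library's urllib.request.getproxies_environment(), which collects every variable whose name ends in _proxy. As far as we know Scrapy's built-in proxy middleware relies on the same family of standard-library helpers, but the sketch below does not import Scrapy and should be read as an illustration of the lookup only; proxies_from_env is a hypothetical helper name.

```python
import os
from urllib.request import getproxies_environment


def proxies_from_env(**variables):
    """Set the given *_PROXY variables and return the scheme -> proxy mapping."""
    # Clear any proxy variables already present so the result is predictable.
    for name in list(os.environ):
        if name.lower().endswith("_proxy"):
            del os.environ[name]
    os.environ.update(variables)
    return getproxies_environment()
```

Calling proxies_from_env(HTTPS_PROXY="http://username:password@192.168.11.11:9999/") yields a mapping keyed by the lower-cased scheme, here "https".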
Code
Method 1: add the key-value pairs directly

    import os

    import scrapy
    from scrapy import Request

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def start_requests(self):
            # Values must be plain strings; a trailing comma would turn them into a tuple and raise an error
            os.environ['HTTPS_PROXY'] = "http://username:password@192.168.11.11:9999/"
            os.environ['HTTP_PROXY'] = '19.11.2.32'
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse)

Method 2: pass the proxy through the meta argument

    import scrapy
    from scrapy import Request

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def start_requests(self):
            for url in self.start_urls:
                # The proxy value is a plain URL string, without extra quotes inside it
                yield Request(url=url, callback=self.parse,
                              meta={'proxy': 'http://username:password@192.168.11.11:9999/'})

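The custom way
How it works
We can take the built-in class (a middleware) that adds the proxy as a model and improve on it. For example, the built-in way can only use one proxy at a time,
so we can keep a list of many proxy addresses and pick one at random for each request, which helps avoid getting the IP banned for sending too many requests.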
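The two moving parts of that idea, picking a proxy at random and turning its credentials into a Proxy-Authorization value, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the middleware itself: choose_proxy is a hypothetical helper, it parses the address with urlsplit rather than the private helper a real middleware might use, and the addresses are the article's placeholders.

```python
import base64
import random
from urllib.parse import unquote, urlsplit

# Placeholder proxy addresses in the same form the article uses.
PROXIES = [
    "http://username:password@192.168.11.11:9999/",
    "http://username:password@192.168.11.12:9999/",
]


def choose_proxy(proxies, rng=random):
    """Pick one proxy and split it into (proxy_url, authorization_header_or_None)."""
    parts = urlsplit(rng.choice(proxies))
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    proxy_url = f"{parts.scheme}://{host}"
    if parts.username is None:
        return proxy_url, None
    # Basic auth: base64 of "user:password", prefixed with the scheme name.
    credentials = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
    token = base64.b64encode(credentials.encode("latin-1"))
    return proxy_url, b"Basic " + token
```

Keeping the credentials out of the proxy URL and in a separate header value mirrors the split that the middleware code in the next section performs.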
Code

    import base64
    import random
    # Python 3 locations; the original used six and urllib2 for Python 2 compatibility
    from urllib.parse import unquote, urlunparse
    from urllib.request import _parse_proxy

    import scrapy
    from scrapy import Request
    from scrapy.utils.python import to_bytes

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def start_requests(self):
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse,
                              meta={'proxy': 'http://username:password@192.168.11.11:9999/'})

    class XXXProxyMiddleware(object):
        def _basic_auth_header(self, username, password):
            user_pass = to_bytes(
                '%s:%s' % (unquote(username), unquote(password)),
                encoding='latin-1')
            return base64.b64encode(user_pass).strip()

        def process_request(self, request, spider):
            PROXIES = [
                "http://username:password@192.168.11.11:9999/",
                "http://username:password@192.168.11.12:9999/",
                "http://username:password@192.168.11.13:9999/",
                "http://username:password@192.168.11.14:9999/",
                "http://username:password@192.168.11.15:9999/",
                "http://username:password@192.168.11.16:9999/",
            ]
            # Pick a random proxy for this request
            url = random.choice(PROXIES)
            orig_type = ""
            proxy_type, user, password, hostport = _parse_proxy(url)
            # Rebuild the proxy URL without the credentials
            proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))
            if user:
                creds = self._basic_auth_header(user, password)
            else:
                creds = None
            request.meta['proxy'] = proxy_url
            if creds:
                request.headers['Proxy-Authorization'] = b'Basic ' + creds

After writing the class, register it in the settings file:

    DOWNLOADER_MIDDLEWARES = {
        'spider.xxx.XXXProxyMiddleware': 543,
    }