Web Crawlers: The Scrapy Framework in Detail

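Introduction to Twisted

Twisted is an event-driven networking engine framework implemented in Python, and Scrapy depends on it.

It is an asynchronous, non-blocking networking framework built on an event loop, which is what lets a crawler issue requests concurrently.

What Twisted is, and how it differs from requests:

requests is a Python module that sends HTTP requests much as a browser would; it wraps sockets to send them.
Twisted is an asynchronous, non-blocking networking framework built on an event loop. It also wraps sockets to send requests, but it can carry out concurrent requests in a single thread.

Twisted's characteristics are:

Non-blocking: it does not wait
Asynchronous: callbacks
Event loop: it keeps looping to check status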
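To make these three properties concrete, here is a minimal sketch that uses the standard library's asyncio in place of Twisted. That substitution is an assumption made so the example runs without extra installs; Twisted's own API looks different. The fake_request coroutine and the delays are hypothetical stand-ins for real network calls.

```python
import asyncio


def fetch_all(delays):
    """Run simulated requests concurrently in one thread.

    delays maps a request name to the seconds it "takes".
    Returns the names in the order the requests finished.
    """
    finished = []

    async def fake_request(name, delay):
        # Non-blocking: while this request waits, the loop serves the others.
        await asyncio.sleep(delay)
        return name

    async def main():
        tasks = []
        for name, delay in delays.items():
            task = asyncio.ensure_future(fake_request(name, delay))
            # Asynchronous: the callback runs once the result is ready.
            task.add_done_callback(lambda t: finished.append(t.result()))
            tasks.append(task)
        # Event loop: keeps cycling until every task has completed.
        await asyncio.gather(*tasks)

    asyncio.run(main())
    return finished


if __name__ == '__main__':
    print(fetch_all({'page1': 0.05, 'page2': 0.01, 'page3': 0.03}))
```

The requests finish in order of their delays rather than in the order they were started, even though only one thread is involved.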

Scrapy's pipeline file and items file

What these two files are for

First, look at the example from the previous post:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    '''
    Crawl post information from chouti.com
    '''
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # Get the parent div of the post list
        content_div = response.xpath('//div[@id="content-list"]')

        # Get the list of post items
        items_list = content_div.xpath('.//div[@class="item"]')

        # Open a file handle so the scraped data can be written to a file
        with open('articles.log', 'a+', encoding='utf-8') as f:
            # Loop over items_list
            for item in items_list:
                # Get the text and URL of the first a tag in each item
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                # print(href, text.strip())
                # print('-'*100)
                f.write(href + '\n')
                f.write(text.strip() + '\n')
                f.write('-' * 100 + '\n')

        # Get the pagination links so the crawler follows every page
        # List of pagination container selectors
        page_list = response.xpath('//div[@id="dig_lcpage"]')
        # Loop over the list
        for page in page_list:
            # Get the href of every a tag under it, i.e. the link of each page
            # (extract() returns a list, so handle the links one at a time)
            for page_a_url in page.xpath('.//a/@href').extract():
                # Join the domain and the relative URL
                page_url = 'https://dig.chouti.com' + page_a_url

                # The important step!
                # Instantiate a Request object and yield it;
                # Scrapy then fetches the link given in url and calls
                # the Request's callback with the response
                yield Request(url=page_url, callback=self.parse)
      

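In this example, the parse method in the single file chouti.py already crawls the news from chouti.com and saves it to a file,

but two problems stand out:

1. While crawling page after page, the file has to be reopened and closed for every page; with a large amount of data this hurts performance considerably.

2. Parsing and data persistence sit in the same method of the same file, so responsibilities are not clearly separated.

Solving both problems calls for the pipeline file and the items file that Scrapy generates for us.

How to use these two files

Using these two files to solve the problems takes four steps:

a. Write the class in the pipeline file, in the following form: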

class XXXPipeline(object):
    def process_item(self, item, spider):
        return item
           

     

b. Write the class in the items file, in the following form:

import scrapy


class XXXItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()

c. Configure the settings file

ITEM_PIPELINES = {
    'xxx.pipelines.XXXPipeline': 300,
    # 'xxx.pipelines.XXXPipeline2': 600,  # the number is the priority: the larger the number, the lower the priority
}

d. Yield an Item object in the parse method

from xxx.items import XXXItem


def parse(self, response):
    ...
    # Field names must match those declared on XXXItem (href, title)
    yield XXXItem(title=text, href=href)

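The execution flow is:

While the spider's parse method runs, whenever it yields an XXXItem, Scrapy looks up the ITEM_PIPELINES setting in the configuration file,

finds the XXXPipeline class, and calls its methods, where we can do all kinds of processing.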
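The flow above can be modeled roughly as follows. This is a simplified stand-in written with plain Python, not Scrapy's actual implementation: DropItem here is a local class that mimics scrapy.exceptions.DropItem, and run_pipelines plus the two pipeline classes are hypothetical names introduced for illustration.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class StripTitlePipeline:
    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item  # passed on to the next pipeline


class RequireHrefPipeline:
    def process_item(self, item, spider):
        if not item.get('href'):
            raise DropItem('missing href')  # later pipelines never see it
        return item


def run_pipelines(item, pipelines, spider=None):
    """pipelines is a list of (priority, pipeline) pairs; lower numbers run first."""
    for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None
    return item


PIPELINES = [(600, StripTitlePipeline()), (300, RequireHrefPipeline())]
kept = run_pipelines({'href': '/link/1', 'title': '  hello  '}, PIPELINES)
dropped = run_pipelines({'href': '', 'title': 'no link'}, PIPELINES)
```

Although StripTitlePipeline is listed first, RequireHrefPipeline runs first because its number is smaller, which mirrors how the numbers in ITEM_PIPELINES order the chain.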

Of course, process_item is not the only method a pipeline can define.

Pipeline methods in detail

class FilePipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization to create the pipeline object
        :param crawler:
        :return:
        """
        # Read the configured output file path from the settings file
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        """
        Called when the spider starts running
        :param spider:
        :return:
        """
        self.f = open(self.path, 'a+')

    def process_item(self, item, spider):
        # Do the persistence here
        self.f.write(item['href'] + '\n')
        return item  # hand the item to the next pipeline's process_item method
        # raise DropItem()  # if raised instead, later pipelines' process_item methods are not run

    def close_spider(self, spider):
        """
        Called when the spider is closed
        :param spider:
        :return:
        """
        self.f.close()
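One way to see the order in which these hooks fire is to drive them by hand, without starting a crawl. The sketch below is only an approximation of what Scrapy does: fake_crawler is a hypothetical stand-in that offers nothing beyond settings.get(), the output goes to a temporary file, and run_lifecycle is a helper name invented here.

```python
import os
import tempfile
from types import SimpleNamespace


class FilePipeline:
    """Trimmed copy of the pipeline above, kept here so the sketch is self-contained."""

    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('HREF_FILE_PATH'))

    def open_spider(self, spider):
        self.f = open(self.path, 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(item['href'] + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()


def run_lifecycle(items):
    """Call the hooks in lifecycle order and return the file contents."""
    path = os.path.join(tempfile.mkdtemp(), 'hrefs.log')
    # Hypothetical stand-in for Scrapy's crawler: only .settings.get() is needed.
    fake_crawler = SimpleNamespace(settings={'HREF_FILE_PATH': path})
    pipeline = FilePipeline.from_crawler(fake_crawler)  # 1. create the object
    pipeline.open_spider(spider=None)                   # 2. open the file once
    for item in items:
        pipeline.process_item(item, spider=None)        # 3. once per item
    pipeline.close_spider(spider=None)                  # 4. close the file once
    with open(path, encoding='utf-8') as f:
        return f.read()
```

The file is opened and closed exactly once no matter how many items pass through, which is the fix for the first problem noted earlier.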

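Deduplication

Scrapy's built-in deduplication

The example from the previous post shows that Scrapy already deduplicates for us while it loops over the page links.

The home page shows page numbers 1 through 10 with their links, and when the crawler reaches page 2

it sees the same 10 pages and links again, yet it does not crawl page 1 a second time.

Internally, deduplication works by storing the URLs already crawled in a set. Before crawling a new page, Scrapy checks whether it is already in the set:

if it is, the page is not crawled again; if it is not, the page is crawled and then added to the set. The set does not hold the raw URLs, though:

each request is converted by the request_fingerprint() method into an MD5-like value, which saves storage space.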
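The idea of a set of fingerprints can be sketched with the standard library alone. This is a simplified illustration and not Scrapy's request_fingerprint(): the fingerprint function below is an assumption that hashes only the method and a canonicalized URL, and SeenRequests is a hypothetical class name.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method plus a canonical form of the URL (query parameters sorted)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))
    data = (method.upper() + ' ' + canonical).encode('utf-8')
    return hashlib.sha1(data).hexdigest()


class SeenRequests:
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        """Return True if an equivalent request was recorded before, else record it."""
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

Every fingerprint has the same fixed length however long the URL is, and two URLs that differ only in the order of their query parameters collapse to one entry.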

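Custom deduplication

Scrapy's built-in deduplication is sometimes not enough for our needs, and then we have to write our own.

Custom deduplication takes two steps.

1. Write a DupeFilter class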

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class XXXDupeFilter(BaseDupeFilter):
    def __init__(self):
        '''Initialize a set that stores the fingerprints of URLs already crawled'''
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        '''
        If we define our own DupeFilter class and override this parent method,
        Scrapy calls it first to obtain the DupeFilter object;
        if it is not defined, the object is created through __init__
        '''
        return cls()

    def request_seen(self, request):
        '''Check whether the request was seen before and record it in the set'''
        # Convert the request into a fingerprint, then check whether it is in the set
        fd = request_fingerprint(request=request)
        # If it is already in the set, return True and the crawler skips this URL
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)
        return False

    def open(self):  # can return deferred
        '''Called before crawling starts'''
        print('start')

    def close(self, reason):  # can return a deferred
        '''Called after crawling ends'''
        print('end')

    def log(self, request, spider):  # log that a request has been filtered
        '''Logging can be done in this method'''
        print('log')

2. Configure the settings file

# Change the default deduplication rule
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_CLASS = 'xxx.dupefilters.XXXDupeFilter'

Depth

Depth is the number of link levels the crawler follows.

Limiting the depth only takes one setting:

# Limit the depth
DEPTH_LIMIT = 3
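What a depth limit means in practice can be shown with a small breadth-first crawl over an in-memory link graph. This is a sketch under stated assumptions: the LINKS graph and the crawl function are hypothetical, the start page counts as depth 0, and a limit of 0 is treated as "no limit".

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/b': [],
    '/a/1': ['/a/1/x'],
    '/a/1/x': [],
}


def crawl(start, depth_limit):
    """Breadth-first crawl that drops links deeper than depth_limit (0 = no limit)."""
    visited = [start]
    queue = deque([(start, 0)])  # the start page has depth 0
    while queue:
        page, depth = queue.popleft()
        for link in LINKS.get(page, []):
            child_depth = depth + 1
            if depth_limit and child_depth > depth_limit:
                continue  # too deep: the request is discarded
            if link not in visited:
                visited.append(link)
                queue.append((link, child_depth))
    return visited
```

With a limit of 1 only the pages linked directly from the start page are reached; raising the limit by one adds one more layer of links.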

Cookies

Getting the cookies returned by the previous request

import scrapy
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def parse(self, response):

        # Read the cookies from the response headers; they are stored in the cookie_jar object
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)

        # Flatten the cookies in the jar into a dict
        # (_cookies is nested as domain -> path -> name -> cookie)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
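The nested loops above boil down to collecting name/value pairs from the Set-Cookie headers. As a rough illustration of that step outside Scrapy, the sketch below uses the standard library's http.cookies instead of Scrapy's CookieJar; the header values and cookie names are made up for the example, and cookies_to_dict is a hypothetical helper.

```python
from http.cookies import SimpleCookie


def cookies_to_dict(set_cookie_headers):
    """Collect name -> value pairs from a list of Set-Cookie header values."""
    cookie_dict = {}
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        # Attributes such as Domain and Path live on the morsel, not as keys.
        for name, morsel in parsed.items():
            cookie_dict[name] = morsel.value
    return cookie_dict


# Made-up header values for illustration only.
headers = [
    'gpsd=abc123; Domain=chouti.com; Path=/; HttpOnly',
    'JSESSIONID=xyz789; Path=/',
]
cookie_dict = cookies_to_dict(headers)
```

The resulting dict has the same shape as the cookie_dict attribute built by the spider, so it could be passed to a request's cookies argument in the same way.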
           

     

Sending the cookies with the next request

yield Request(
    url='https://dig.chouti.com/login',
    method='POST',
    body="phone=861300000000&password=12345678&oneMonth=1",
    cookies=self.cookie_dict,
    headers={
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
    },
    callback=self.check_login
)
           

     

Does that feel like a lot of work?

Well, as it happens,

all you need to do is add meta={'cookiejar': True} to the arguments of the Request object!

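Web Crawlers: Setting Proxies in the Scrapy Framework

Preliminaries

A brief introduction to os.environ

os.environ gives access to the environment variables of the current process. Note: the current process.

If one program sets an environment variable, another, separately started program cannot read that variable.

The environment variables are exposed as a dictionary-like object, so the usual dictionary methods can be used to read or set values.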
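A short experiment shows both the dictionary-style access and the per-process scope. This is a sketch, with DEMO_CRAWLER_VAR and DEMO_CHILD_ONLY as made-up variable names; it starts child Python processes with subprocess. One nuance it brings out: a child process started afterwards inherits a copy of the parent's environment, but nothing the child sets travels back.

```python
import os
import subprocess
import sys


def child_sees_parent_variable():
    """A variable set here is inherited by a child process started afterwards."""
    os.environ['DEMO_CRAWLER_VAR'] = 'from-parent'
    code = "import os; print(os.environ.get('DEMO_CRAWLER_VAR'))"
    result = subprocess.run([sys.executable, '-c', code],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def parent_sees_child_variable():
    """A variable set inside the child never reaches the parent's environment."""
    code = "import os; os.environ['DEMO_CHILD_ONLY'] = '1'"
    subprocess.run([sys.executable, '-c', code], check=True)
    return os.environ.get('DEMO_CHILD_ONLY')
```

Using os.environ.get() rather than indexing avoids a KeyError when the variable is absent.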

Common os.environ keys

Windows:

os.environ['HOMEPATH']: the current user's home directory.
os.environ['TEMP']: the path of the temporary directory.
os.environ['PATHEXT']: the file extensions treated as executable.
os.environ['SYSTEMROOT']: the system root directory.
os.environ['LOGONSERVER']: the name of the logon server.
os.environ['PROMPT']: the command prompt setting.

Linux:

os.environ['USER']: the current user.
os.environ['LC_COLLATE']: the collation order used when sorting the results of pathname expansion.
os.environ['SHELL']: the type of shell in use.
os.environ['LANG']: the language in use.
os.environ['SSH_AUTH_SOCK']: the path of the ssh-agent socket.

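The built-in approach

How it works

Scrapy already implements proxy support internally: it reads the configured proxy from the environment variables and then uses it,

so all we need to do is put the proxy into the environment variables as key-value pairs before the program runs.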
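The lookup from environment variables can be observed with the standard library's urllib.request.getproxies(), which, as I understand it, is the helper Scrapy's built-in proxy middleware relies on. The sketch below is illustrative only: proxies_with is a hypothetical helper, and the lowercase variable name is used because urllib prefers lowercase names when both spellings are present.

```python
import os
from urllib.request import getproxies


def proxies_with(name, value):
    """Temporarily set one environment variable and return what getproxies() reports."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        return getproxies()
    finally:
        # Restore the environment so the demo leaves no trace.
        if previous is None:
            del os.environ[name]
        else:
            os.environ[name] = previous


found = proxies_with('https_proxy', 'http://username:password@192.168.11.11:9999/')
```

The part of the variable name before _proxy becomes the key, so https_proxy shows up under 'https' in the returned dict.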

Code

First approach: add the key-value pairs directly

import os

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # Environment values must be plain strings.
        # Caveat: the built-in proxy middleware reads the environment when it is
        # created, so these assignments may need to run earlier (for example at
        # module import time) to take effect.
        os.environ['HTTPS_PROXY'] = "http://username:password@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
           

     

Second approach: set the meta argument

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            # The proxy URL is passed as a plain string in meta['proxy']
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://username:password@192.168.11.11:9999/'})
           

     

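The custom approach

How it works

We can take the built-in class (a middleware) that adds the proxy as a model and improve on it. For example, the built-in approach can only use one proxy at a time;

we can keep a list holding many proxy addresses and pick one at random, which helps avoid having an IP banned for sending too many requests.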
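The core of that idea, choosing a proxy at random and turning its credentials into a Basic auth value, can be sketched without Scrapy. The helper name pick_proxy and its rng parameter are assumptions introduced for this illustration, and the addresses are placeholders.

```python
import base64
import random
from urllib.parse import unquote, urlsplit

# Placeholder proxy addresses.
PROXIES = [
    'http://username:password@192.168.11.11:9999/',
    'http://username:password@192.168.11.12:9999/',
]


def pick_proxy(proxies, rng=random):
    """Choose a proxy at random.

    Returns (proxy URL without credentials, Basic auth value or None).
    """
    parts = urlsplit(rng.choice(proxies))
    host = parts.hostname if parts.port is None else '%s:%s' % (parts.hostname, parts.port)
    proxy_url = '%s://%s' % (parts.scheme, host)
    if parts.username is None:
        return proxy_url, None
    user_pass = '%s:%s' % (unquote(parts.username), unquote(parts.password or ''))
    token = base64.b64encode(user_pass.encode('latin-1')).decode('ascii')
    return proxy_url, 'Basic ' + token
```

In a downloader middleware, the two return values would typically end up in request.meta['proxy'] and in the Proxy-Authorization request header.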

Code

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://username:password@192.168.11.11:9999/'})
           

             
           

import base64
import random

from six.moves.urllib.parse import unquote, urlunparse

try:
    # Python 2
    from urllib2 import _parse_proxy
except ImportError:
    # Python 3
    from urllib.request import _parse_proxy

from scrapy.utils.python import to_bytes


class XXProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        # Build the base64 "username:password" token for HTTP Basic auth.
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://username:password@192.168.11.11:9999/",
            "http://username:password@192.168.11.12:9999/",
            "http://username:password@192.168.11.13:9999/",
            "http://username:password@192.168.11.14:9999/",
            "http://username:password@192.168.11.15:9999/",
            "http://username:password@192.168.11.16:9999/",
        ]
        # Pick a random proxy for each request.
        url = random.choice(PROXIES)

        # Split the URL into scheme, credentials and host:port, then
        # rebuild the proxy address without the credentials.
        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            # Credentials are sent in a header, not in the proxy URL.
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
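To make it easier to see what `process_request` computes for a single proxy URL, here is a minimal standard-library sketch of the same split. It is an illustration only, not the middleware itself: `split_proxy` is a hypothetical helper name, and it uses `urllib.parse.urlsplit` in place of the private `_parse_proxy` function used above.

```python
import base64
from urllib.parse import unquote, urlsplit


def split_proxy(url):
    """Hypothetical helper: return (proxy URL without credentials, auth header or None)."""
    parts = urlsplit(url)
    # Rebuild "scheme://host:port" with the credentials stripped out.
    if parts.port is None:
        hostport = parts.hostname
    else:
        hostport = "%s:%d" % (parts.hostname, parts.port)
    proxy_url = "%s://%s" % (parts.scheme, hostport)

    if parts.username is None:
        return proxy_url, None

    # Same encoding steps as _basic_auth_header: unquote, latin-1, base64.
    user_pass = "%s:%s" % (unquote(parts.username), unquote(parts.password or ""))
    creds = base64.b64encode(user_pass.encode("latin-1"))
    return proxy_url, b"Basic " + creds


proxy_url, auth_header = split_proxy("http://username:password@192.168.11.11:9999/")
```

The first value is what the middleware would store in `request.meta['proxy']`, and the second is what it would put in the `Proxy-Authorization` header.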
           

After writing the class, register it in the settings file:

DOWNLOADER_MIDDLEWARES = {
    'spider.xxx.XXProxyMiddleware': 543,
}
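The number in this setting is a priority. As far as I understand it, Scrapy merges this dict with its built-in defaults and calls `process_request` in ascending order, so 543 places the custom middleware ahead of the built-in `HttpProxyMiddleware` (750 in the default settings). The sketch below is a simplified model of that ordering, not Scrapy's actual code: `request_order` is a hypothetical helper, and only two of the default entries are listed.

```python
# Excerpt of Scrapy's default downloader middlewares (only two entries shown).
BASE_EXCERPT = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# The project setting from above ('spider.xxx' is a placeholder module path).
CUSTOM = {
    'spider.xxx.XXProxyMiddleware': 543,
}


def request_order(base, custom):
    """Hypothetical helper: order in which process_request() would be called."""
    merged = dict(base)
    merged.update(custom)
    # A value of None disables a middleware.
    enabled = {path: prio for path, prio in merged.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


order = request_order(BASE_EXCERPT, CUSTOM)
```

Setting a built-in entry to `None` in `DOWNLOADER_MIDDLEWARES` removes it from the chain, which is one way to stop the default proxy middleware from running alongside a custom one.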

Reposted from: https://www.cnblogs.com/xyhh/p/10860873.html