Chapter 16: Scrapy Login and Middleware


Table of Contents

  • 1. Handling cookies in Scrapy
    • 1. Copying cookies directly from the browser
    • 2. Obtaining cookies via the login flow
  • 2. Middleware
      • 1. Request middleware
      • 2. Setting the User-Agent in the settings file
      • 3. Configuring a proxy via middleware
      • 4. Using Selenium to fetch page content

1. Handling cookies in Scrapy

1. Copying cookies directly from the browser

The cookies argument of scrapy.Request() must be a dict, so a cookie string copied from the browser has to be converted first.

def start_requests(self):
    cookie_str = "GUID=**702477247"
    cookie_dic = {}
    cookie_lst = cookie_str.split("; ")
    for it in cookie_lst:
        if "https://" in it:
            # The value itself contains "=", so replace only the first one
            it_cop = it.replace("=", "|", 1)
            k, v = it_cop.split("|")
            cookie_dic[k.strip()] = v.strip()
        else:
            k, v = it.split("=")
            cookie_dic[k.strip()] = v.strip()
    head = {
        "Referer": "https://user.17k.com/www/bookshelf/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    yield scrapy.Request(
        url=self.start_urls[0],
        headers=head,
        cookies=cookie_dic)


2. Obtaining cookies via the login flow

scrapy.FormRequest() submits a POST request. Once the login succeeds, Scrapy's built-in cookies middleware carries the session cookies along on subsequent requests automatically.


def start_requests(self):
    # Submit the login form as a POST request
    url = "https://passport.17k.com/ck/user/login"
    yield scrapy.FormRequest(
        url=url,
        formdata={
            "loginName": "***",
            "password": "***"
        },
        callback=self.parse
    )

def parse(self, response, **kwargs):
    # The session cookie is kept automatically; now fetch the bookshelf page
    yield scrapy.Request(
        url=self.start_urls[0],
        callback=self.get_shujia
    )

def get_shujia(self, resp):
    print(resp.text)

2. Middleware

  1. DownloaderMiddleware
        Downloader middleware sits between the engine and the downloader: after the engine receives a request object it hands it to the downloader, and downloader middleware can hook in between the two.
  2. SpiderMiddleware
        Spider middleware sits between the engine and the spider.
class XiaoshuoDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        '''
        :param request: the current request
        :param response: the response to that request
        :param spider: the spider that issued the request
        :return:
        '''
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


1. Request middleware

process_request is called automatically before the engine hands the request to the downloader.
param request: the current request
param spider: the spider that issued the request
Note that the return value of process_request is constrained (see the sketch after this list):

  1. If it returns None, the request is not intercepted and continues on through the remaining middleware.
  2. If it returns a Request, the remaining middleware are skipped and the request is handed back to the engine, which re-queues it with the scheduler.
  3. If it returns a Response, the remaining middleware are skipped and the response is handed to the engine, which passes it to the spider for data parsing.
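
A minimal sketch of the three outcomes (the meta flags here are hypothetical, purely for illustration):

from scrapy.http import HtmlResponse

def process_request(self, request, spider):
    if request.meta.get("use_cached"):  # hypothetical flag
        # Case 3: returning a Response skips downloading; the response
        # goes straight back through the engine to the spider
        return HtmlResponse(url=request.url, body="<html></html>",
                            request=request, encoding="utf-8")
    if request.meta.get("rebuild"):  # hypothetical flag
        # Case 2: returning a Request hands it back to the engine,
        # which re-queues it with the scheduler
        return request.replace(dont_filter=True)
    # Case 1: returning None lets the request continue down the chain
    return None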

2. Setting the User-Agent in the settings file

# Set USER_AGENT in the settings file
# USER_AGENT = "xiaoshuo (+http://www.yourdomain.com)"

  • Create the User-Agents as a list (a sample definition is sketched below)
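The middleware below imports USER_AGENT_LIST from settings, so the list has to be defined there first. A minimal sketch (the UA strings are just sample values):

# settings.py: assumed definition of USER_AGENT_LIST (sample values)
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]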
# 导包
from xiaoshuo.settings import USER_AGENT_LIST
from random import choice



def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    # Pick a random User-Agent for this request
    UA = choice(USER_AGENT_LIST)
    request.headers['User-Agent'] = UA
    return None
# Enable the middleware in the settings file
DOWNLOADER_MIDDLEWARES = {
    "xiaoshuo.middlewares.XiaoshuoDownloaderMiddleware": 543,
}

3. Configuring a proxy via middleware

Proxy provider site: https://www.kuaidaili.com/

Free proxies

# Configure the proxy list in the settings file
PROXY_IP_LIST = [
    "27.154.6.110:20714",
    "115.219.1.53:20231"
]
# Imports
from xiaoshuo.settings import PROXY_IP_LIST
from random import choice

def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    ip = choice(PROXY_IP_LIST)
    request.meta['proxy'] = "https://" + ip
    return None



Setting up a tunnel proxy

# Kuaidaili's site provides sample Scrapy middleware source code
from w3lib.http import basic_auth_header

def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    proxy = "tps138.kdlapi.com:15919"
    request.meta['proxy'] = f"http://{proxy}"
    # Username and password generated for the tunnel
    request.headers['Proxy-Authorization'] = basic_auth_header('user', 'pwd')
    request.headers["Connection"] = "close"

4. Using Selenium to fetch page content

Use Selenium to log in and obtain the cookies; a minimal sketch of that step follows.
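
A sketch of the Selenium login step (the login URL and form selectors are hypothetical; get_cookies() is Selenium's real API):

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

web = Chrome()
web.get("https://passport.17k.com/login/")  # hypothetical login page URL
web.find_element(By.NAME, "loginName").send_keys("***")  # selectors are assumptions
web.find_element(By.NAME, "password").send_keys("***")
web.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
# Convert Selenium's cookie list into the dict that scrapy.Request expects
cookie_dic = {c["name"]: c["value"] for c in web.get_cookies()}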

The built-in downloader middlewares start at priority 100.
To have our middleware take over from the built-in ones, give it priority 99 so it runs first.


DOWNLOADER_MIDDLEWARES = {
   "xiaoshuo.middlewares.XiaoshuoDownloaderMiddleware": 99,
}
def process_request(self, request, spider):
    # Every request passes through here.
    # Decide whether this request needs to be handled by Selenium.
    # If so, drive Selenium and return a response built from the page source.


Create a new request.py file to handle Selenium requests

from scrapy import Request

# A custom Selenium request: it subclasses scrapy's Request and acts
# purely as a marker the middleware can detect with isinstance()
class SeleniumRequest(Request):
    pass
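
In the spider, yield a SeleniumRequest wherever the page needs a real browser. A minimal usage sketch (the import path assumes request.py sits inside the xiaoshuo package):

# In the spider; the import path is an assumption
from xiaoshuo.request import SeleniumRequest

def start_requests(self):
    # This request will be picked out by the middleware and rendered by Selenium
    yield SeleniumRequest(url=self.start_urls[0], callback=self.parse)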

In the middleware, check whether the request is a SeleniumRequest


# Imports
from scrapy.http.response.html import HtmlResponse
from selenium.webdriver import Chrome
from xiaoshuo.request import SeleniumRequest  # the request.py module created above

def process_request(self, request, spider):
    # Every request passes through here, so first decide whether
    # this one needs to be handled by Selenium
    if isinstance(request, SeleniumRequest):
        # Drive Selenium and grab the rendered page source
        self.web.get(request.url)
        page_source = self.web.page_source
        # Wrap the rendered page in a response object
        return HtmlResponse(
            url=request.url,
            status=200,
            body=page_source,
            request=request,
            encoding='utf-8'
        )
    else:
        return None

def spider_opened(self, spider):
    # Launch one Chrome instance when the spider starts
    self.web = Chrome()

Registering custom steps in the middleware

# Modify the from_crawler method
@classmethod
def from_crawler(cls, crawler):
    # This method is used by Scrapy to create your spiders.
    s = cls()
    #                        custom step          when it runs
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    return s

'''
Signal objects marking points in the crawl lifecycle:
engine_started = object()
engine_stopped = object()
spider_opened = object()
spider_idle = object()
spider_closed = object()
spider_error = object()
request_scheduled = object()
request_dropped = object()
request_reached_downloader = object()
request_left_downloader = object()
response_received = object()
response_downloaded = object()
headers_received = object()
bytes_received = object()
item_scraped = object()
item_dropped = object()
item_error = object()
feed_slot_closed = object()
feed_exporter_closed = object()
'''
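
For example, connecting spider_closed as well lets the middleware shut the browser down when the crawl ends; a minimal sketch building on the code above:

# Sketch: also register a cleanup step for the spider_closed signal
@classmethod
def from_crawler(cls, crawler):
    s = cls()
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
    return s

def spider_closed(self, spider):
    # Quit Chrome once the spider finishes
    self.web.quit()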

