Fixing lost requests in Scrapy

news/2024/7/19 12:41:59 Tags: crawler, python

When crawling multi-page data with Scrapy, it is easy to lose requests and end up with incomplete data.

    def parse_city(self, response):
        month_urls = []
        li_list = response.xpath('/html/body/div[7]/div[1]/div[13]/div/div/ul/li/a/@href').extract()
        for li in li_list:
            # Each href ends in "YYYYMM.html", so this slice is the year and month
            day_q = li[-11:-5]
            if int(day_q) > 201600:
                # Full URL of the month page
                month_url = 'https://lishi.tianqi.com' + li
                month_urls.append(month_url)
        print(len(month_urls))
        for m_url in month_urls:
            # print(m_url)
            yield scrapy.Request(url=m_url, callback=self.parse_day)

    def parse_day(self, response):
        print(response)

Problem: of the 67 URLs, only about 50 succeed.

If settings.py sets LOG_LEVEL = 'ERROR', no message is shown even when some of the URL requests fail.
Change LOG_LEVEL = 'ERROR' to LOG_LEVEL = 'INFO' so that the log shows which URLs failed and why.

# LOG_LEVEL = 'ERROR'
LOG_LEVEL = 'INFO'
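
Besides reading the log, the failed URLs can also be collected in code by giving each request an errback. The following is a minimal sketch, not part of the original spider: the class name FailedRequestRecorder and its method names are hypothetical. It relies on Scrapy passing the errback a failure object that carries `failure.request`, and, for ignored HTTP status codes, `failure.value.response`. It reads those attributes defensively and imports nothing from Scrapy.

```python
class FailedRequestRecorder:
    """Hypothetical helper that collects (url, status) for failed requests."""

    def __init__(self):
        self.failed = []

    def record(self, failure):
        # Scrapy attaches the originating request to the failure object.
        request = getattr(failure, "request", None)
        # For HTTP errors such as 403, the exception carries the response.
        response = getattr(failure.value, "response", None)
        url = request.url if request is not None else None
        status = response.status if response is not None else None
        self.failed.append((url, status))

    def summary(self):
        # Count failures per status code; None means no response was received.
        counts = {}
        for _, status in self.failed:
            counts[status] = counts.get(status, 0) + 1
        return counts
```

In the spider this would be wired up as `scrapy.Request(url=m_url, callback=self.parse_day, errback=recorder.record)`, and `recorder.summary()` could be printed when the crawl ends to see how many of the month pages were rejected.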

In my case the error was 403 Forbidden:

scrapy.spidermiddlewares.httperror INFO: Ignoring response <403 https://lishi.tianqi.com/zhengzhou/202001.html>: HTTP status code is not handled or not allowed

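The likely cause is that the spider hit the site too many times within a short period and was flagged by its anti-crawling mechanism.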

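Solutions:
1. Lower the request rate (this is what I tried first, but it did not work very well).
2. Disguise the crawler by using a User-Agent pool and a proxy IP pool.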
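
Both solutions can be sketched in a few lines. The example below is an illustration under assumptions, not the configuration used in the original project. The middleware name RandomIdentityMiddleware, the module path myproject.middlewares, the User-Agent strings, the proxy addresses, and every numeric value are placeholders to replace with real ones. The setting names themselves (DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, and so on) are standard Scrapy settings, and `request.meta['proxy']` is the key that Scrapy's built-in proxy middleware reads.

```python
import random

# Placeholder pools: replace with real User-Agent strings and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomIdentityMiddleware:
    """Downloader middleware that picks a random User-Agent and proxy per request."""

    def __init__(self, user_agents=None, proxies=None):
        self.user_agents = list(user_agents or USER_AGENTS)
        self.proxies = list(proxies or PROXIES)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request.
        return None


# Settings for solution 1 (slower crawling) plus the hook for solution 2.
# These can go in settings.py or in the spider's custom_settings attribute.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,                  # seconds between requests (assumed value)
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # assumed value
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "RETRY_TIMES": 3,                     # assumed value
    # Retry 403 as well, so a rejected page is requested again with a new identity.
    "RETRY_HTTP_CODES": [403, 429, 500, 502, 503, 504],
    "DOWNLOADER_MIDDLEWARES": {
        # Hypothetical module path; 543 is an arbitrary mid-range priority.
        "myproject.middlewares.RandomIdentityMiddleware": 543,
    },
}
```

Retrying 403 responses is only useful together with rotating identities. Against a single blocked IP, the retries would just repeat the same rejection.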

http://www.niftyadmin.cn/n/1371808.html
