Using pyppeteer in place of selenium (to evade anti-scraping detection)


Pyppeteer can drive a browser just like selenium and offers the same functionality, but many anti-scraping sites can detect selenium and refuse to serve data, so we turn to pyppeteer instead.
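One reason this matters: many sites fingerprint automation through signals such as `navigator.webdriver`. Below is a minimal sketch of masking that signal with pyppeteer; the launch flags and the injected script are illustrative assumptions, not part of the original post or pyppeteer defaults.

```python
import asyncio
from pyppeteer import launch

async def main():
    # Assumption: these flags reduce common automation fingerprints;
    # what actually works depends on the target site.
    browser = await launch(
        headless=True,
        args=['--no-sandbox', '--disable-blink-features=AutomationControlled'],
    )
    page = await browser.newPage()
    # Hide navigator.webdriver before any page script runs.
    await page.evaluateOnNewDocument(
        '() => { Object.defineProperty(navigator, "webdriver", {get: () => undefined}) }'
    )
    await page.goto('http://example.com')
    print(await page.content())
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```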

The official documentation follows:

Installation

Pyppeteer requires Python 3.6+ (with experimental support for Python 3.5).

Install by pip from PyPI:

python3 -m pip install pyppeteer

Or install latest version from github:

python3 -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev

Usage

Note: when you run pyppeteer for the first time, it downloads a recent version of Chromium (~100 MB). If you would rather avoid this, run the pyppeteer-install command before running scripts that use pyppeteer.
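For example, to pre-download Chromium before any scraping runs:

pyppeteer-install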

Example: open web page and take a screenshot.

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```

Example: evaluate script on the page.

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    dimensions = await page.evaluate('''() => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }''')
    print(dimensions)
    # >>> {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```

Pyppeteer has almost the same API as puppeteer. More APIs are listed in the documentation.

Puppeteer's documentation and troubleshooting guide are also useful for pyppeteer users.

Differences between puppeteer and pyppeteer

Pyppeteer aims to be as similar to puppeteer as possible, but some differences between Python and JavaScript make that difficult.

These are the main differences between puppeteer and pyppeteer.

Keyword arguments for options

Puppeteer uses objects (dictionaries in Python) for passing options to functions/methods. Pyppeteer accepts both dictionaries and keyword arguments for options.

Dictionary style option (similar to puppeteer):

```python
browser = await launch({'headless': True})
```

Keyword argument style option (more pythonic, isn't it?):

```python
browser = await launch(headless=True)
```


Hands-on practice:
```python
import asyncio
import pyppeteer
import os

# Pin the Chromium revision that pyppeteer downloads and launches.
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'

pyppeteer.DEBUG = True

async def main():
    print("in main ")
    print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    # Collect the rendered HTML and the session cookies.
    content = await page.content()
    cookies = await page.cookies()
    # await page.screenshot({'path': 'example.png'})
    await browser.close()
    return {'content': content, 'cookies': cookies}

loop = asyncio.get_event_loop()
task = asyncio.ensure_future(main())
loop.run_until_complete(task)
print(task.result())
```
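As a side note (not from the original post): on Python 3.7+ the event-loop boilerplate at the bottom can be replaced with `asyncio.run`:

```python
# Python 3.7+ alternative to the get_event_loop()/ensure_future pattern
result = asyncio.run(main())
print(result['cookies'])
```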

Integration with scrapy

Add a downloader middleware:

```python
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
import pyppeteer
import asyncio
import os
from scrapy.http import HtmlResponse

pyppeteer.DEBUG = False


class FundscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        print("Init downloaderMiddleware use pypputeer.")
        os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
        # pyppeteer.DEBUG = False
        print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.getbrowser())
        loop.run_until_complete(task)
        # self.browser = task.result()
        print(self.browser)
        print(self.page)
        # self.page = await browser.newPage()

    async def getbrowser(self):
        self.browser = await pyppeteer.launch()
        self.page = await self.browser.newPage()
        # return await pyppeteer.launch()

    async def getnewpage(self):
        return await self.browser.newPage()

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.usePypuppeteer(request))
        loop.run_until_complete(task)
        # return task.result()
        return HtmlResponse(url=request.url, body=task.result(),
                            encoding="utf-8", request=request)

    async def usePypuppeteer(self, request):
        print(request.url)
        # page = await self.browser.newPage()
        await self.page.goto(request.url)
        content = await self.page.content()
        return content

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
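For the middleware to take effect, it must be registered in the project's settings.py. A minimal sketch, assuming the project module is named `fundscrapy` (a hypothetical path; use your own project's layout):

```python
# settings.py -- 'fundscrapy.middlewares' is a hypothetical module path
DOWNLOADER_MIDDLEWARES = {
    'fundscrapy.middlewares.FundscrapyDownloaderMiddleware': 543,
}
```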

Author: 金刚_30bf
Link: https://www.jianshu.com/p/fd9eb385a70e
Source: Jianshu (简书). Copyright belongs to the author; for reproduction in any form, please contact the author for authorization and credit the source.

Reposted from: https://www.cnblogs.com/jiabotao/p/10438733.html

