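Preface
A while ago I read several blog posts that introduce Scrapy and use it to scrape the web. On the whole they did not go deep enough. For a mature framework, blog posts alone are not sufficient, so I also read through the official documentation.
After reading it I wanted a practice project. I had recently come across an aggregator site, built by a developer in China, that indexes a certain overseas blogging platform. It lists a large number of blog addresses, and clicking a blog shows the addresses of all videos under it. The site is itself a crawler.
Downloading every video is impractical. It is enough to store the blog addresses and, when needed later, write another spider that parses the images, text, and videos under a given blog.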
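Installing Scrapy
Scrapy can be installed with pip. This exercise used Python 3.5.2 on 64-bit Windows 7, bundled with Anaconda. The official site recommends installing it as follows: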
conda install -c scrapinghub scrapy
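However, after installing it this way, startproject failed with an error. I uninstalled Scrapy with pip and reinstalled it with pip, which fixed the problem. I don't know the exact cause.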
Starting the project
Open cmd in the directory where the project should live and run the following command (XXX is the project name):
scrapy startproject XXX
Writing the item
The site's structure is simple: each page yields 30 blog addresses. items.py therefore only needs a single container for the data:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class XXXItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Blogs_per_page = scrapy.Field()
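Writing the pipeline
Installing MongoDB
MongoDB website: https://www.mongodb.com/
Download page: https://www.mongodb.com/download-center#community
After downloading, run the installer. Then go to C:\Program Files\MongoDB\Server\3.2\bin, open cmd there, and run mongod. It prints the following: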
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Program Files\MongoDB\Server\3.2\bin>mongod
2016-10-11T12:36:54.234+0800 I CONTROL [main] Hotfix KB2731284 or later update is not installed, will zero-out data files
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] MongoDB starting : pid=33256 port=27017 dbpath=C:\data\db\ 64-bit host=CJPC160816-051
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] db version v3.2.10
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1t-fips 3 May 2016
2016-10-11T12:36:54.236+0800 I CONTROL [initandlisten] allocator: tcmalloc
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] modules: none
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] build environment:
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] distmod: 2008plus-ssl
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] distarch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] target_arch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL [initandlisten] options: {}
2016-10-11T12:36:54.239+0800 I - [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-10-11T12:36:54.241+0800 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-10-11T12:36:55.115+0800 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-10-11T12:36:55.115+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory 'C:/data/db/diagnostic.data'
2016-10-11T12:36:55.147+0800 I NETWORK [initandlisten] waiting for connections on port 27017
This shows that MongoDB is running and listening on local port 27017.
The pipelines.py file is as follows:
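If the console window is not at hand, the same thing can be checked from Python. The following is only a minimal sketch using the standard library; the helper name port_is_open is made up for this illustration, and it only tests that something accepts TCP connections on the port, not that it is actually MongoDB.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable
        return False


if __name__ == "__main__":
    # 27017 is the port reported in the mongod log
    print(port_is_open("localhost", 27017))
```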
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
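import pymongo

from .items import XXXItem


class XXXPipeline(object):
    def __init__(self):
        # Connect to the local mongod instance
        client = pymongo.MongoClient("localhost", 27017)
        db = client["XXX"]
        self.blogs = db["Blogs"]

    def process_item(self, item, spider):
        if isinstance(item, XXXItem):
            # insert() is deprecated in PyMongo 3 and removed in PyMongo 4
            self.blogs.insert_one(dict(item))
        # Return the item so that any later pipeline still receives it
        return item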
Writing the spider
Now for the spider itself. Create a file named XXX_spider.py in the XXX/spiders directory. The code is below; see the comments for explanations:
# -*- coding: utf-8 -*-
import scrapy

from XXX.items import XXXItem


class XXXSpider(scrapy.Spider):
    # Unique name of this spider
    name = 'XXX_spider'
    # URL template for the listing pages; only the page number changes
    page_url = 'http://www.XXX.com/blogs.html?page=%d&name='

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Number of the last page to request
        self.page_count = 100

    def start_requests(self):
        '''Generate one request per listing page.'''
        for page in range(1, self.page_count + 1):
            # Build the request and set its callback;
            # meta passes the page number on to the callback
            yield scrapy.Request(url=self.page_url % page,
                                 meta={'count': page},
                                 callback=self.parse)

    def parse(self, response):
        # meta carries the values set when the request was built
        print('Starting page {0:d}'.format(response.meta['count']))
        # Parse the page with XPath
        blogs_per_page_list = response.xpath(
            '//*[@id="amz-main"]/div/div/table/tbody/tr/td/a/span/text()').extract()
        # Instantiate the item that holds the data
        XXX_item = XXXItem()
        XXX_item['Blogs_per_page'] = blogs_per_page_list
        # Yield the item so the pipeline can process it
        yield XXX_item
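Parsing the page uses XPath, which is faster than bs4. Tutorials can be found on w3c.
Chrome makes it easy to obtain an XPath: select the node, right-click, and copy its XPath. The copied XPath usually points at one specific node and contains numeric indexes in square brackets. To match all nodes of the same kind, remove those indexed brackets and their contents.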
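The bracket-stripping step can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and both the markup and the helper name strip_positional_predicates are made up for the example. Only numeric predicates such as [2] are removed; attribute predicates such as [@id="amz-main"] are kept.

```python
import re
import xml.etree.ElementTree as ET


def strip_positional_predicates(xpath):
    """Remove numeric predicates like [3]; keep ones like [@id="x"]."""
    return re.sub(r'\[\d+\]', '', xpath)


# Hypothetical markup standing in for the real listing page
html = ('<table>'
        '<tr><td><a><span>blog-a</span></a></td></tr>'
        '<tr><td><a><span>blog-b</span></a></td></tr>'
        '</table>')
root = ET.fromstring(html)

# What a browser's "copy XPath" might give for the second row
copied = './tr[2]/td[1]/a/span'
generalized = strip_positional_predicates(copied)

one = [e.text for e in root.findall(copied)]
all_names = [e.text for e in root.findall(generalized)]

print(generalized)
print(one)
print(all_names)
```

The copied expression matches a single row, while the generalized one matches every row with the same structure.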
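Setting up the middleware
To keep the spider from being banned, there are three common approaches:
- Lower the crawl rate
- Set the User-Agent to disguise the spider as a browser
- Use proxy IPs
The last two are implemented as downloader middlewares in middlewares.py: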
import random

from XXX.settings import USER_AGENT_LIST
from XXX.settings import PROXY_LIST


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for each request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers["User-Agent"] = ua


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        ip1 = 'http://' + random.choice(PROXY_LIST)
        request.meta['proxy'] = ip1
Writing the settings file
Only the settings used in this exercise are listed here; see the documentation for the rest.
# -*- coding: utf-8 -*-
# Scrapy settings for tumblr_get project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XXX'
SPIDER_MODULES = ['XXX.spiders']
NEWSPIDER_MODULE = 'XXX.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1.0
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'XXX.middlewares.RandomUserAgentMiddleware': 300,
    'XXX.middlewares.ProxyMiddleware': 310,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}
LOG_LEVEL = 'INFO'
USER_AGENT_LIST = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)']
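# One "ip:port" entry per line; skip blank lines and close the file afterwards
with open('ip.txt', 'r') as f:
    PROXY_LIST = [line.strip() for line in f if line.strip()]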
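The User-Agent strings were found online. There are many of them, so just pick a few. Free proxy IPs can also be found online, but frankly they are unreliable and requests through them sometimes fail. To use proxies seriously, you also need code that validates them and keeps fetching fresh ones.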
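The validation step might look roughly like the sketch below. It is not part of the project above: the function names are made up, the test URL http://example.com/ is an arbitrary placeholder, and proxies are assumed to be plain HTTP proxies given as "ip:port" strings, as in ip.txt. Only the standard library is used.

```python
import http.client
import urllib.request


def proxy_works(proxy, test_url="http://example.com/", timeout=5.0):
    """Return True if test_url can be fetched through the HTTP proxy."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused, timed out, bad gateway, malformed reply, and so on
        return False


def filter_proxies(proxies, check=proxy_works):
    """Keep only the proxies for which check(proxy) returns True."""
    return [p for p in proxies if check(p)]
```

Running filter_proxies over the entries read from ip.txt before the crawl starts would drop dead proxies up front. Fetching fresh proxies periodically would still need separate code.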
Finally, a launcher script
Create begin.py in the project directory:
from scrapy import cmdline
cmdline.execute("scrapy crawl XXX_spider".split())
Crawl results
The crawl ran from 17:22:24 to 22:35:58 and fetched 7719 pages, for a total of 231570 blog names.
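Summary
A few points are worth noting:
- The actual crawl time is shorter than the figures above suggest. I went out for dinner partway through and the computer went to sleep, so about 50 minutes should be subtracted.
- The site has more than 7719 pages of data. At some points the remote server seemed to have problems, requests failed, and no data came back.
- To avoid being banned, the crawl was deliberately slow. With proxy IPs and a higher request rate, the time would drop considerably.
- The site is fairly simple: every request is a GET, and no simulated login was needed.
Attached is a diagram of the Scrapy workflow (source: the Internet):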
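The figures above can be checked with a few lines of arithmetic. This sketch uses only the timestamps and counts reported in this post, plus the estimated 50 minutes of sleep time, to work out the average time per page.

```python
from datetime import datetime, timedelta

start = datetime.strptime("17:22:24", "%H:%M:%S")
end = datetime.strptime("22:35:58", "%H:%M:%S")
pages = 7719
blogs_per_page = 30

wall_clock = end - start
# Roughly 50 minutes were lost while the computer was asleep (an estimate)
active = wall_clock - timedelta(minutes=50)

total_blogs = pages * blogs_per_page
seconds_per_page = active.total_seconds() / pages

print("wall-clock time:", wall_clock)
print("active time:", active)
print("blog names:", total_blogs)
print("seconds per page: %.2f" % seconds_per_page)
```

This comes out to roughly two seconds per page, which would be consistent with a DOWNLOAD_DELAY of 1.0 plus the time the server takes to respond.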