Scrapy Learning (1)

Tags: python, crawler, database

Steps to develop a simple crawler with Scrapy:

1. Create the project

Running the scrapy startproject command generates the project skeleton shown below.
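A minimal sketch of the command and the layout Scrapy generates, assuming the project is named BookSpider (which matches the imports used in the later steps):

scrapy startproject BookSpider

BookSpider/
    scrapy.cfg            # deploy configuration
    BookSpider/           # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/          # the spiders live here (step 3)
            __init__.py

The spider skeleton itself can then be generated with scrapy genspider Books biqugex.com, or simply written by hand as in step 3.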

2. Edit the items file: this is where you declare the fields you want to scrape, i.e. everything you are interested in.

import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # book title
    book_name = scrapy.Field()
    # author
    author = scrapy.Field()
    # category
    book_sort = scrapy.Field()
    # status
    book_status = scrapy.Field()
    # word count
    book_size = scrapy.Field()
    # last update time
    book_update = scrapy.Field()
    # latest chapter
    last_chapter = scrapy.Field()
    # book introduction
    book_intro = scrapy.Field()
    # chapter name
    chapter_name = scrapy.Field()
    # chapter URL
    chapter_url = scrapy.Field()
    # chapter content
    chapter_content = scrapy.Field()
    # chapter number, used to keep chapters in order
    chapter_num = scrapy.Field()
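A scrapy.Item behaves much like a dict: fields are set and read with string keys, and assigning a field that was not declared above raises a KeyError. A quick illustration, not part of the project files:

from BookSpider.items import BookspiderItem

item = BookspiderItem()
item['book_name'] = 'Example Title'
item['author'] = 'Example Author'
print(item['book_name'])   # 'Example Title'
print(dict(item))          # an Item converts cleanly to a plain dict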

3. Write the spider, i.e. the code that actually crawls each page

# -*- coding: utf-8 -*-
import scrapy
from BookSpider.items import BookspiderItem


class BooksSpider(scrapy.Spider):
    name = 'Books'
    allowed_domains = ['biqugex.com']
    # start_urls = ['http://biqugex.com/']

    def start_requests(self):
        # book pages are numbered: book_1, book_2, ...
        url = 'https://www.biqugex.com/book_{0}'
        for i in range(1, 2):
            yield scrapy.Request(url.format(i), callback=self.parse)

    def parse(self, response):
        # book metadata lives in the "info" block; each <span> reads like "作者：xxx",
        # so split on the full-width colon and keep the value part
        books = response.xpath('//div[@class="info"]')
        book_name = books.xpath('./h2/text()').extract()[0]
        author = books.xpath('./div[@class="small"]/span[1]/text()').extract()[0].split('：', 2)[1]
        book_sort = books.xpath('./div[@class="small"]/span[2]/text()').extract()[0].split('：', 2)[1]
        book_status = books.xpath('./div[@class="small"]/span[3]/text()').extract()[0].split('：', 2)[1]
        book_size = books.xpath('./div[@class="small"]/span[4]/text()').extract()[0].split('：', 2)[1]
        book_update = books.xpath('./div[@class="small"]/span[5]/text()').extract()[0].split('：', 2)[1]
        last_chapter = books.xpath('./div[@class="small"]/span[6]/a/text()').extract()[0]
        book_intro = books.xpath('./div[@class="intro"]/text()').extract_first()
        # the chapter list; only the first few chapters are followed here
        chapters = response.xpath('//div[@class="listmain"]')
        urls = chapters.xpath('./dl/dd/a/@href').extract()
        for url in urls[0:4]:
            new_link = "https://www.biqugex.com" + url
            # pass the book-level fields along in meta so the chapter callback can fill the item
            yield scrapy.Request(url=new_link, meta={'book_name': book_name, 'author': author, 'book_sort': book_sort,
                                                     'book_status': book_status, 'book_size': book_size,
                                                     'book_update': book_update, 'last_chapter': last_chapter,
                                                     'book_intro': book_intro}, callback=self.detail_parse, dont_filter=False)

    def detail_parse(self, response):
        item = BookspiderItem()
        item['book_name'] = response.meta['book_name']
        item['author'] = response.meta['author']
        item['book_sort'] = response.meta['book_sort']
        item['book_status'] = response.meta['book_status']
        item['book_size'] = response.meta['book_size']
        item['book_update'] = response.meta['book_update']
        item['last_chapter'] = response.meta['last_chapter']
        item['book_intro'] = response.meta['book_intro']
        chapter_page = response.xpath('//div[@class="content"]')
        item['chapter_name'] = chapter_page.xpath('./h1/text()').extract()[0]
        # the second-to-last text node of "showtxt" holds the chapter URL wrapped in parentheses
        item['chapter_url'] = chapter_page.xpath('./div[@class="showtxt"]/text()').extract()[-2].strip().replace('(', '').replace(')', '')
        chapter_content = chapter_page.xpath('./div[@class="showtxt"]/text()').extract()[0:-3]
        item['chapter_content'] = '\n'.join(chapter_content)
        # item['chapter_num'] = filter(str.isdigit, chapter_page.xpath('./h1/text()').extract()[0])
        yield item
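Before running the full spider, the XPath expressions above can be tried out interactively with scrapy shell, here against the same book page the spider starts from (assuming the site is reachable):

scrapy shell "https://www.biqugex.com/book_1"
>>> response.xpath('//div[@class="info"]/h2/text()').extract_first()        # book title
>>> response.xpath('//div[@class="listmain"]/dl/dd/a/@href').extract()[:4]  # first few chapter links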

4. Write the pipelines file, which is mainly responsible for writing the scraped data to a file or a database

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3


class BookspiderPipeline(object):
    def __init__(self):
        # open (or create) the SQLite database and make sure the notes table exists
        self.conn = sqlite3.connect('books.db')
        self.cur = self.conn.cursor()
        self.cur.execute('create table if not exists notes('
                        + 'id integer primary key autoincrement,'
                        + 'book_name,'
                        + 'author,'
                        + 'book_sort,'
                        + 'book_status,'
                        + 'book_size,'
                        + 'book_update,'
                        + 'last_chapter,'
                        + 'book_intro,'
                        + 'chapter_name,'
                        + 'chapter_url,'
                        + 'chapter_content)')

    def close_spider(self, spider):
        # release the database resources when the spider finishes
        print("===== closing database resources =====")
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # null lets SQLite fill in the autoincrement id
        self.cur.execute('insert into notes values (null, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                         (item['book_name'], item['author'], item['book_sort'],
                          item['book_status'], item['book_size'], item['book_update'],
                          item['last_chapter'], item['book_intro'], item['chapter_name'],
                          item['chapter_url'], item['chapter_content']))
        self.conn.commit()
        # return the item so any later pipelines (and exporters) still receive it
        return item

    # def process_item(self, item, spider):
    #     print(item['book_name'])
    #     print(item['author'])
    #     print(item['book_sort'])
    #     print(item['book_status'])
    #     print(item['book_size'])
    #     print(item['book_update'])
    #     print(item['last_chapter'])
    #     print(item['book_intro'])
    #     print(item['chapter_name'])
    #     print(item['chapter_url'])
    #     print(item['chapter_content'])
    #     print(item['chapter_num'])
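After a crawl has finished, the rows written by the pipeline can be checked directly with the standard sqlite3 module; a small sketch, assuming books.db sits in the directory the spider was run from:

import sqlite3

conn = sqlite3.connect('books.db')
cur = conn.cursor()
# how many chapter rows were stored?
cur.execute('select count(*) from notes')
print(cur.fetchone()[0])
# peek at a few of them
cur.execute('select book_name, chapter_name, chapter_url from notes limit 3')
for row in cur.fetchall():
    print(row)
conn.close()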

5. Configure a few parameters in the settings file

# -*- coding: utf-8 -*-

# Scrapy settings for BookSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BookSpider'

SPIDER_MODULES = ['BookSpider.spiders']
NEWSPIDER_MODULE = 'BookSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'BookSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'BookSpider.middlewares.BookspiderSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'BookSpider.middlewares.BookspiderDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'BookSpider.pipelines.BookspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
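With the items, spider, pipeline and settings in place, the crawl is started from the project directory, using the spider name declared in BooksSpider:

scrapy crawl Books

# the items can additionally be exported to a file, e.g. JSON:
scrapy crawl Books -o books.json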

And that's it; the novel crawler is done.

Reposted from: https://www.cnblogs.com/pythoncoder/p/11421605.html


