Web Scraping, Technique 3: Scraping Used-Car Data from a Second-Hand Car Site - Second-Level Pages

2024/7/19 10:37:25  Tags: python, web scraping

With this chapter we step fully into the advanced stage. Ready?

The complete code is at the bottom of the article.

Building on what we have already learned, we now move on to the advanced stage. The main topics this time are:

Second-level page scraping, and persisting the data to MySQL

First, let's look at the task:

python">【1】爬取地址 
	某某二手车网 - 我要买车 
	https://www.某某.com/bj/buy/2】爬取目标 
	所有汽车的 汽车名称、行驶里程、排量、变 速箱、价格 
【3】爬取分析 
	*********一级页面需抓取*********** 
		1、车辆详情页的链接 
	*********二级页面需抓取*********** 
		1、汽车名称 
		2、行驶里程 
		3、排量 
		4、变速箱 
		5、价格
  • Implementation steps

Step 1: Confirm that the data we want is present in the response content

(Screenshot: the car listing data appears in the response content shown in the browser's developer tools.)
We can also see it when we view the page source:
(Screenshot: the same data is visible in the page's HTML source.)
So this data is not dynamically loaded.
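
If you want to verify this programmatically, a quick sanity check also works. This is only a sketch under my own assumptions: the marker string and the bare User-Agent header are illustrative, and in practice the site may also require the Cookie described later in this article.

import requests

# Fetch the first listing page and check whether a marker that only appears in
# the rendered car list shows up in the raw HTML. If it does, the data is served
# statically rather than loaded by JavaScript afterwards.
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://www.guazi.com/bj/buy/', headers=headers).text
print('data-scroll-track' in html)   # True -> the list is in the static HTML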

Step 2: Examine the URLs to scrape and find the pattern, starting from https://www.guazi.com/bj/buy/

Page 1: https://www.guazi.com/bj/buy/o1/#bread
Page 2: https://www.guazi.com/bj/buy/o2/#bread

Page n: https://www.guazi.com/bj/buy/o{}/#bread
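
As a tiny sketch, the listing-page URLs can be generated from this pattern with str.format():

# Build the first five listing-page URLs from the pattern above
base_url = 'https://www.guazi.com/bj/buy/o{}/#bread'
for page in range(1, 6):
    print(base_url.format(page))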

Step 3: Write the regular expressions
First-level page regex:

python"><li data-scroll-track=.*?href="(.*?)"

Second-level page regex:

python"><div class="product-textbox">.*?<h1 class="titlebox">(.*?)</h1>.*?<li class="two"><span>(.*?)</span>.*?<li class="three"><span>(.*?)</span>.*?<li class="last"><span>(.*?)</span>.*?<span class="price-num">(.*?)</span>

Note that this regex changes over time; the expression I used when I scraped the site before was different. When you run the scraper, check whether this regex still matches; if it does not, adjust it to the HTML of the page at that time.
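
One convenient way to check the regex is to run it against a small sample first. The fragment below is hand-written and simplified (an assumption, not the site's real markup); it only illustrates how re.S lets .*? match across line breaks:

import re

# A simplified, made-up detail-page fragment for testing the shape of the regex
sample = '''
<div class="product-textbox">
  <h1 class="titlebox">2018 Example Car 1.5T</h1>
  <li class="two"><span>3.2万公里</span></li>
  <li class="three"><span>1.5T</span></li>
  <li class="last"><span>自动</span></li>
  <span class="price-num">8.88</span>
</div>
'''
two_regex = ('<div class="product-textbox">.*?<h1 class="titlebox">(.*?)</h1>'
             '.*?<li class="two"><span>(.*?)</span>'
             '.*?<li class="three"><span>(.*?)</span>'
             '.*?<li class="last"><span>(.*?)</span>'
             '.*?<span class="price-num">(.*?)</span>')
pattern = re.compile(two_regex, re.S)
print(pattern.findall(sample))
# -> [('2018 Example Car 1.5T', '3.2万公里', '1.5T', '自动', '8.88')]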

Step 4: Implement the code

The overall approach is the same as before: define helper functions so we don't repeat code.

  1. Import the modules:
import requests
import re
import time
import random
  2. Define the common variables: url, headers, counters, etc.:
class GuaziSpider:
    def __init__(self):
        self.url = 'https://www.guazi.com/bj/buy/o{}/#bread'
        self.headers = {
            'Cookie': 'uuid=bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633; cityDomain=bj; ganji_uuid=4748850580491413739853; antipas=4306793i04Y993u91R5Q45485152; lg=1; track_id=154140903399362560; clueSourceCode=%2A%2300; user_city_id=12; Hm_lvt_bf3ee5b290ce731c7a4ce7a617256354=1607691171,1607691369,1607691399,1607905256; guazitrackersessioncadata=%7B%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%7D; sessionid=23660523-0a05-4211-e3ef-c618fd61b415; lng_lat=116.41119_39.89243; gps_type=1; close_finance_popup=2020-12-14; cainfo=%7B%22ca_a%22%3A%22-%22%2C%22ca_b%22%3A%22-%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22pcbiaoti%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22track_id%22%3A%22154140903399362560%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%22bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633%22%2C%22ca_city%22%3A%22bj%22%2C%22sessionid%22%3A%2223660523-0a05-4211-e3ef-c618fd61b415%22%7D; preTime=%7B%22last%22%3A1607905259%2C%22this%22%3A1605602948%2C%22pre%22%3A1605602948%7D; Hm_lpvt_bf3ee5b290ce731c7a4ce7a617256354=1607905260',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
        }

Note!

About this Cookie:
      First, the header name has no trailing 's': it is 'Cookie', not 'Cookies'.

      Second, set your own value rather than copying mine. Cookies differ between machines and sessions, so the cookie the site gives you will not be the same as mine. Here is how to find your cookie:

First open the used-car site, go to "I want to buy a car", and press F12.

(Screenshot: the browser developer tools, with the Cookie field in the request headers marked as item 3.)

As shown above, find the 'Cookie' entry (item 3 in the screenshot) and copy its entire value, from beginning to end.
The User-Agent also needs to be set to your own. You can search for one online, or simply copy the User-Agent that appears below the Cookie in the same request headers, which also works (recommended). Later on we will use a simpler way to set the User-Agent.
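
One simpler option, for example, is a random User-Agent generator. A minimal sketch, assuming the third-party fake_useragent package is installed (pip install fake-useragent); the Cookie still has to come from your own browser:

from fake_useragent import UserAgent

# Pick a random, realistic User-Agent string instead of hard-coding one
ua = UserAgent()
headers = {'User-Agent': ua.random}
print(headers)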

  3. Response function: fetch the HTML
def get_html(self, url):
    html = requests.get(url=url, headers=self.headers).content.decode('utf-8', 'ignore')
    # For multi-level pages, or whenever requests are sent from several helper functions, we simply return the html for the caller to parse
    return html
  4. Parse the pages with the regex and extract the data
def parse_html(self, one_url):
    """Crawling logic: parse a first-level listing page"""
    one_html = self.get_html(url=one_url)
    one_regex = '<li data-scroll-track=.*?href="(.*?)"'
    href_list = self.re_func(regex=one_regex, html=one_html)
    for href in href_list:
        two_url = 'https://www.guazi.com' + href
        # Scrape the detail data for one car
        self.get_one_car_info(two_url)
        # Throttle the scraping rate
        time.sleep(random.uniform(0, 1))
  5. Get the detail data for one car
def get_one_car_info(self, two_url):
    # Name, mileage, displacement, transmission, price
    two_html = self.get_html(url=two_url)
    two_regex = '<div class="product-textbox">.*?<h1 class="titlebox">(.*?)</h1>.*?<li class="two"><span>(.*?)</span>.*?<li class="three"><span>(.*?)</span>.*?<li class="last"><span>(.*?)</span>.*?<span class="price-num">(.*?)</span>'
    car_info_list = self.re_func(regex=two_regex, html=two_html)
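    # (Added note, not in the original code:) if the regex no longer matches the
    # current page markup, findall() returns an empty list and car_info_list[0][0]
    # below raises IndexError, so a guard like `if not car_info_list: return` helps.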
    # Pull out the individual fields
    item = {}
    item['name'] = car_info_list[0][0].strip().split('\r\n')[0].strip()
    item['km'] = car_info_list[0][1].strip()
    item['displace'] = car_info_list[0][2].strip()
    item['type'] = car_info_list[0][3].strip()
    item['price'] = car_info_list[0][4].strip()
    print(item)
  6. Entry function that drives the overall flow
def run(self):
    for o in range(1, 6):
        one_url = self.url.format(o)
        self.parse_html(one_url=one_url)

Finally, here is the complete code (without database storage):

python">import requests
import re
import time
import random


class GuaziSpider:
    def __init__(self):
        self.url = 'https://www.guazi.com/bj/buy/o{}/#bread'
        self.headers = {
            'Cookie': 'uuid=bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633; cityDomain=bj; ganji_uuid=4748850580491413739853; antipas=4306793i04Y993u91R5Q45485152; lg=1; track_id=154140903399362560; clueSourceCode=%2A%2300; user_city_id=12; Hm_lvt_bf3ee5b290ce731c7a4ce7a617256354=1607691171,1607691369,1607691399,1607905256; guazitrackersessioncadata=%7B%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%7D; sessionid=23660523-0a05-4211-e3ef-c618fd61b415; lng_lat=116.41119_39.89243; gps_type=1; close_finance_popup=2020-12-14; cainfo=%7B%22ca_a%22%3A%22-%22%2C%22ca_b%22%3A%22-%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22pcbiaoti%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22track_id%22%3A%22154140903399362560%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%22bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633%22%2C%22ca_city%22%3A%22bj%22%2C%22sessionid%22%3A%2223660523-0a05-4211-e3ef-c618fd61b415%22%7D; preTime=%7B%22last%22%3A1607905259%2C%22this%22%3A1605602948%2C%22pre%22%3A1605602948%7D; Hm_lpvt_bf3ee5b290ce731c7a4ce7a617256354=1607905260',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
        }

    def get_html(self, url):
        """请求功能函数: 获取html"""
        html = requests.get(url=url, headers=self.headers).content.decode('utf-8', 'ignore')

        return html

    def re_func(self, regex, html):
        """解析功能函数: 正则解析得到列表"""
        pattern = re.compile(regex, re.S)
        r_list = pattern.findall(html)

        return r_list

    def parse_html(self, one_url):
        """爬虫逻辑函数"""
        one_html = self.get_html(url=one_url)
        one_regex = '<li data-scroll-track=.*?href="(.*?)"'
        href_list = self.re_func(regex=one_regex, html=one_html)
        for href in href_list:
            two_url = 'https://www.guazi.com' + href
            # Scrape the detail data for one car
            self.get_one_car_info(two_url)
            # Throttle the scraping rate
            time.sleep(random.uniform(0, 1))

    def get_one_car_info(self, two_url):
        """获取一辆汽车的具体数据"""
        # 名称、行驶里程、排量、变速箱、价格
        two_html = self.get_html(url=two_url)
        two_regex = '<div class="product-textbox">.*?<h1 class="titlebox">(.*?)</h1>.*?<li class="two"><span>(.*?)</span>.*?<li class="three"><span>(.*?)</span>.*?<li class="last"><span>(.*?)</span>.*?<span class="price-num">(.*?)</span>'
        car_info_list = self.re_func(regex=two_regex, html=two_html)
        # Pull out the individual fields
        item = {}
        item['name'] = car_info_list[0][0].strip().split('\r\n')[0].strip()
        item['km'] = car_info_list[0][1].strip()
        item['displace'] = car_info_list[0][2].strip()
        item['type'] = car_info_list[0][3].strip()
        item['price'] = car_info_list[0][4].strip()
        print(item)

    def run(self):
        for o in range(1, 6):
            one_url = self.url.format(o)
            self.parse_html(one_url=one_url)


if __name__ == '__main__':
    spider = GuaziSpider()
    spider.run()

Feel free to run it yourself and see.
Next comes data storage. Once the data has been scraped, we usually want to persist it, for example to a database or a CSV file. Let's start with MySQL storage.
Here is a rough outline of the MySQL storage steps:

Outline: store the data in the database with the execute() method
python">
import pymysql    # 在Python中导入pymysql模块,我们使用Python连接数据库

# 在 __init__() 中连接数据库并创建游标对象
# __init__(self):
	self.db = pymysql.connect('IP',... ...)
	self.cursor = self.db.cursor()
	
# 在 save_html() 中将所抓取的数据处理成列表,使用execute()方法写入
# save_html(self,r_list):
	self.cursor.execute('sql',[data1])
	self.db.commit()
	
# 在run() 中等数据抓取完成后关闭游标及断开数据库连接
# run(self):
	self.cursor.close()
	self.db.close()

Important: if you want MySQL storage, read this first!

Of course, we first need to create a database and table in MySQL to hold the data scraped in this run. The SQL below matches the table the code writes into; you can copy and paste it directly:
create database guazidb charset utf8;
use guazidb;
create table guazitab(
name varchar(200),
km varchar(100),
displace varchar(100),
type varchar(100),
price varchar(100)
)charset=utf8;
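
Before wiring it into the spider, here is a minimal end-to-end sketch of the execute() pattern, assuming a local MySQL server with user root and password 123456 (as in the code below) and the guazidb database created above; adjust the credentials to your own setup:

import pymysql

# Connect to the local MySQL server and the guazidb database created above
db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='guazidb', charset='utf8')
cursor = db.cursor()

# Parameterized insert: pymysql substitutes the %s placeholders safely
ins = 'insert into guazitab values(%s,%s,%s,%s,%s)'
cursor.execute(ins, ['Test Car', '1.0万公里', '1.5T', '自动', '9.99'])
db.commit()

cursor.close()
db.close()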

Now here is the complete code rewritten for MySQL storage. As before, replace the Cookie and User-Agent with the ones from your own browser:

python">import requests
import re
import time
import random
import pymysql

class GuaziSpider:
    def __init__(self):
        self.url = 'https://www.guazi.com/bj/buy/o{}/#bread'
        self.headers = {
            'Cookie': 'uuid=bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633; cityDomain=bj; ganji_uuid=4748850580491413739853; antipas=4306793i04Y993u91R5Q45485152; lg=1; track_id=154140903399362560; clueSourceCode=%2A%2300; user_city_id=12; Hm_lvt_bf3ee5b290ce731c7a4ce7a617256354=1607691171,1607691369,1607691399,1607905256; guazitrackersessioncadata=%7B%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%7D; sessionid=23660523-0a05-4211-e3ef-c618fd61b415; lng_lat=116.41119_39.89243; gps_type=1; close_finance_popup=2020-12-14; cainfo=%7B%22ca_a%22%3A%22-%22%2C%22ca_b%22%3A%22-%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22pcbiaoti%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22%25e7%2593%259c%25e5%25ad%2590%25e4%25ba%258c%25e6%2589%258b%25e8%25bd%25a6%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22track_id%22%3A%22154140903399362560%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%22bcef00ae-5a0b-4b1e-9afb-6f5a01e7a633%22%2C%22ca_city%22%3A%22bj%22%2C%22sessionid%22%3A%2223660523-0a05-4211-e3ef-c618fd61b415%22%7D; preTime=%7B%22last%22%3A1607905259%2C%22this%22%3A1605602948%2C%22pre%22%3A1605602948%7D; Hm_lpvt_bf3ee5b290ce731c7a4ce7a617256354=1607905260',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
        }
        self.db = pymysql.connect(host='localhost', user='root', password='123456', database='guazidb', charset='utf8')
        self.cur = self.db.cursor()

    def get_html(self, url):
        """请求功能函数: 获取html"""
        html = requests.get(url=url, headers=self.headers).content.decode('utf-8', 'ignore')

        return html

    def re_func(self, regex, html):
        """解析功能函数: 正则解析得到列表"""
        pattern = re.compile(regex, re.S)
        r_list = pattern.findall(html)

        return r_list

    def parse_html(self, one_url):
        """爬虫逻辑函数"""
        one_html = self.get_html(url=one_url)
        one_regex = '<li data-scroll-track=.*?href="(.*?)"'
        href_list = self.re_func(regex=one_regex, html=one_html)
        for href in href_list:
            two_url = 'https://www.guazi.com' + href
            # Scrape the detail data for one car
            self.get_one_car_info(two_url)
            # Throttle the scraping rate
            time.sleep(random.uniform(0, 1))

    def get_one_car_info(self, two_url):
        """获取一辆汽车的具体数据"""
        # 名称、行驶里程、排量、变速箱、价格
        two_html = self.get_html(url=two_url)
        two_regex = '<div class="product-textbox">.*?<h1 class="titlebox">(.*?)</h1>.*?<li class="two"><span>(.*?)</span>.*?<li class="three"><span>(.*?)</span>.*?<li class="last"><span>(.*?)</span>.*?<span class="price-num">(.*?)</span>'
        car_info_list = self.re_func(regex=two_regex, html=two_html)
        # Pull out the individual fields
        item = {}
        item['name'] = car_info_list[0][0].strip().split('\r\n')[0].strip()
        item['km'] = car_info_list[0][1].strip()
        item['displace'] = car_info_list[0][2].strip()
        item['type'] = car_info_list[0][3].strip()
        item['price'] = car_info_list[0][4].strip()
        print(item)

        li = [item['name'], item['km'], item['displace'], item['type'], item['price']]
        ins = 'insert into guazitab values(%s,%s,%s,%s,%s)'
        self.cur.execute(ins, li)
        self.db.commit()
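        # (Added note, not in the original code:) committing once per car is simple but
        # slow; for larger runs it is common to collect rows in a list and insert them
        # in one batch with self.cur.executemany(ins, rows) before a single commit.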


    def run(self):
        for o in range(1, 3):
            one_url = self.url.format(o)
            self.parse_html(one_url=one_url)
        # Disconnect from the database
        self.cur.close()
        self.db.close()

if __name__ == '__main__':
    spider = GuaziSpider()
    spider.run()
