Preface
Python is one of the most popular languages for web scraping and data analysis. This article introduces how to implement a web crawler in Python.
Web Crawlers
Python has a rich ecosystem of web-scraping libraries. The best known are the "three musketeers" of crawling: requests, BeautifulSoup, and Scrapy.
- requests is an HTTP library for making network requests. It can easily issue GET, POST, and other requests, and supports file uploads, SSL/TLS, and more. Here is an example of sending a GET request with requests:

```python
import requests

url = 'https://www.baidu.com'
response = requests.get(url)
print(response.text)
```
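The paragraph above also mentions POST requests and file uploads, which the GET example does not cover. A minimal sketch of both, using requests' `Request`/`prepare()` API so that nothing is actually sent over the network (the httpbin.org URL is only a placeholder endpoint, not part of the original article):

```python
import requests

# Build a POST request carrying form data and a file upload, then prepare it
# so we can inspect what requests would transmit, without sending anything.
req = requests.Request(
    'POST',
    'https://httpbin.org/post',                       # placeholder endpoint
    data={'keyword': '连衣裙'},                        # ordinary form fields
    files={'report': ('report.txt', b'demo bytes')},  # file upload
)
prepared = req.prepare()
print(prepared.method)                    # POST
print(prepared.headers['Content-Type'])   # multipart/form-data; boundary=...
```

To actually send it, pass the prepared request to a `requests.Session().send(prepared)` call, or simply use `requests.post(url, data=..., files=...)` directly.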
- BeautifulSoup is an HTML/XML parsing library that turns an HTML/XML document into Python objects, making data extraction convenient. Here is an example of parsing an HTML document with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>测试页面</title>
</head>
<body>
<h1>这是一个标题</h1>
<p>这是一个段落</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)
print(soup.h1.string)
print(soup.p.string)
```
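Beyond single-tag access like `soup.title`, the data-extraction workflow mentioned above usually relies on `find_all` to collect every matching tag. A small sketch (the HTML snippet and `data-sku` attributes are made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="products">
  <li class="item" data-sku="1001">连衣裙 A</li>
  <li class="item" data-sku="1002">连衣裙 B</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all returns every matching tag; attributes are exposed dict-style
for li in soup.find_all('li', class_='item'):
    print(li['data-sku'], li.get_text(strip=True))
```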
- Scrapy is a crawler framework for scraping data from websites at scale. It provides powerful facilities for data extraction, data processing, and data storage. Here is an example of a Scrapy spider:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        title = response.css('title::text').get()
        print(title)
```
Demo (JD women's clothing search page)

```python
import requests, re, time, random, json
from setting import *   # provides headers, headers2, base_url, tree_path, get_gen_info, get_info
from lxml import etree
from openpyxl import workbook

# 1. JD search-page URL for "连衣裙" (dresses)
url = 'https://search.jd.com/Search?keyword=%E8%BF%9E%E8%A1%A3%E8%A3%99&enc=utf-8&wq=%E8%BF%9E%E8%A1%A3%E8%A3%99&pvid=564f87d9957647e983e28aca487a50d2'
# 2. Output file name
excel_name = "保存文件名称.xlsx"

# Prepare the workbook that will receive the data
wb = workbook.Workbook()
wb.create_sheet("JD", 0)
sheet = wb["JD"]

# Fetch the search page once and collect the product-detail links
response = get_gen_info(url=url, headers=headers)
tree = etree.HTML(response)
url_list = tree.xpath(tree_path)

# One-off probe of a single detail page (scaffolding kept from the original)
if True:
    response = get_info(url=base_url.format(url_list[2]), headers=headers2)
    tree = etree.HTML(response)
    sss = tree.xpath('//ul[@class="parameter2 p-parameter-list"]/li/text()')
    sss = [i.split(':', 1) for i in sss]
    dict_detail = {k.strip(): v.strip() for k, v in dict(sss).items()}
    _ = [i for i in dict_detail.values()]

title = []
# Visit each product detail page in turn
for i in url_list:
    response = get_info(url=base_url.format(i), headers=headers2)
    tree = etree.HTML(response)
    sss = tree.xpath('//ul[@class="parameter2 p-parameter-list"]/li/text()')
    sss = [i.split(':', 1) for i in sss]
    dict_detail = {k.strip(): v.strip() for k, v in dict(sss).items()}
    if not title:
        # Use the first product's parameter names as the header row
        title = [k for k in dict_detail.keys()]
        sheet.append(title)
        print(title)
    vs = []
    for k in title:
        vs.append(dict_detail.get(k, None))
    print(vs)
    if vs:
        sheet.append(vs)   # queue the row for Excel
    wb.save(excel_name)    # save the workbook after each page
    # Random pause between requests
    print("^" * 50)
    time.sleep(random.randint(1, 4))
```
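The demo imports `headers`, `headers2`, `base_url`, `tree_path`, `get_gen_info`, and `get_info` from a local `setting` module that the article does not show. A minimal sketch of what such a module might contain; every value below is an assumption for illustration, not the author's actual file (JD detail links typically start with `//item.jd.com/...`, hence the `'https:{}'` template):

```python
import requests

# Hypothetical setting.py — real headers and XPath are not in the article.
headers = {'User-Agent': 'Mozilla/5.0'}        # headers for the search page
headers2 = {'User-Agent': 'Mozilla/5.0'}       # headers for detail pages
base_url = 'https:{}'                          # prepend scheme to //item.jd.com/... links
tree_path = '//div[@class="p-img"]/a/@href'    # assumed XPath for product links

def get_gen_info(url, headers):
    """Fetch the search page and return its HTML text."""
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = 'utf-8'
    return resp.text

def get_info(url, headers):
    """Fetch a product detail page and return its HTML text."""
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = 'utf-8'
    return resp.text
```

In practice the two fetch helpers would also need retry logic and error handling; JD additionally renders part of the search page via JavaScript, so a plain `requests.get` may not return every product link.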