[Python Web Scraping with BeautifulSoup] A Beginner-Friendly Learning Path and Reference Material

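BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from them. It offers a Pythonic way to work with loosely structured markup, which makes pulling data out of web pages straightforward. With BeautifulSoup in a scraper you can automate tasks such as fetching, extracting, cleaning, and analyzing data. Its main advantages are: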

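Handles messy HTML: much real-world HTML is malformed (unclosed or badly nested tags), and BeautifulSoup still builds a usable parse tree from it.
No advanced programming skills required: BeautifulSoup ships with many practical methods, so you can extract information from HTML without writing your own parsing logic.
Simple to use: the BeautifulSoup API is small and intuitive, and the documentation is thorough.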
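
To make the first point concrete, here is a small sketch that feeds deliberately broken markup to BeautifulSoup. The broken_html string is made up for illustration, and the exact shape of the repaired tree can differ between parsers (html.parser, lxml, html5lib), so treat this as a demonstration rather than a guarantee.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: neither <p> nor <b> is ever closed
broken_html = "<p>First paragraph<p>Second paragraph with <b>bold text"

soup = BeautifulSoup(broken_html, "html.parser")

# Both paragraphs are still found, even though the closing tags are missing
paragraphs = soup.find_all("p")
print(len(paragraphs))

# The unclosed <b> tag is closed for us, so its text can be read normally
bold_text = soup.b.get_text()
print(bold_text)
```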

The sections below walk through the common ways to use BeautifulSoup in a scraper.

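Installing BeautifulSoup

pip is the tool for installing and managing Python packages, and it makes installing Beautiful Soup easy. You do not need to configure PYTHONPATH or download the source; a single pip command is enough.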

pip install beautifulsoup4

To import Beautiful Soup in Python code, write:

from bs4 import BeautifulSoup

Preparing the web page

Before you can parse anything with BeautifulSoup, you need the page itself. You can download it directly with the requests library:

import requests

url = "https://www.baidu.com"
response = requests.get(url)

# Print the response content (raw bytes)
print(response.content)

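requests.get() sends an HTTP GET request and returns a response object. The response body holds the page's HTML source, which is available as bytes through response.content (or as decoded text through response.text).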
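
In a real scraper it usually helps to wrap the download in a small function that sets a timeout and fails loudly on HTTP errors. The sketch below is one way to do that; the function name fetch_html, the 10-second default, and the User-Agent string are all assumptions chosen for illustration, not part of the requests API or of the example above.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its body as bytes.

    The name, the default timeout, and the User-Agent value are
    illustrative assumptions.
    """
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser
    response.raise_for_status()
    return response.content
```

Calling fetch_html("https://www.example.com") would then return bytes that can be passed straight to BeautifulSoup.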

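Parsing an HTML document

Parsing an HTML document with BeautifulSoup comes down to extracting the parts you need from the page source. This usually means inspecting the page's HTML structure, identifying the tags and classes of the elements you want, and then using BeautifulSoup's search methods to pull out their content.

Parsing HTML with BeautifulSoup takes three steps:

Create a BeautifulSoup object
Locate the content you want to parse
Extract the content

Example code: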

from bs4 import BeautifulSoup

# Define a simple HTML page and create a BeautifulSoup object
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

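# Print the whole page source, nicely indented
print(soup.prettify())

# Get the page's <title> tag
print(soup.title)

# Find all paragraphs with class='title'
print(soup.find_all('p', class_='title'))

# Get the first <a> tag
print(soup.a)

# Navigate down the tree by tag name
print(soup.head.title)

# Navigate up the tree to the parent tag
print(soup.title.parent)

# Find all links and print their URLs
for link in soup.find_all('a'):
    print(link.get('href'))

# Find elements with a CSS selector
print(soup.select('body a'))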

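In this example we build a string containing HTML, create a BeautifulSoup object from it, and then use the following methods and attributes to find the elements we need:

prettify() - returns the whole page's HTML source formatted for easy reading
title - returns the page's <title> tag, markup included
find_all('tag', class_='className') - finds every tag element with class="className"
a - returns the first <a> tag
head.title - returns the <title> tag inside <head>

The example also shows how to find elements with BeautifulSoup's CSS selectors through select().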
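
Finding a tag is only half of the job; the third step, extracting content, means reading the tag's text and attributes. The sketch below shows the usual ways to do that on a shortened version of the sample page. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.find("a")

# Text inside the tag, without any markup
link_text = first_link.get_text()

# Attributes behave like dictionary keys
link_href = first_link["href"]

# class can hold several values, so it comes back as a list
link_classes = first_link["class"]

# .get() returns None for a missing attribute instead of raising KeyError
missing_attr = first_link.get("title")

# find() returns None when no tag matches, so check before using the result
no_table = soup.find("table")

print(link_text, link_href, link_classes, missing_attr, no_table)
```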

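Using selectors

BeautifulSoup's search methods and selectors let you pick out exactly the elements you need from a complex HTML document. The following examples show how to extract elements in several ways.

Find elements by tag name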

soup.find_all('a')

Find elements by class

soup.find_all('p', class_='story')

Find elements by id

soup.find_all(id='link3')

Find elements with a CSS selector

soup.select('a[href^="http://"]')
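
CSS selectors can combine tag names, classes, ids, and attribute tests in one expression. Here is a short sketch on the sample page, using select() for all matches and select_one() for the first match only. The selectors themselves are just examples.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# <a class="sister"> elements that are direct children of <p class="story">
sisters = soup.select("p.story > a.sister")
sister_ids = [a["id"] for a in sisters]

# select_one() returns the first match, or None when nothing matches
second = soup.select_one("#link2")
nothing = soup.select_one("div.missing")

# Attribute selector: href ends with "tillie"
tillie = soup.select('a[href$="tillie"]')

print(sister_ids, second.get_text(), nothing, len(tillie))
```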

Find elements by attribute

soup.find_all('p', attrs={'class': 'story'})

Find text by string

soup.find_all(string='Elsie')
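
find_all() also accepts compiled regular expressions and a limit, which helps when an exact string match is too strict. The following sketch shows a few of these variations on the three sample links; the patterns are arbitrary examples.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A regular expression matches part of an attribute value
lacie_links = soup.find_all("a", href=re.compile(r"lacie"))

# string= with a regex matches text nodes; the results are strings, not tags
names = [str(s) for s in soup.find_all(string=re.compile(r"^(El|La)"))]

# limit= stops the search after the given number of matches
first_two = soup.find_all("a", limit=2)

# find() returns the first match directly, or None if there is none
first_sister = soup.find("a", class_="sister")

print(len(lacie_links), names, len(first_two), first_sister["id"])
```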

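Practical applications

BeautifulSoup can be used for all kinds of scraping and data-collection tasks. Here are a few practical examples:

5.1 Website image downloader

This scraper finds every image on a page and downloads it. Just set the img_url variable to the address of the page you want to scrape.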

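import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

img_url = 'https://www.example.com'
output_dir = './download_images'

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

r = requests.get(img_url)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, 'html.parser')

imgs = soup.find_all('img')
print('Total images found:', len(imgs))

for img in imgs:
    img_src = img.get('src')
    if not img_src:
        continue  # skip <img> tags without a src attribute

    # urljoin keeps absolute URLs as they are and resolves relative ones
    img_src = urljoin(img_url, img_src)
    img_name = img_src.split('/')[-1]
    if not img_name:
        continue  # the URL ends with '/', so there is no file name to use

    img_path = os.path.join(output_dir, img_name)

    if not os.path.isfile(img_path):
        # Download first, so a failed request does not leave an empty file
        img_data = requests.get(img_src)
        with open(img_path, 'wb') as f:
            f.write(img_data.content)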

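This snippet first downloads the HTML document and then collects the image links from the img tags. It then loops over the image links and saves each image to a local file.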
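
Image src values come in several forms (absolute, root-relative, page-relative, protocol-relative), and plain string concatenation handles only some of them. Below is a minimal sketch that resolves a src against the page URL with the standard library and derives a file name from the URL path. The helper name resolve_image and the sample URLs are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image(page_url, src):
    """Return (absolute_url, file_name) for an <img> src value.

    resolve_image is a hypothetical helper written for this example.
    """
    absolute_url = urljoin(page_url, src)
    # Use only the path component so a query string such as ?v=3
    # does not end up in the file name
    file_name = posixpath.basename(urlparse(absolute_url).path)
    return absolute_url, file_name


page = "https://www.example.com/blog/post.html"
for src in ["images/cat.png", "/static/logo.svg?v=3", "//cdn.example.com/pic.jpg"]:
    print(resolve_image(page, src))
```

A file name can still be empty (for a src that ends in a slash) or collide with another image's name, so a fuller downloader would need to handle those cases as well.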

5.2 Website link list generator

This scraper collects the links on a web page and writes them out as a list.

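from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = 'https://www.example.com'
output_file = 'links.txt'

r = requests.get(page_url)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, 'html.parser')

links = soup.find_all('a')

with open(output_file, 'w') as f:
    for link in links:
        link_href = link.get('href')
        if not link_href:
            continue  # skip <a> tags without an href attribute
        if link_href.startswith('http'):
            f.write(link_href + '\n')
        else:
            # Resolve a relative link against the page URL
            f.write(urljoin(page_url, link_href) + '\n')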

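This snippet collects every link on the page and writes it to a text file. Links may be absolute or relative. A link that starts with 'http' or 'https' is treated as absolute and written to the file unchanged; any other link is treated as relative and combined with page_url before it is written.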
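
Real pages often repeat links and contain non-web schemes such as mailto:. As a rough sketch of how the link collection could be tightened, the function below works on an HTML string, keeps only http and https links, and removes duplicates while preserving order. The function name extract_links and the sample markup are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return unique absolute http(s) links in document order.

    extract_links is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    # href=True skips <a> tags that have no href attribute at all
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # drop mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample_html = """
<a href="/about">About</a>
<a href="docs/intro.html">Intro</a>
<a href="https://other.example.org/page">External</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About again</a>
<a>No href</a>
"""

links = extract_links(sample_html, "https://www.example.com/")
for url in links:
    print(url)
```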

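5.3 Website content searcher

This scraper searches a site's content. Set the search_terms variable to the keyword you want, and the program goes through the paragraph text of the page and of the pages it links to on the same site, returning each matching paragraph together with its URL.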

import re

import requests
from bs4 import BeautifulSoup

search_terms = 'python'
page_url = 'https://www.example.com'

r = requests.get(page_url)

soup = BeautifulSoup(r.content, 'html.parser')

matching_sections = []

# Search the paragraphs of the start page
for tag in soup.find_all('p'):
    if re.search(search_terms, str(tag), re.IGNORECASE):
        matching_sections.append({'url': page_url, 'title': tag.text})

# Follow links that point to the same site and search those pages too
for link in soup.find_all('a'):
    href = link.get('href')
    if href and page_url in href:
        r = requests.get(href)
        # Use a separate name so the start page's soup is not overwritten
        linked_soup = BeautifulSoup(r.content, 'html.parser')
        for tag in linked_soup.find_all('p'):
            if re.search(search_terms, str(tag), re.IGNORECASE):
                matching_sections.append({'url': href, 'title': tag.text})

for s in matching_sections:
    print('%s\n%s\n\n' % (s['url'], s['title']))

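This snippet goes through every paragraph on the page, follows each link that points to the same site, and checks the paragraphs there as well. Each paragraph containing the keyword is recorded with its URL and text, and the results are printed at the end.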
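
One detail worth knowing: matching against str(tag) also looks at the markup, so a keyword that appears only inside an attribute value (for example in a link's href) counts as a match. If you want to match visible text only, a possible variation is to search the result of get_text() and to escape the keyword so that characters such as + are taken literally. The helper name find_matching_paragraphs and the sample HTML below are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def find_matching_paragraphs(html, keyword):
    """Return the text of each <p> whose visible text contains keyword.

    find_matching_paragraphs is a hypothetical helper written for this example.
    """
    # re.escape treats the keyword as plain text, not as a regex pattern
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if pattern.search(text):
            matches.append(text)
    return matches


sample_html = """
<p>Learn Python in ten minutes.</p>
<p>See <a href="https://www.python.org">this site</a> for details.</p>
<p>C++ is a different language.</p>
"""

print(find_matching_paragraphs(sample_html, "python"))
print(find_matching_paragraphs(sample_html, "c++"))
```

Here the second paragraph is not reported for "python", because the word occurs only in the link target and not in the text a reader would see.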

References

Python Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Python requests documentation: https://requests.readthedocs.io/en/master/
Introduction to Python web scraping with Beautiful Soup: https://realpython.com/python-web-scraping-practical-introduction/
Use Python and Beautiful Soup to Scrape Google News: https://www.twilio.com/blog/2017/12/scrape-google-news-with-python.html