An Introduction to Python's BeautifulSoup

2024/7/19 10:41:27 · Tags: python, beautifulsoup, programming languages, web scraping

1. Introduction to BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. Working through a parser, it provides idiomatic ways to navigate, search, and modify the document tree.

BeautifulSoup does not parse documents itself; it sits on top of a parser such as Python's built-in html.parser or the third-party lxml and exposes a powerful search API over the result. Using BeautifulSoup improves both the efficiency of data extraction and the speed of crawler development.
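As a minimal sketch of the parser hand-off (assuming bs4 is installed; the HTML snippet here is invented for illustration), the second argument to the constructor names the parser to use:

```python
from bs4 import BeautifulSoup

# html.parser ships with Python; 'lxml' and 'html5lib' are optional
# third-party parsers (faster and more lenient, respectively).
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # Demo
print(soup.p.string)      # Hello
```

Whichever parser you pick, the resulting soup object exposes the same navigation API.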

2. BeautifulSoup Overview

Building the document tree

BeautifulSoup parses a document by building a tree structure, and that tree is made up of four kinds of objects.

Document-tree objects:

  • Tag: an HTML/XML tag; accessed as soup.tag; attributes: tag.name (tag name), tag.attrs (tag attributes)
  • NavigableString: the string inside a tag; accessed as soup.tag.string
  • BeautifulSoup: the entire document, which can mostly be treated like a Tag; attributes: soup.name, soup.attrs
  • Comment: a special kind of string for comments inside a tag; also accessed as soup.tag.string
from bs4 import BeautifulSoup  # the 'lxml' parser used below must be installed: pip install lxml

html =  """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# 1. The BeautifulSoup object
soup = BeautifulSoup(html,'lxml')
print(type(soup))

# 2. Tag objects
print(soup.head,'\n')
print(soup.head.name,'\n')
print(soup.head.attrs,'\n')
print(type(soup.head))

# 3. NavigableString objects
print(soup.title.string,'\n')
print(type(soup.title.string))

# 4. Comment objects
print(soup.a.string,'\n')
print(type(soup.a.string))

# 5. Pretty-print the soup object
print(soup.prettify())

Traversing the document tree

BeautifulSoup converts the document into a tree precisely because a tree structure makes it easier to traverse the content and extract what you need.

Downward traversal:

  • tag.contents: the tag's child nodes, as a list
  • tag.children: the tag's child nodes, as an iterator for looping
  • tag.descendants: all of the tag's descendant nodes, as an iterator for looping

Upward traversal:

  • tag.parent: the tag's parent node
  • tag.parents: the tag's ancestor nodes, as an iterator for looping

Sibling traversal:

  • tag.next_sibling: the next sibling node
  • tag.previous_sibling: the previous sibling node
  • tag.next_siblings: all following sibling nodes, as an iterator for looping
  • tag.previous_siblings: all preceding sibling nodes, as an iterator for looping
from bs4 import BeautifulSoup

html =  """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'html.parser')

# 1. Downward traversal
print(soup.p.contents)
print(list(soup.p.children))
print(list(soup.p.descendants))

# 2. Upward traversal
print(soup.p.parent.name,'\n')
for i in soup.p.parents:
    print(i.name)

# 3. Sibling traversal
print('a_next:',soup.a.next_sibling)
for i in soup.a.next_siblings:
    print('a_nexts:',i)
print('a_previous:',soup.a.previous_sibling)
for i in soup.a.previous_siblings:
    print('a_previouss:',i)
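One gotcha with sibling traversal: when the source HTML contains newlines between tags, .next_sibling often returns a whitespace text node rather than the next tag. A small sketch (the <ul> snippet here is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """<ul>
<li id="a">one</li>
<li id="b">two</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find(id="a")

# The newline between the two <li> tags is itself a node in the tree,
# so .next_sibling returns a whitespace string, not the next tag.
print(repr(li.next_sibling))         # '\n'
print(li.find_next_sibling()["id"])  # b  (skips over the text node)
```

find_next_sibling(), covered in the next section, skips text nodes and returns the next tag directly.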

Searching the document tree

BeautifulSoup provides a number of search methods that make it easy to pull out exactly the content we need.

Search methods:

  • soup.find_all(): finds all tags matching the criteria; returns a list-like result set
  • soup.find(): finds the first matching tag; returns a Tag object (or None)
  • soup.tag.find_parents(): searches all ancestor nodes of tag; returns a list
  • soup.tag.find_parent(): searches the parent node of tag; returns a Tag object
  • soup.tag.find_next_siblings(): searches all following sibling nodes; returns a list
  • soup.tag.find_next_sibling(): searches the next sibling node; returns a Tag object
  • soup.tag.find_previous_siblings(): searches all preceding sibling nodes; returns a list
  • soup.tag.find_previous_sibling(): searches the previous sibling node; returns a Tag object

Note that because class is a reserved keyword in Python, matching a tag's class attribute requires one of two special approaches:

  • pass the attribute as a dict through the attrs parameter
  • use BeautifulSoup's dedicated class_ keyword (note the trailing underscore)
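The two approaches are equivalent; a minimal sketch of both, on an invented two-link snippet:

```python
from bs4 import BeautifulSoup

html = ('<p><a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# Approach 1: pass the attribute as a dict via attrs
by_attrs = soup.find_all('a', attrs={'class': 'sister'})
# Approach 2: the class_ keyword (trailing underscore avoids the reserved word)
by_keyword = soup.find_all('a', class_='sister')

print([t['id'] for t in by_attrs])    # ['link1', 'link2']
print([t['id'] for t in by_keyword])  # ['link1', 'link2']
```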
from bs4 import BeautifulSoup

html =  """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'html.parser')

# 1. find_all()
print(soup.find_all('a'))  # search by tag name
print(soup.find_all('a', id='link1'))  # search by attribute value
print(soup.find_all('a', class_='sister'))
print(soup.find_all(string=['Elsie', 'Lacie']))  # string= replaces the older text= argument

# 2. find()
print(soup.find('a'))
print(soup.find(id='link2'))

# 3. Upward search
print(soup.p.find_parent().name)
for i in soup.title.find_parents():
    print(i.name)
    
# 4. Sibling search
print(soup.head.find_next_sibling().name)
for i in soup.head.find_next_siblings():
    print(i.name)
print(soup.title.find_previous_sibling())
for i in soup.title.find_previous_siblings():
    print(i.name)
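Beyond plain strings, find_all() also accepts compiled regular expressions and functions as filters, which is handy when the tags you want share only a naming pattern. A short sketch on an invented snippet:

```python
import re
from bs4 import BeautifulSoup

html = "<body><b>bold</b><i>italic</i><p>para</p></body>"
soup = BeautifulSoup(html, "html.parser")

# A compiled regular expression is matched against tag names
print([t.name for t in soup.find_all(re.compile('^b'))])  # ['body', 'b']

# A function filter receives each tag and keeps those for which it returns True
print([t.name for t in soup.find_all(lambda t: t.string == 'italic')])  # ['i']
```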

CSS selectors

BeautifulSoup supports most CSS selectors: pass a selector string to the .select() method of a Tag or BeautifulSoup object to find the matching tags.

Common HTML tags:

HTML heading: <h1> </h1> (through <h6>)
HTML paragraph: <p> </p>
HTML link: <a href='https://www.baidu.com/'> this is a link </a>
HTML image: <img src='Ai-code.jpg' width='104' height='144' />
HTML table: <table> </table>
HTML list: <ul> </ul>
HTML block: <div> </div>
from bs4 import BeautifulSoup

html =  """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'html.parser')

print('By tag:', soup.select('a'))
print('By attribute:', soup.select('a[id="link1"]'))
print('By class:', soup.select('.sister'))
print('By id:', soup.select('#link1'))
print('Combined (descendant):', soup.select('p #link1'))
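select() always returns a list; when you expect a single element, select_one() returns the first match directly, after which you can pull out attributes and text. A minimal sketch on an invented snippet:

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None), so no [0] indexing is needed
a = soup.select_one('p.story a.sister')
print(a['href'])     # http://example.com/lacie
print(a.get_text())  # Lacie
```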

Example: scraping images

import os
import requests
from bs4 import BeautifulSoup

def getUrl(url):
    try:
        read = requests.get(url)
        read.raise_for_status()
        read.encoding = read.apparent_encoding
        return read.text
    except requests.RequestException:
        print("Connection failed!")
        return None

def getPic(html):
    soup = BeautifulSoup(html, "html.parser")
    all_img = soup.find('ul').find_all('img')
    for img in all_img:
        img_url = img['src']
        print(img_url)
        root = "F:/Pic/"  # local directory to save the images into
        path = root + img_url.split('/')[-1]
        print(path)
        try:
            if not os.path.exists(root):
                os.mkdir(root)
            if not os.path.exists(path):
                read = requests.get(img_url)
                with open(path, "wb") as f:
                    f.write(read.content)
                print("File saved!")
            else:
                print("File already exists!")
        except Exception:
            print("Failed to download file!")

if __name__ == '__main__':
    html_text = getUrl("https://findicons.com/search/nature")
    if html_text:
        getPic(html_text)

