Notes on "Web Scraping with Python" (《Python网络数据采集》): BeautifulSoup

news/2024/7/19 9:05:58 Tags: python, web scraping

I. A First Look at Web Scrapers

All examples use Python 3.

A simple example:

from urllib.request import urlopen

# Fetch the page and print its raw HTML (a bytes object)
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

The urllib2 library from Python 2.x was renamed urllib in Python 3.x and split into several submodules, including urllib.request, urllib.parse, and urllib.error.
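
To make the division of labour between these submodules a little more concrete, here is a minimal sketch that runs without a network connection. It uses urllib.parse to take a URL apart and urllib.error to show how the exception classes relate. The helper name describe_url is made up for this note.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse


def describe_url(url):
    """Split a URL into the parts a scraper usually cares about."""
    parts = urlparse(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


info = describe_url("http://pythonscraping.com/pages/page1.html")
print(info)

# urljoin resolves a relative link against the page it was found on.
absolute = urljoin("http://pythonscraping.com/pages/page3.html",
                   "../img/gifts/img1.jpg")
print(absolute)

# HTTPError is a subclass of URLError, so catch HTTPError first.
print(issubclass(HTTPError, URLError))
```

urllib.request, the third submodule, is the one that actually opens the connection, as in the urlopen example above.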

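II. BeautifulSoup

1. Using BeautifulSoup

Notes: 1. Install the module with pip install beautifulsoup4

2. Make the network connection reliable by handling the exceptions the program may raise

As in the following example: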

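from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup


def getTitle(url):
    # Return None if the server answers with an HTTP error (e.g. 404)
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # Return None if the page has no <body> (bsobj.body is then None)
    try:
        # Name the parser explicitly to avoid the "no parser specified" warning
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title was not found")
else:
    print(title)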
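
The function above only catches HTTPError, which covers the case where the server answers with an error status. If the server cannot be reached at all, urlopen raises URLError instead. The following is a rough sketch of one way to cover both cases; the function name fetch and its opener parameter are hypothetical, added only so that the error paths can be tried without a real network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook (it defaults to urlopen) so that the
    error handling can be exercised offline with a stand-in function.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # The server could not be reached (bad host name, no network, ...).
        print("connection failed:", e.reason)
    return None
```

HTTPError has to be listed first: it is a subclass of URLError, so the order of the except clauses decides which message is printed. A call such as fetch("http://pythonscraping.com/pages/page1.html") would then return either the page bytes or None.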

2. A web scraper can retrieve specific content by the value of the class attribute

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")
bsobj = BeautifulSoup(html, "html.parser")

# Use findAll on the bsobj object to extract the span tags whose class attribute is "red"
contentList = bsobj.findAll("span", {"class": "red"})

for content in contentList:
    print(content.get_text())
    print('\n')
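
The same lookup can be tried without downloading anything. The sketch below parses a short, made-up HTML string (it only imitates the red and green spans of the sample page) and uses find_all, the newer spelling of findAll, with the class_ keyword.

```python
from bs4 import BeautifulSoup

# Made-up snippet that imitates the coloured spans on the sample page.
SAMPLE_HTML = """
<p><span class="red">Well, Prince</span> said
<span class="green">Anna Pavlovna</span>, greeting
<span class="red bold">the first arrival</span>.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches every tag whose class list contains "red",
# so the span with class="red bold" is included as well.
red_texts = [span.get_text() for span in soup.find_all("span", class_="red")]
print(red_texts)

# The dictionary form used earlier selects the same tags.
dict_texts = [span.get_text() for span in soup.find_all("span", {"class": "red"})]
print(dict_texts == red_texts)
```

The keyword is spelled class_ because class is a reserved word in Python.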

3. Navigating the tree

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")

# Find the child tags of the table
for child in bsobj.find("table", {"id": "giftList"}).children:
    print(child)

# Find the sibling tags that follow the first row
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

# Print the text of every h2 heading
for h2title in bsobj.findAll("h2"):
    print(h2title.get_text())

# From the image, go up to its parent cell, then back to the previous cell
print(bsobj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
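
To see what each navigation attribute returns, it can help to run it on a tiny table. The HTML below is a made-up stand-in loosely modelled on the giftList table, not the real page; it is written without whitespace between tags, so every child and sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

# Made-up table; the rows and prices are for illustration only.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th><th>Image</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td></tr>'
    "<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td>"
    '<td><img src="../img/gifts/img2.jpg"></td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields direct children only: here, the three rows.
rows = list(table.children)

# .next_siblings skips the tag itself, so the header row is left out.
data_rows = list(table.tr.next_siblings)

# From the image: up to its <td>, then back one cell to the price.
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()

print(len(rows), len(data_rows), price)
```

On a real page the markup usually contains line breaks between tags, and those show up as extra text nodes among the children and siblings, so the counts will differ from this tidy example.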

5. Regular expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")
# findAll returns a list-like ResultSet of the matching img tags (not a dictionary)
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
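
The filtering effect of the pattern is easier to see on a handful of known paths. In the sketch below the img tags are made up for illustration; only two of the four should match.

```python
import re

from bs4 import BeautifulSoup

# Made-up tags: two product images, one logo and one banner.
SAMPLE_HTML = (
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/banner.png">'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# A compiled pattern passed as an attribute value is matched against
# each tag's src; tags whose src does not match are skipped.
sources = [img["src"] for img in soup.find_all("img", {"src": pattern})]
print(sources)
```

Writing the pattern as a raw string (the r prefix) keeps Python from treating the backslashes as string escapes before the re module sees them.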

Reposted from: https://www.cnblogs.com/xiangshigang/p/7224941.html
