Python Web Scraping Basics: BeautifulSoup

news/2024/7/19 11:11:19 Tags: python, web scraping, beautifulsoup

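  • I. BeautifulSoup Basics
    • 1.1 CSS and Common Front-End Tags and Attribute Values
    • 1.2 Parsing HTML
      • 1.2.1 BeautifulSoup's find() and find_all() Functions
      • 1.2.2 Getting a Tag's Children, Siblings, and Parents
        • 1.2.2.1 Children and Other Descendants
        • 1.2.2.2 Siblings
        • 1.2.2.3 Parents
    • 1.3 Regular Expressions and BeautifulSoup
    • 1.4 Accessing Attributes
    • 1.5 Using Lambda Functions
    • 1.6 Scraping a File from a Page and Downloading It Locally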

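I. BeautifulSoup Basics

1.1 CSS and Common Front-End Tags and Attribute Values

  • Cascading Style Sheets (CSS): CSS is a language for defining presentation such as fonts, colors, and positioning; it describes how the information on a web page is formatted and displayed. Because CSS rules target elements through attributes such as class and id, those same attributes are convenient hooks for a scraper;
  • For a reference of common front-end tags and attribute values, see: https://www.cnblogs.com/blknemo/p/10553021.html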
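
Since the class attribute that CSS uses for styling is also what a scraper filters on, a small sketch may help make the connection concrete. It assumes a hypothetical inline HTML snippet (SAMPLE_HTML is not from the original page) and uses BeautifulSoup's select() method, which accepts CSS selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the class attribute is what a stylesheet would use to color the text.
SAMPLE_HTML = '<p><span class="green">Anna Pavlovna</span> said <span class="red">Heavens!</span></p>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "span.green" is a CSS selector: every <span> whose class list contains "green".
green = soup.select("span.green")
print([tag.get_text() for tag in green])
```

The same selector syntax a stylesheet would use to style the names is enough to pick them out of the document.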

1.2 Parsing HTML

  • On the page http://www.pythonscraping.com/pages/warandpeace.html, the characters' dialogue is shown in red and the characters' names in green.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, "html.parser")
# Without the "html.parser" argument, bs4 emits a GuessedAtParserWarning because no parser was specified explicitly
nameList = bsObj.find_all('span', {'class': 'green'})
# find_all on the BeautifulSoup object extracts every <span class="green"></span> tag, giving a Python list of the tags that hold the character names
print(nameList)
# Output
"""
[<span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Dowager Empress Marya Fedorovna</span>, <span class="green">the baron</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the Empress</span>, <span class="green">the Empress</span>, <span class="green">Anna Pavlovna's</span>, <span class="green">Her Majesty</span>, <span class="green">Baron
Funke</span>, <span class="green">The prince</span>, <span class="green">Anna
Pavlovna</span>, <span class="green">the Empress</span>, <span class="green">The prince</span>, <span class="green">Anatole</span>, <span class="green">the prince</span>, <span class="green">The prince</span>, <span class="green">Anna
Pavlovna</span>, <span class="green">Anna Pavlovna</span>]
"""
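for name in nameList:
    # .get_text() strips all tags from the HTML it is called on and returns a string containing only the text. If you are working with a large block of source that contains many hyperlinks, paragraphs, and other tags, .get_text() removes them all and leaves just the untagged text.
    print(name.get_text())  # name.getText() and name.text are equivalent
# Output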
"""
Anna Pavlovna Scherer
Empress Marya Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron Funke
The prince
Anna Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna Pavlovna
Anna Pavlovna
"""
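
The same steps can be tried without a network connection. The following is a minimal sketch, assuming a hypothetical inline snippet (SAMPLE_HTML) that mimics the structure of the War and Peace page; it is not the page's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the red/green span structure of the page.
SAMPLE_HTML = """
<div id="text">
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted.
<span class="green">Anna
Pavlovna</span> meditated.
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like collection of Tag objects.
green_spans = soup.find_all("span", {"class": "green"})

# get_text() drops the markup; split/join collapses the line break inside a name.
names = [" ".join(span.get_text().split()) for span in green_spans]
print(names)
```

Normalizing whitespace is a design choice of this sketch: names that wrap across lines in the source HTML otherwise keep their embedded newline.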

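1.2.1 BeautifulSoup's find() and find_all() Functions

  • BeautifulSoup's find() and find_all() functions filter an HTML page by tag name and attributes, returning either a group of matching tags or a single tag;
  • Function signatures:
    find_all(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)
    (In the bs4 API these parameters are named name, attrs, recursive, string, limit, and **kwargs.)
# Get a list of all heading tags in the HTML document
bsObj.find_all({"h1","h2","h3","h4","h5","h6"})
# Output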
"""
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
"""
# Return the span tags of both colors, red and green, in the HTML document:
bsObj.find_all("span", {"class":{"green", "red"}})
"""
[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! what a virulent attack!</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="red">First of all, dear friend, tell me how you are. Set your friend's
mind at rest,</span>, <span class="red">Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?</span>, <span class="green">Anna Pavlovna</span>, <span class="red">You are
staying the whole evening, I hope?</span>, <span class="red">And the fete at the English ambassador's? Today is Wednesday. I
must put in an appearance there,</span>, <span class="green">the prince</span>, <span class="red">My daughter is
coming for me to take me there.</span>, <span class="red">I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome.</span>, <span class="red">If they had known that you wished it, the entertainment would
have been put off,</span>, <span class="green">the prince</span>, <span class="red">Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything.</span>, <span class="red">What can one say about it?</span>, <span class="green">the prince</span>, <span class="red">What has been decided? They have decided that
Buonaparte has burnt his boats, and I believe that we are ready to
burn ours.</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="red">Oh, don't speak to me of Austria. Perhaps I don't understand
things, but Austria never has wished, and does not wish, for war.
She is betraying us! Russia alone must save Europe. Our gracious
sovereign recognizes his high vocation and will be true to it. That is
the one thing I have faith in! Our good and wonderful sovereign has to
perform the noblest role on earth, and he is so virtuous and noble
that God will not forsake him. He will fulfill his vocation and
crush the hydra of revolution, which has become more terrible than
ever in the person of this murderer and villain! We alone must
avenge the blood of the just one.... Whom, I ask you, can we rely
on?... England with her commercial spirit will not and cannot
understand the Emperor Alexander's loftiness of soul. She has
refused to evacuate Malta. She wanted to find, and still seeks, some
secret motive in our actions. What answer did Novosiltsev get? None.
The English have not understood and cannot understand the
self-abnegation of our Emperor who wants nothing for himself, but only
desires the good of mankind. And what have they promised? Nothing! And
what little they have promised they will not perform! Prussia has
always declared that Buonaparte is invincible, and that all Europe
is powerless before him.... And I don't believe a word that Hardenburg
says, or Haugwitz either. This famous Prussian neutrality is just a
trap. I have faith only in God and the lofty destiny of our adored
monarch. He will save Europe!</span>, <span class="red">I think,</span>, <span class="green">the prince</span>, <span class="red">that if you had been
sent instead of our dear <span class="green">Wintzingerode</span> you would have captured the
<span class="green">King of Prussia</span>'s consent by assault. You are so eloquent. Will you
give me a cup of tea?</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="red">In a moment. A propos,</span>, <span class="red">I am
expecting two very interesting men tonight, <span class="green">le Vicomte de Mortemart</span>,
who is connected with the <span class="green">Montmorencys</span> through the <span class="green">Rohans</span>, one of
the best French families. He is one of the genuine emigres, the good
ones. And also the <span class="green">Abbe Morio</span>. Do you know that profound thinker? He
has been received by <span class="green">the Emperor</span>. Had you heard?</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="red">I shall be delighted to meet them,</span>, <span class="green">the prince</span>, <span class="red">But tell me,</span>, <span class="red">is it true that the Dowager Empress wants Baron Funke
to be appointed first secretary at Vienna? The baron by all accounts
is a poor creature.</span>, <span class="green">Prince Vasili</span>, <span class="green">Dowager Empress Marya Fedorovna</span>, <span class="green">the baron</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the Empress</span>, <span class="red">Baron Funke has been recommended to the Dowager Empress by her
sister,</span>, <span class="green">the Empress</span>, <span class="green">Anna Pavlovna's</span>, <span class="green">Her Majesty</span>, <span class="green">Baron
Funke</span>, <span class="green">The prince</span>, <span class="green">Anna
Pavlovna</span>, <span class="green">the Empress</span>, <span class="red">Now about your family. Do you know that since your daughter came
out everyone has been enraptured by her? They say she is amazingly
beautiful.</span>, <span class="green">The prince</span>, <span class="red">I often think,</span>, <span class="red">I often think how unfairly sometimes the
joys of life are distributed. Why has fate given you two such splendid
children? I don't speak of <span class="green">Anatole</span>, your youngest. I don't like
him,</span>, <span class="green">Anatole</span>, <span class="red">Two such charming children. And really you appreciate
them less than anyone, and so you don't deserve to have them.</span>, <span class="red">I can't help it,</span>, <span class="green">the prince</span>, <span class="red">Lavater would have said I
lack the bump of paternity.</span>, <span class="red">Don't joke; I mean to have a serious talk with you. Do you know I
am dissatisfied with your younger son? Between ourselves</span>, <span class="red">he was mentioned at Her
Majesty's and you were pitied....</span>, <span class="green">The prince</span>, <span class="red">What would you have me do?</span>, <span class="red">You know I did all
a father could for their education, and they have both turned out
fools. Hippolyte is at least a quiet fool, but Anatole is an active
one. That is the only difference between them.</span>, <span class="red">And why are children born to such men as you? If you were not a
father there would be nothing I could reproach you with,</span>, <span class="green">Anna
Pavlovna</span>, <span class="red">I am your faithful slave and to you alone I can confess that my
children are the bane of my life. It is the cross I have to bear. That
is how I explain it to myself. It can't be helped!</span>, <span class="green">Anna Pavlovna</span>]
"""
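  • The recursive argument is a Boolean that controls how deep into the tag tree the search goes. With recursive=True, find_all examines all children of the tag it is called on, and the children of those children, looking for matches. With recursive=False, find_all examines only the direct children of that tag. find_all searches recursively by default (recursive defaults to True);
  • The text argument is different: it matches on a tag's text content rather than on its attributes (bs4 4.4.0 and later call this argument string and keep text as the older name). For example, to count how many times "the prince" appears as tag content on the page above, replace the earlier find_all call with the following code:
nameList = bsObj.find_all(string="the prince")  # text="the prince" is the older spelling
print(len(nameList))
# Output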
"""
7
"""
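
The effect of recursive and of text matching is easier to see on a tiny document. This is a sketch under the assumption of a hypothetical inline snippet (SAMPLE_HTML); the counts in the comments apply only to that snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one span nested inside <p>, two spans directly under <div>.
SAMPLE_HTML = (
    "<div><p>top <span>the prince</span></p>"
    "<span>the prince</span><span>Anna</span></div>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.div

# recursive=True (the default): every descendant span is considered.
all_spans = div.find_all("span")
# recursive=False: only direct children of div are considered.
direct_spans = div.find_all("span", recursive=False)

# string= matches on text content and returns the matching strings themselves.
prince_strings = soup.find_all(string="the prince")

print(len(all_spans), len(direct_spans), len(prince_strings))  # 3 2 2
```

Note that a string match is exact: the text node "top " is not returned even though it sits next to a matching span.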
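  • The limit argument applies only to find_all; find is equivalent to find_all with limit set to 1. Set it when you only care about the first x results. Note that the results come back in the order they occur on the page, so the first few are not necessarily the ones you want.
  • Finally, keyword arguments let you select tags that have a particular attribute. For example:
allText = bsObj.find_all(id="text")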
print(allText[0].get_text())
"""
"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite.

All her invitations without exception, written in French, and
delivered by a scarlet-liveried footman that morning, ran as follows:

"If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer."

"Heavens! what a virulent attack!" replied the prince, not in the
least disconcerted by this reception. He had just entered, wearing
an embroidered court uniform, knee breeches, and shoes, and had
stars on his breast and a serene expression on his flat face. He spoke
in that refined French in which our grandfathers not only spoke but
thought, and with the gentle, patronizing intonation natural to a
man of importance who had grown old in society and at court. He went
up to Anna Pavlovna, kissed her hand, presenting to her his bald,
scented, and shining head, and complacently seated himself on the
sofa.

"First of all, dear friend, tell me how you are. Set your friend's
mind at rest," said he without altering his tone, beneath the
politeness and affected sympathy of which indifference and even
irony could be discerned.

"Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?" said Anna Pavlovna. "You are
staying the whole evening, I hope?"

"And the fete at the English ambassador's? Today is Wednesday. I
must put in an appearance there," said the prince. "My daughter is
coming for me to take me there."

"I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome."

"If they had known that you wished it, the entertainment would
have been put off," said the prince, who, like a wound-up clock, by
force of habit said things he did not even wish to be believed.

"Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything."

"What can one say about it?" replied the prince in a cold,
listless tone. "What has been decided? They have decided that
Buonaparte has burnt his boats, and I believe that we are ready to
burn ours."

Prince Vasili always spoke languidly, like an actor repeating a
stale part. Anna Pavlovna Scherer on the contrary, despite her forty
years, overflowed with animation and impulsiveness. To be an
enthusiast had become her social vocation and, sometimes even when she
did not feel like it, she became enthusiastic in order not to
disappoint the expectations of those who knew her. The subdued smile
which, though it did not suit her faded features, always played
round her lips expressed, as in a spoiled child, a continual
consciousness of her charming defect, which she neither wished, nor
could, nor considered it necessary, to correct.

In the midst of a conversation on political matters Anna Pavlovna
burst out:

"Oh, don't speak to me of Austria. Perhaps I don't understand
things, but Austria never has wished, and does not wish, for war.
She is betraying us! Russia alone must save Europe. Our gracious
sovereign recognizes his high vocation and will be true to it. That is
the one thing I have faith in! Our good and wonderful sovereign has to
perform the noblest role on earth, and he is so virtuous and noble
that God will not forsake him. He will fulfill his vocation and
crush the hydra of revolution, which has become more terrible than
ever in the person of this murderer and villain! We alone must
avenge the blood of the just one.... Whom, I ask you, can we rely
on?... England with her commercial spirit will not and cannot
understand the Emperor Alexander's loftiness of soul. She has
refused to evacuate Malta. She wanted to find, and still seeks, some
secret motive in our actions. What answer did Novosiltsev get? None.
The English have not understood and cannot understand the
self-abnegation of our Emperor who wants nothing for himself, but only
desires the good of mankind. And what have they promised? Nothing! And
what little they have promised they will not perform! Prussia has
always declared that Buonaparte is invincible, and that all Europe
is powerless before him.... And I don't believe a word that Hardenburg
says, or Haugwitz either. This famous Prussian neutrality is just a
trap. I have faith only in God and the lofty destiny of our adored
monarch. He will save Europe!"

She suddenly paused, smiling at her own impetuosity.

"I think," said the prince with a smile, "that if you had been
sent instead of our dear Wintzingerode you would have captured the
King of Prussia's consent by assault. You are so eloquent. Will you
give me a cup of tea?"

"In a moment. A propos," she added, becoming calm again, "I am
expecting two very interesting men tonight, le Vicomte de Mortemart,
who is connected with the Montmorencys through the Rohans, one of
the best French families. He is one of the genuine emigres, the good
ones. And also the Abbe Morio. Do you know that profound thinker? He
has been received by the Emperor. Had you heard?"

"I shall be delighted to meet them," said the prince. "But tell me,"
he added with studied carelessness as if it had only just occurred
to him, though the question he was about to ask was the chief motive
of his visit, "is it true that the Dowager Empress wants Baron Funke
to be appointed first secretary at Vienna? The baron by all accounts
is a poor creature."

Prince Vasili wished to obtain this post for his son, but others
were trying through the Dowager Empress Marya Fedorovna to secure it
for the baron.

Anna Pavlovna almost closed her eyes to indicate that neither she
nor anyone else had a right to criticize what the Empress desired or
was pleased with.

"Baron Funke has been recommended to the Dowager Empress by her
sister," was all she said, in a dry and mournful tone.

As she named the Empress, Anna Pavlovna's face suddenly assumed an
expression of profound and sincere devotion and respect mingled with
sadness, and this occurred every time she mentioned her illustrious
patroness. She added that Her Majesty had deigned to show Baron
Funke, and again her face clouded over with sadness.

The prince was silent and looked indifferent. But, with the
womanly and courtierlike quickness and tact habitual to her, Anna
Pavlovna wished both to rebuke him (for daring to speak he had done of
a man recommended to the Empress) and at the same time to console him,
so she said:

"Now about your family. Do you know that since your daughter came
out everyone has been enraptured by her? They say she is amazingly
beautiful."

The prince bowed to signify his respect and gratitude.

"I often think," she continued after a short pause, drawing nearer
to the prince and smiling amiably at him as if to show that
political and social topics were ended and the time had come for
intimate conversation- "I often think how unfairly sometimes the
joys of life are distributed. Why has fate given you two such splendid
children? I don't speak of Anatole, your youngest. I don't like
him," she added in a tone admitting of no rejoinder and raising her
eyebrows. "Two such charming children. And really you appreciate
them less than anyone, and so you don't deserve to have them."

And she smiled her ecstatic smile.

"I can't help it," said the prince. "Lavater would have said I
lack the bump of paternity."

"Don't joke; I mean to have a serious talk with you. Do you know I
am dissatisfied with your younger son? Between ourselves" (and her
face assumed its melancholy expression), "he was mentioned at Her
Majesty's and you were pitied...."

The prince answered nothing, but she looked at him significantly,
awaiting a reply. He frowned.

"What would you have me do?" he said at last. "You know I did all
a father could for their education, and they have both turned out
fools. Hippolyte is at least a quiet fool, but Anatole is an active
one. That is the only difference between them." He said this smiling
in a way more natural and animated than usual, so that the wrinkles
round his mouth very clearly revealed something unexpectedly coarse
and unpleasant.

"And why are children born to such men as you? If you were not a
father there would be nothing I could reproach you with," said Anna
Pavlovna, looking up pensively.

"I am your faithful slave and to you alone I can confess that my
children are the bane of my life. It is the cross I have to bear. That
is how I explain it to myself. It can't be helped!"

He said no more, but expressed his resignation to cruel fate by a
gesture. Anna Pavlovna meditated.
"""
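
The limit argument described above can be checked the same way. A minimal sketch, assuming a hypothetical three-span snippet (SAMPLE_HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three matching tags in document order.
SAMPLE_HTML = (
    '<p><span class="green">A</span>'
    '<span class="green">B</span>'
    '<span class="green">C</span></p>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# limit=2 stops after the first two matches in document order.
first_two = soup.find_all("span", limit=2)

# find() returns the same tag as the single element of find_all(limit=1).
first_via_find = soup.find("span")
first_via_limit = soup.find_all("span", limit=1)[0]

print([tag.get_text() for tag in first_two])  # ['A', 'B']
print(first_via_find is first_via_limit)      # True
```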
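  • A caveat about keyword arguments
    Keyword arguments are handy in some situations, but technically they are a redundant BeautifulSoup feature. Anything you can do with a keyword argument can also be done with the techniques covered later in this chapter.
    For example, the following two lines are equivalent:
bsObj.find_all(id="text")
bsObj.find_all("", {"id":"text"})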

Keyword arguments also cause occasional trouble, most notably when searching by the class attribute, because class is a reserved word in Python and cannot be used as a variable or argument name. Running the following code makes Python raise a syntax error for the misuse of class:

bsObj.find_all(class="green")

Instead, use BeautifulSoup's workaround and append an underscore to class:

bsObj.find_all(class_="green")

Alternatively, pass class as a quoted key in the attributes dictionary:

bsObj.find_all("", {"class":"green"})
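
To confirm that the keyword form and the dictionary form select the same tags, here is a small sketch. It assumes a hypothetical inline snippet (SAMPLE_HTML) and passes the dictionary through the attrs keyword rather than positionally.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the first span carries two classes.
SAMPLE_HTML = '<p id="text"><span class="green big">A</span><span class="red">B</span></p>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword form: class_ avoids the reserved word class.
by_keyword = soup.find_all(class_="green")
# Dictionary form: "class" is just a string key, so no clash with the reserved word.
by_attrs = soup.find_all(attrs={"class": "green"})

# The same equivalence holds for other attributes such as id.
by_id_keyword = soup.find_all(id="text")
by_id_attrs = soup.find_all(attrs={"id": "text"})

print(by_keyword == by_attrs, by_id_keyword == by_id_attrs)  # True True
```

Searching for "green" also matches class="green big", because class is a multi-valued attribute and each value is tested separately.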
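  • Tag objects
    A Tag object is what BeautifulSoup returns, as a list of objects or as a single object, when you call find or find_all, or when you access a child tag directly as an attribute, for example:
bsObj.div.h1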
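
A short sketch of what attribute-style access returns, assuming a hypothetical inline snippet (SAMPLE_HTML):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet with two headings inside a div.
SAMPLE_HTML = "<div><h1>War and Peace</h1><h2>Chapter 1</h2></div>"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = soup.div.h1           # attribute access returns the first matching descendant
same_heading = soup.find("h1")  # the equivalent lookup through find()
missing = soup.div.h3           # a tag that does not exist yields None

print(type(heading).__name__, heading.name, heading.string)  # Tag h1 War and Peace
print(heading is same_heading, missing)                      # True None
```

Because a missing tag yields None, chained access such as bsObj.div.h3.get_text() raises AttributeError when any link in the chain is absent, so it is worth checking for None on pages whose structure may vary.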

1.2.2 Getting a Tag's Children, Siblings, and Parents

  • An HTML page can be mapped to a tree in which each tag is a node nested under its parent tag.

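1.2.2.1 Children and Other Descendants

  • The .children attribute iterates over an object's direct child nodes;
  • tr tags are children of the table tag, while tr, th, td, img, and span tags are all descendants of the table tag. Every child is a descendant, but not every descendant is a child.
  • Print the children of the tag whose id is giftList (the blank lines in the output are the whitespace text nodes between rows, which also count as children):
from urllib.request import urlopen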
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html, 'html.parser')
print('-----child nodes-----')
for child in bsObj.find('table', {'id': 'giftList'}).children:
    print(child)
"""
-----child nodes-----


<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>


<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
"""
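  • The .descendants attribute iterates over all of an object's descendant nodes;
  • Print the descendants of the tag whose id is giftList; this yields every tag and text node nested anywhere under it:
print('-----descendants nodes-----')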
for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:
    print(descendant)
"""
-----descendants nodes-----
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<th>
Item Title
</th>

Item Title

<th>
Description
</th>

Description

<th>
Cost
</th>

Cost

<th>
Image
</th>

Image



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<td>
Vegetable Basket
</td>

Vegetable Basket

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>
Now with super-colorful bell peppers!


<td>
$15.00
</td>

$15.00

<td>
<img src="../img/gifts/img1.jpg"/>
</td>


<img src="../img/gifts/img1.jpg"/>




<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<td>
Russian Nesting Dolls
</td>

Russian Nesting Dolls

<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
8 entire dolls per set! Octuple the presents!


<td>
$10,000.52
</td>

$10,000.52

<td>
<img src="../img/gifts/img2.jpg"/>
</td>


<img src="../img/gifts/img2.jpg"/>




<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
<td>
Fish Painting
</td>

Fish Painting

<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>

If something seems fishy about this painting, it's because it's a fish! 
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
Also hand-painted by trained monkeys!


<td>
$10,005.00
</td>

$10,005.00

<td>
<img src="../img/gifts/img3.jpg"/>
</td>


<img src="../img/gifts/img3.jpg"/>




<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
<td>
Dead Parrot
</td>

Dead Parrot

<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>

This is an ex-parrot! 
<span class="excitingNote">Or maybe he's only resting?</span>
Or maybe he's only resting?


<td>
$0.50
</td>

$0.50

<td>
<img src="../img/gifts/img4.jpg"/>
</td>


<img src="../img/gifts/img4.jpg"/>




<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
<td>
Mystery Box
</td>

Mystery Box

<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. 
<span class="excitingNote">Keep your friends guessing!</span>
Keep your friends guessing!


<td>
$1.50
</td>

$1.50

<td>
<img src="../img/gifts/img6.jpg"/>
</td>


<img src="../img/gifts/img6.jpg"/>
"""
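
To compare children and descendants without the whitespace text nodes that clutter the output above, the following sketch filters for Tag objects only. It assumes a hypothetical, much smaller table (SAMPLE_HTML) than the one on the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table with the same id as the one on the page.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item Title</th><th>Cost</th></tr>"
    '<tr class="gift"><td>Dead Parrot</td><td>$0.50</td></tr>'
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: only the nodes directly under <table>.
child_tags = [node.name for node in table.children if isinstance(node, Tag)]
# .descendants: every nested node, in document order.
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_tags)       # ['tr', 'tr']
print(descendant_tags)  # ['tr', 'th', 'th', 'tr', 'td', 'td']
```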

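1.2.2.2 Siblings

  • An object is never its own sibling: whenever you fetch a tag's siblings, the tag itself is not included. Also, next_siblings returns only the siblings that follow. For example, if we select a tag in the middle of a group and use next_siblings, only the siblings after it are returned.
  • .next_siblings: iterates over the siblings after the object;
  • .previous_siblings: iterates over the siblings before the object;
  • .next_sibling and .previous_sibling work like .next_siblings and .previous_siblings, except that they return a single node rather than a group.
  • Print the siblings that follow the first tr child of the tag whose id is giftList:
print('-----sibling nodes-----')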
for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)
"""
-----sibling nodes-----


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>


<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
"""
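
A compact sketch of sibling navigation from a tag in the middle of a group, assuming a hypothetical four-row table (SAMPLE_HTML) and filtering out text nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table: a header row followed by three gift rows.
SAMPLE_HTML = (
    "<table>"
    '<tr id="header"></tr>'
    '<tr id="gift1"></tr>'
    '<tr id="gift2"></tr>'
    '<tr id="gift3"></tr>'
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
middle = soup.find("tr", {"id": "gift2"})

# Siblings after the tag, in document order; the tag itself is never included.
after = [t["id"] for t in middle.next_siblings if isinstance(t, Tag)]
# Siblings before the tag, nearest first (walking backwards through the document).
before = [t["id"] for t in middle.previous_siblings if isinstance(t, Tag)]

print(after)   # ['gift3']
print(before)  # ['gift1', 'header']
```

On real pages the siblings usually include whitespace text nodes, which is why the isinstance filter is applied before reading the id attribute.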

1.2.2.3 Parents

  • .parents: iterates over all of an object's ancestors, from its immediate parent up to the document root;
  • .parent: returns the object's immediate parent tag;
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()) # .text also returns $15.00
"""
$15.00
"""
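
The lookup above walks from the image up to its cell and then sideways to the price cell. The sketch below spells out each hop on a hypothetical single-row table (SAMPLE_HTML) written without whitespace between cells.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, image cell.
SAMPLE_HTML = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    "</tr></table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

image_cell = img.parent                    # the <td> that wraps the image
price_cell = image_cell.previous_sibling   # the <td> immediately before it
price = price_cell.get_text().strip()

# .parents walks upward one ancestor at a time.
ancestors = [tag.name for tag in img.parents]

print(price)          # $15.00
print(ancestors[:3])  # ['td', 'tr', 'table']
```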

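1.3 Regular Expressions and BeautifulSoup

For background on regular expressions, see: https://editor.csdn.net/md/?articleId=115872907

  • The page to scrape is http://www.pythonscraping.com/pages/page3.html
    Notice that the page has several product images, whose source looks like this:
    <img src="../img/gifts/img3.jpg">
    If we want the URLs of all the product images, the obvious approach is find_all("img"), right? There is a catch. Besides the obviously "extra" images (such as logos), modern sites also contain hidden images, blank images used for spacing and alignment, and other image tags that are easy to overlook. In short, you cannot assume that the only images on a page are product images.
  • The page layout may also change, or for some reason we may not want to rely on an image's position in the page to find the tag. This becomes a problem when the element or data you want is scattered unpredictably across a site. For example, some pages may have a product image at the top while others do not. The solution is to look for something that identifies the tag itself; here that is the image's file path, which a regular expression can match.
import re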
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'html.parser')
# Raw string so the backslashes reach the regex engine unchanged
images = bsObj.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])
"""
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
"""
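
To see the regular expression filter out non-product images, here is a sketch that runs offline. It assumes a hypothetical set of image tags (SAMPLE_HTML), including a logo and a spacer that the pattern should skip.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical mix of product images and "extra" images.
SAMPLE_HTML = (
    '<img src="../img/logo.jpg"/>'
    '<img src="../img/gifts/img1.jpg"/>'
    '<img src="../img/gifts/img2.jpg"/>'
    '<img src="../img/spacer.gif"/>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only paths under ../img/gifts/ that start with "img" and end in ".jpg".
GIFT_IMAGE = re.compile(r"\.\./img/gifts/img.*\.jpg")

gift_sources = [img["src"] for img in soup.find_all("img", {"src": GIFT_IMAGE})]
print(gift_sources)  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```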

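1.4 Accessing Attributes

  • When scraping, you often need a tag's attributes rather than its content. For example, the URL an <a> tag points to is in its href attribute, and the image file of an <img> tag is in its src attribute. For any tag object, the following returns all of its attributes: myTag.attrs. Note that this returns a Python dictionary, so the attributes can be read and modified like dictionary entries. For example, to get an image's source location src, use the following line:
myImgTag.attrs["src"]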
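
A brief sketch of the common ways to read attributes, assuming a hypothetical link tag:

```python
from bs4 import BeautifulSoup

# Hypothetical link with a multi-valued class attribute.
soup = BeautifulSoup('<a href="/pages/page3.html" class="nav main">Gifts</a>', "html.parser")
link = soup.a

all_attrs = link.attrs      # dictionary of every attribute on the tag
href = link["href"]         # shorthand for link.attrs["href"]
title = link.get("title")   # .get() returns None instead of raising KeyError
classes = link["class"]     # class is multi-valued, so it comes back as a list

print(all_attrs)
print(href, title, classes)
```

Using .get() is a reasonable default when an attribute may be absent on some tags, since indexing with a missing key raises KeyError.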

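1.5 Using Lambda Functions

  • A lambda expression is essentially a function that can be passed as an argument to another function; that is, instead of defining a function as f(x, y), you can define it as f(g(x), y) or f(g(x), h(x)).
  • BeautifulSoup lets us pass certain kinds of functions as the argument to find_all. The only restriction is that the function must take a tag object as its argument and return a Boolean. BeautifulSoup evaluates every tag object it encounters with this function, keeps the tags for which it returns True, and discards the rest.
  • For example, the following code retrieves every tag that has exactly two attributes:
bsObj.find_all(lambda tag: len(tag.attrs) == 2)
# Output
# This line finds tags such as the following:
# <div class="body" id="content"></div>
# <span style="color:red" class="title"></span>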
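
The two-attribute filter can be run against the tags listed above. This sketch assumes a hypothetical snippet (SAMPLE_HTML) that adds one single-attribute tag as a counterexample.

```python
from bs4 import BeautifulSoup

# Two tags with exactly two attributes, plus one tag with a single attribute.
SAMPLE_HTML = (
    '<div class="body" id="content"></div>'
    '<span style="color:red" class="title"></span>'
    '<p class="note"></p>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The lambda receives each Tag and must return True to keep it.
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]

print(names)  # ['div', 'span']
```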

1.6 Scraping a File from a Page and Downloading It Locally

  • urllib.request.urlretrieve downloads a file to the local disk given the file's URL;
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

url = "http://attack.mitre.org/"
html = urlopen(url)
bsObj = BeautifulSoup(html, 'html.parser')
imageLocation = bsObj.find("div", {"class": "py-1"}).find("img")["src"]
print(imageLocation)
urlretrieve(url + imageLocation[1:], "logo.png")  # [1:] drops the leading / so it is not doubled after url
"""
# imageLocation is
/theme/images/ATT&CK_red.png
"""
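
Concatenating url and imageLocation[1:] works here because the src is root-relative and url ends with a slash. A more general sketch, assuming the standard library's urljoin is acceptable for resolving the path, is shown below; the helper names absolute_url and download are hypothetical, and download is defined but not called so the example runs without network access.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve


def absolute_url(page_url, src):
    """Resolve a src value (absolute, root-relative, or relative) against the page URL."""
    return urljoin(page_url, src)


def download(page_url, src, filename):
    """Download the resource behind src to filename and return the local path."""
    local_path, _headers = urlretrieve(absolute_url(page_url, src), filename)
    return local_path


# Root-relative path, as on the page used above.
print(absolute_url("http://attack.mitre.org/", "/theme/images/ATT&CK_red.png"))
# Relative path containing "..", as on the gift list page.
print(absolute_url("http://www.pythonscraping.com/pages/page3.html", "../img/gifts/img1.jpg"))
```

With this approach the same call handles both kinds of src values, so no slicing of the path is needed.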
