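Preface
I have recently been collecting data from a number of websites. Most sites now load their data dynamically; the most common case is a list that refreshes as you scroll down. Plain Scrapy cannot handle this. Selenium and Playwright plugins exist for Scrapy, but they are still immature, so I have largely given up on doing everything in Scrapy. I use Scrapy for static sites and rely entirely on Selenium, Puppeteer, or Playwright for dynamic ones.
After trying all three, my basic conclusion is:
Selenium handles most dynamic-data scraping without trouble. Its support for Python and other languages is better than that of Puppeteer and Playwright, and overall it is more mature and stable. Its documentation is also richer, and answers to common problems are easy to find online.
Of course, this conclusion may no longer hold in six months, because the other two are developing quickly. Back to the topic: once you have mastered Selenium's selectors, you have mastered half of Selenium. CSS selectors in particular are concise and easy to use, and should be your first choice.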
Typical selection methods
Selecting by ID
If an element has an ID, prefer locating it by ID.
XPath: //div[@id='example']
CSS: #example
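To make the ID case concrete, here is a minimal sketch of how the same element could be located three ways. The helper names (id_locators, find_first) are my own, and the strategy strings are the plain values behind Selenium's By.ID, By.CSS_SELECTOR and By.XPATH; in real code you would normally import By and use the constants. I also assume the id contains no characters that need escaping in CSS.

```python
from typing import List, Tuple

# Strategy names as plain strings; in Selenium's Python bindings these are the
# values of By.ID, By.CSS_SELECTOR and By.XPATH.
BY_ID, BY_CSS, BY_XPATH = "id", "css selector", "xpath"


def id_locators(element_id: str) -> List[Tuple[str, str]]:
    """Return three equivalent (strategy, value) pairs for one element id."""
    return [
        (BY_ID, element_id),
        (BY_CSS, f"#{element_id}"),
        (BY_XPATH, f"//*[@id='{element_id}']"),
    ]


def find_first(driver, locators):
    """Try each locator in turn; return the first matching element or None."""
    for by, value in locators:
        # find_elements returns an empty list when nothing matches.
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    return None
```

With a real WebDriver instance this would be called as find_first(driver, id_locators("example")).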
Selecting by element type
XPath: //input
CSS: input
Direct children
XPath uses a single slash "/"; CSS selectors use ">".
Example:
XPath: //div/a
CSS: div > a
Non-direct descendants
XPath uses a double slash "//"; CSS uses a space. Example:
XPath: //div//a
CSS: div a
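The difference between "/" and "//" is easy to check without a browser. The sketch below uses Python's standard-library ElementTree, which supports a small XPath subset, on a made-up page fragment of my own; it is only an illustration of the path semantics, not of Selenium itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment (well-formed so the stdlib XML parser accepts it).
PAGE = """
<body>
  <div>
    <a id="direct">Home</a>
    <p><a id="nested">Docs</a></p>
  </div>
</body>
"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative, so //div/a is written .//div/a here.
direct_only = [a.get("id") for a in root.findall(".//div/a")]
all_descendants = [a.get("id") for a in root.findall(".//div//a")]

print(direct_only)       # only the link that is a direct child of the div
print(all_descendants)   # both links, at any depth below the div
```

In Selenium the same two queries would be the CSS selectors "div > a" and "div a".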
Selecting by class name
XPath: use the predicate [@class='example']
CSS: use a dot "."
XPath: //div[@class='example']
CSS: .example
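One caveat worth seeing in code: the XPath form compares the whole class attribute, while the CSS form matches any element that has the class among several. The sketch below shows this with the standard-library ElementTree on a made-up fragment; the has_class helper and the TOKEN_XPATH constant are my own names, and the longer XPath is the usual idiom for matching a single class token.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one div with a single class, one with two classes.
PAGE = """
<body>
  <div class="example">only class</div>
  <div class="example active">two classes</div>
</body>
"""

root = ET.fromstring(PAGE)

# [@class='example'] is an exact string comparison on the attribute.
exact = [d.text for d in root.findall(".//div[@class='example']")]


def has_class(elem, name):
    """Emulate CSS .name: true if name is one of the space-separated classes."""
    return name in elem.get("class", "").split()


css_like = [d.text for d in root.iter("div") if has_class(d, "example")]

# XPath 1.0 expression with the same token semantics as the CSS .example
# (ElementTree cannot evaluate it, but a browser driven by Selenium can).
TOKEN_XPATH = "//div[contains(concat(' ', normalize-space(@class), ' '), ' example ')]"

print(exact)     # the exact comparison misses the two-class div
print(css_like)  # token matching finds both
```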
Selecting by element text
XPath: //*[text()='Get started free']
XPath: //*[contains(text(), 'Get started')]
CSS: standard CSS selectors cannot match on an element's text, so use XPath for this case.
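Text matching breaks as soon as the text itself contains a quote, so it can help to build these XPath expressions with a small helper. This is a sketch under my own naming (xpath_literal, text_xpath); XPath 1.0 has no escape character, so text containing both quote types has to be assembled with concat().

```python
def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types present: split on ' and glue the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def text_xpath(text: str, exact: bool = True) -> str:
    """Build an XPath that matches any element by its text."""
    literal = xpath_literal(text)
    if exact:
        return f"//*[text()={literal}]"
    return f"//*[contains(text(), {literal})]"


print(text_xpath("Get started free"))
print(text_xpath("Get started", exact=False))
print(text_xpath("Don't have an account?"))
```

The resulting string is what you would pass to Selenium with the XPath strategy.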
Advanced CSS selector usage
Next sibling
This is useful for navigating lists of elements, such as forms or ul items. The next sibling will tell selenium to find the next adjacent element on the page that’s inside the same parent. Let’s show an example using a form to select the field after username.
Login
Let's write an XPath and CSS selector that will choose the input field after "username". This will select the "alias" input, or will select a different element if the form is reordered.
XPath: //input[@id='username']/following-sibling::input[1]
CSS: #username + input
Attribute Values
If you don’t care about the ordering of child elements, you can use an attribute selector in selenium to choose elements based on any attribute value. A good example would be choosing the ‘username’ element of the form above without adding a class.
We can easily select the username element without adding a class or an id to the element.
XPath: //input[@name='username']
CSS: input[name='username']
We can even chain filters to be more specific with our selectors.
XPath: //input[@name='login' and @type='submit']
CSS: input[name='login'][type='submit']
Here Selenium will act on the input field with name=“login” and type=“submit”
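Chained attribute filters follow a regular pattern, so they can be generated. The two helpers below are a sketch of my own (the names are not part of Selenium), and they assume attribute values contain no single quotes.

```python
def css_attr_selector(tag: str, **attrs: str) -> str:
    """Build tag[name='value'][...] from keyword arguments, in call order."""
    return tag + "".join(f"[{name}='{value}']" for name, value in attrs.items())


def xpath_attr_selector(tag: str, **attrs: str) -> str:
    """Build //tag[@name='value' and ...] from keyword arguments."""
    if not attrs:
        return f"//{tag}"
    conditions = " and ".join(
        f"@{name}='{value}'" for name, value in attrs.items()
    )
    return f"//{tag}[{conditions}]"


print(css_attr_selector("input", name="login", type="submit"))
print(xpath_attr_selector("input", name="login", type="submit"))
```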
Positional matching: nth-child and nth-of-type
CSS selectors in Selenium allow us to navigate lists with more finesse than the above methods. If we have a ul and we want to select its fourth li element without regard to any other elements, we should use nth-child or nth-of-type. Nth-child is a pseudo-class. In straight CSS, that allows you to override behavior of certain elements; we can also use it to select those elements.
<ul id = "recordlist">
<li>Cat</li>
<li>Dog</li>
<li>Car</li>
<li>Goat</li>
</ul>
If we want to select the fourth li element (Goat) in this list, we can use nth-of-type, which will find the fourth li in the list. Note the single colon: pseudo-classes take one colon, and the double colon is reserved for pseudo-elements such as ::before.
CSS: #recordlist li:nth-of-type(4)
On the other hand, if we want to get the fourth child only if it is a li element, we can use a filtered nth-child. For the list above this also selects Goat; the two selectors differ only when the ul has children of another type. With one extra non-li element at the top of the list, for example, the fourth child would be Car.
CSS: #recordlist li:nth-child(4)
Note, if you don't specify a child type for nth-child it will allow you to select the fourth child without regard to type. This may be useful in testing CSS layout in Selenium.
CSS: #recordlist *:nth-child(4)
In XPath, li[4] corresponds to li:nth-of-type(4) and *[4] corresponds to *:nth-child(4).
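The two pseudo-classes give the same result for the list as shown and only diverge when the ul has a child of another type. The sketch below emulates both with the standard-library ElementTree; the helper names and the second list with an extra p element at the top are my own illustration, not something from a real page.

```python
import xml.etree.ElementTree as ET

PLAIN = ET.fromstring(
    "<ul id='recordlist'><li>Cat</li><li>Dog</li><li>Car</li><li>Goat</li></ul>"
)
# Hypothetical variant with one non-li child at the top.
WITH_HEADING = ET.fromstring(
    "<ul id='recordlist'><p>Heading</p>"
    "<li>Cat</li><li>Dog</li><li>Car</li><li>Goat</li></ul>"
)


def nth_of_type(parent, tag, n):
    """Emulate tag:nth-of-type(n), which is XPath tag[n]."""
    return parent.find(f"./{tag}[{n}]")


def nth_child(parent, n, tag=None):
    """Emulate tag:nth-child(n); tag=None behaves like *:nth-child(n)."""
    children = list(parent)
    if n > len(children):
        return None
    child = children[n - 1]
    return child if tag is None or child.tag == tag else None


print(nth_of_type(PLAIN, "li", 4).text)         # Goat
print(nth_child(PLAIN, 4, "li").text)           # Goat
print(nth_of_type(WITH_HEADING, "li", 4).text)  # Goat
print(nth_child(WITH_HEADING, 4, "li").text)    # Car
```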
Substring matching
A distinctive feature of CSS selectors is string matching on attribute values, using ^=, $=, or *=.
^= matches a prefix
CSS: a[id^='id_prefix_']
A link with an "id" that starts with the text "id_prefix_"
$= matches a suffix
CSS: a[id$='_id_sufix']
A link with an “id” that ends with the text “_id_sufix”
*= matches a substring
CSS: a[id*='id_pattern']
A link with an “id” that contains the text “id_pattern”
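The three operators behave like Python's startswith, endswith and in, with one difference: in CSS an empty pattern matches nothing. The function below is a plain-Python emulation of those semantics for illustration only (attr_matches and the sample ids are my own); it is not how a browser implements them.

```python
def attr_matches(value: str, op: str, needle: str) -> bool:
    """Emulate the CSS attribute operators ^=, $= and *= on one value."""
    if not needle:
        # Per the CSS Selectors spec, an empty pattern matches nothing.
        return False
    if op == "^=":
        return value.startswith(needle)
    if op == "$=":
        return value.endswith(needle)
    if op == "*=":
        return needle in value
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical id values to filter.
ids = ["id_prefix_menu", "menu_id_sufix", "x_id_pattern_y", "other"]

prefix_hits = [i for i in ids if attr_matches(i, "^=", "id_prefix_")]
suffix_hits = [i for i in ids if attr_matches(i, "$=", "_id_sufix")]
substring_hits = [i for i in ids if attr_matches(i, "*=", "id_pattern")]

print(prefix_hits, suffix_hits, substring_hits)
```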
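Summary
This article has walked through the basic usage of Selenium selectors to help you get a handle on this powerful test-automation tool. Of course, if you would rather use it for automated operations or web scraping, nobody is stopping you.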