Selenium Selectors: A Summary

Preface

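I have recently been collecting data from a number of websites. Most sites now load their data dynamically; the most common case is a list that refreshes as you scroll down. Traditional Scrapy crawling cannot handle this. Plugins exist that add Selenium or Playwright support to Scrapy, but they are not yet very mature, so I have largely given up on doing everything with Scrapy. I use Scrapy for static sites, and Selenium, Puppeteer, or Playwright for dynamic ones.
After trying all three, my basic conclusion is:

Selenium handles most dynamic-data scraping without problems. Its support for Python and other languages is better than that of Puppeteer and Playwright, and it is generally more mature and stable. Its documentation is also richer, and answers to common problems are easy to find online.

Of course, this conclusion may no longer hold in six months, because the latter two are developing quickly. Back to the point: once you have mastered Selenium's selectors, you have mastered half of Selenium. CSS selectors in particular are concise and easy to use, and should be the first choice.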

Typical ways to select elements

Selecting by ID

If an element has an ID, prefer locating it by ID.

XPath: //div[@id='example'] 
CSS: #example

Selecting by element type

XPath: //input
CSS: input

Direct children

XPath uses a single slash "/" to select a direct child; CSS uses ">".

Example:

XPath: //div/a
CSS: div > a

Descendants (not necessarily direct children)

XPath uses a double slash "//"; CSS uses a space. Example:

XPath: //div//a
CSS: div a

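Selecting by class name

XPath: [@class='example']
CSS: a dot "." followed by the class name
Note that the XPath form matches only when the class attribute is exactly "example", while the CSS form matches any element whose class list includes example.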

XPath: //div[@class='example']
CSS: .example
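
As a rough illustration of how the selector pairs above are used from Python, here is a minimal sketch. The names LOCATOR_PAIRS and count_matches are made up for this example, and the driver is assumed to be any Selenium WebDriver that has already loaded a page. The fallback strings are the values of By.CSS_SELECTOR and By.XPATH, used only so the snippet still imports on a machine without Selenium.

```python
# By.CSS_SELECTOR and By.XPATH are plain strings in Selenium's Python bindings.
# The fallback keeps this sketch importable when Selenium is not installed.
try:
    from selenium.webdriver.common.by import By
    CSS, XPATH = By.CSS_SELECTOR, By.XPATH
except ImportError:
    CSS, XPATH = "css selector", "xpath"

# Roughly equivalent (CSS, XPath) pairs taken from the sections above.
# Hypothetical table, for illustration only.
LOCATOR_PAIRS = {
    "by id": ("#example", "//div[@id='example']"),
    "by element type": ("input", "//input"),
    "direct child": ("div > a", "//div/a"),
    "descendant": ("div a", "//div//a"),
    "by class": (".example", "//div[@class='example']"),
}


def count_matches(driver, name):
    """Return (css_count, xpath_count) for one named locator pair."""
    css, xpath = LOCATOR_PAIRS[name]
    css_hits = driver.find_elements(CSS, css)
    xpath_hits = driver.find_elements(XPATH, xpath)
    return len(css_hits), len(xpath_hits)
```

With a real browser session you would create the driver with something like webdriver.Chrome(), call driver.get(...) on the page of interest, and then compare the two counts. A mismatch usually means the two selectors are not as equivalent as they look, for example when an element carries more than one class.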

Selecting by element text

XPath: //*[text()='Get started free']
XPath: //*[contains(text(), 'Get started')]
CSS: no equivalent; standard CSS selectors cannot match on an element's text.
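
Text-based XPath expressions are usually built from a variable, and XPath 1.0 has no escape character for quotes inside a string literal. The sketch below shows one way to build the two text expressions safely. The helper names xpath_literal, by_exact_text and by_partial_text are invented for this example and are not part of Selenium.

```python
def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote kinds are present: split on single quotes and
    # glue the pieces back together with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def by_exact_text(text: str) -> str:
    """XPath for any element with a text node equal to the given text."""
    return f"//*[text()={xpath_literal(text)}]"


def by_partial_text(text: str) -> str:
    """XPath for any element whose first text node contains the given text."""
    return f"//*[contains(text(), {xpath_literal(text)})]"


print(by_exact_text("Get started free"))
print(by_partial_text("Get started"))
```

The resulting strings can be passed to driver.find_element(By.XPATH, ...) in the usual way.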

Advanced CSS selector usage

Next sibling

This is useful for navigating lists of elements, such as form fields or ul items. The next-sibling selector tells Selenium to find the next adjacent element on the page that is inside the same parent. As an example, consider a form in which we want to select the field after username.

Let's write an XPath and a CSS selector that will choose the input field after "username". This will select the "alias" input, or a different element if the form is reordered.

XPath: //input[@id='username']/following-sibling::input[1]
CSS: #username + input

Attribute Values

If you don't care about the ordering of child elements, you can use an attribute selector in Selenium to choose elements based on any attribute value. A good example would be choosing the "username" element of the form above without adding a class.

We can easily select the username element without adding a class or an id to the element.

XPath: //input[@name='username']
CSS: input[name='username']

We can even chain filters to be more specific with our selectors.

XPath: //input[@name='login' and @type='submit']
CSS: input[name='login'][type='submit']

Here Selenium will act on the input field with name="login" and type="submit".
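
To try the attribute expressions without a browser, the sketch below runs them against a small form using Python's standard xml.etree.ElementTree. The form markup is hypothetical: only the field names (username, alias, login) come from the text, and the rest is assumed. ElementTree supports only a subset of XPath and has no "and" operator, so the combined test is written as two chained predicates, which here has the same effect.

```python
import xml.etree.ElementTree as ET

# Hypothetical form: field names follow the text, the markup itself is assumed.
FORM = """
<form>
  <input id="username" name="username" type="text"/>
  <input id="alias" name="alias" type="text"/>
  <input name="login" type="submit"/>
</form>
"""
root = ET.fromstring(FORM)

# ElementTree paths are relative to the element, so //input becomes .//input.
username = root.find(".//input[@name='username']")

# Two chained predicates: both attribute tests must hold.
login = root.find(".//input[@name='login'][@type='submit']")

# All text inputs, selected by attribute value alone.
text_inputs = root.findall(".//input[@type='text']")

print(username.get("id"), login.get("type"), len(text_inputs))
```

In Selenium itself the full XPath with "and" works as written above, since the browser implements complete XPath 1.0.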

Positional matching: nth-child and nth-of-type

CSS selectors in Selenium allow us to navigate lists with more finesse than the above methods. If we have a ul and we want to select its fourth li element without regard to any other elements, we should use nth-child or nth-of-type. Nth-child is a pseudo-class. In straight CSS, that allows you to override behavior of certain elements; we can also use it to select those elements.

<ul id = "recordlist">
<li>Cat</li>
<li>Dog</li>
<li>Car</li>
<li>Goat</li>
</ul>

If we want to select the fourth li element (Goat) in this list, we can use nth-of-type, which finds the fourth li among its siblings. Note that pseudo-classes such as nth-of-type and nth-child take a single colon; the double colon is reserved for pseudo-elements such as ::before.

CSS: #recordlist li:nth-of-type(4)

On the other hand, if we want the fourth child only if it is an li element, we can use a filtered nth-child. In the list above every child is an li, so this also selects Goat; the two selectors differ only when other element types are mixed in among the li elements.

CSS: #recordlist li:nth-child(4)

Note that if you don't specify an element type for nth-child, it selects the fourth child regardless of type. This may be useful for testing CSS layout in Selenium.

CSS: #recordlist *:nth-child(4)

In XPath this is similar to using [4], for example //ul[@id='recordlist']/*[4] for the fourth child of any type, or //ul[@id='recordlist']/li[4] for the fourth li.
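
The difference between the two pseudo-classes only shows up when the parent has children of more than one type. The sketch below imitates both rules with Python's standard xml.etree.ElementTree, not with a browser. The function names and the second, mixed list are made up for the illustration, and a div directly inside a ul is not valid HTML; it is there only to make the two rules diverge.

```python
import xml.etree.ElementTree as ET

PLAIN = "<ul><li>Cat</li><li>Dog</li><li>Car</li><li>Goat</li></ul>"
# Hypothetical list with one non-li child in third position.
MIXED = "<ul><li>Cat</li><li>Dog</li><div>--</div><li>Car</li><li>Goat</li></ul>"


def fourth_of_type(ul):
    """Imitates li:nth-of-type(4): the fourth li, ignoring other tags."""
    return ul.find("./li[4]")


def fourth_child_if_li(ul):
    """Imitates li:nth-child(4): the fourth child, but only if it is an li."""
    children = list(ul)
    if len(children) >= 4 and children[3].tag == "li":
        return children[3]
    return None


plain = ET.fromstring(PLAIN)
mixed = ET.fromstring(MIXED)

for name, ul in (("plain", plain), ("mixed", mixed)):
    print(name, fourth_of_type(ul).text, fourth_child_if_li(ul).text)
```

On the plain list both rules pick Goat. On the mixed list nth-of-type still picks Goat, while nth-child picks Car, because the div counts as a child but not as an li.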