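While browsing this site (http://blog.jobbole.com/29281/), I found that the e-books listed there look good.
I wanted to download them, and since I happen to be learning web scraping, I fetch the list below with lxml and cssselect as a small exercise.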
1. The download function
import urllib2
import lxml.html

def download(url, user_agent='wswp', num_retries=2):
    # Fetch url and return the page body, or None if the request fails.
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors; a 4xx error will not succeed on retry.
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html
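The retry rule in download can be checked without touching the network. The following is only a minimal sketch, written for Python 3 (unlike the Python 2 listings in this post): `fetch_with_retries` and `FlakyServer` are hypothetical names introduced for illustration, and the fake server stands in for `urllib2.urlopen`. It assumes the same policy as above: retry on 5xx errors only, at most `num_retries` times.

```python
import urllib.error


def fetch_with_retries(fetch, url, num_retries=2):
    """Call fetch(url); retry only on HTTP 5xx errors, at most num_retries times."""
    try:
        return fetch(url)
    except urllib.error.URLError as e:
        # Plain URLError has no status code; HTTPError (a subclass) does.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return fetch_with_retries(fetch, url, num_retries - 1)
        return None


class FlakyServer:
    """Hypothetical stand-in for a server: fails `failures` times, then succeeds."""

    def __init__(self, failures, status):
        self.failures = failures
        self.status = status
        self.calls = 0

    def __call__(self, url):
        self.calls += 1
        if self.calls <= self.failures:
            raise urllib.error.HTTPError(url, self.status, 'error', None, None)
        return '<html>ok</html>'


if __name__ == '__main__':
    server = FlakyServer(failures=2, status=503)
    print(fetch_with_retries(server, 'http://example.com/'), server.calls)
```

With the default of two retries, a server that fails twice with 503 is fetched successfully on the third call, while a 404 gives up after a single call.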
2. Scraping the data (note how cssselect is used)
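if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once instead of re-parsing it on every iteration.
        tree = lxml.html.fromstring(html)
        # 'ol > li > a' matches links that are direct children of ordered-list items.
        links = tree.cssselect('ol > li > a')
        # As in the original loop, start at index 1, which skips the first match.
        for link in links[1:]:
            book = link.text_content()
            href = link.get('href')
            print book, href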
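To see what the selector `ol > li > a` picks out, here is a rough sketch on a small inline document. It is written for Python 3 and deliberately swaps lxml for the standard library's `xml.etree.ElementTree` so that it runs without third-party packages; the path `.//ol/li/a` plays the role of the CSS selector, and `''.join(a.itertext())` plays the role of `text_content()`. The sample markup and the name `extract_links` are made up for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real page).
SAMPLE = """
<div>
  <ol>
    <li><a href="/book/1">Book One</a></li>
    <li><a href="/book/2">Book <b>Two</b></a></li>
    <li><span><a href="/nested">Nested link</a></span></li>
  </ol>
  <ul>
    <li><a href="/other">Not in an ordered list</a></li>
  </ul>
</div>
"""


def extract_links(markup):
    """Return (text, href) pairs for <a> elements that are direct children
    of an <li> that is itself a direct child of an <ol>."""
    root = ET.fromstring(markup)
    pairs = []
    for a in root.findall('.//ol/li/a'):
        # Join all nested text, similar to lxml's text_content().
        text = ''.join(a.itertext()).strip()
        pairs.append((text, a.get('href')))
    return pairs


if __name__ == '__main__':
    for text, href in extract_links(SAMPLE):
        print(text, href)
```

Only the first two links match: the third sits inside a `<span>`, so it is not a direct child of its `<li>`, and the last belongs to a `<ul>` rather than an `<ol>`.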
That completes the scrape.