Scraping e-book titles and links in Python with lxml and cssselect

2024/7/19 12:10:07 Tags: python, web scraping

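While browsing this page (http://blog.jobbole.com/29281/), I noticed that the e-books listed there looked good.

I wanted to grab them, and since I happen to be learning web scraping, I used lxml and cssselect to pull out each book's title and link as a small exercise.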

1. The download function

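# Python 2 code: urllib2 and the print statement do not exist in Python 3.
import urllib2

import lxml.html

def download(url, user_agent='wswp', num_retries=2):
    """Return the page body for url, or None if the download fails."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        # Retry only on 5xx server errors, which are often temporary.
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html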
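For readers on Python 3, where urllib2 no longer exists, the same retry-on-5xx idea can be sketched with urllib.request. This is a minimal sketch rather than the code from the original post: the `opener` parameter is a hypothetical hook added here so the function can be exercised without network access, and the function returns bytes rather than a decoded string.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2,
             opener=urllib.request.urlopen):
    """Fetch url and return the body as bytes, or None on failure.

    `opener` is a hypothetical hook (not in the original post). It defaults
    to urllib.request.urlopen and can be replaced with a stub for testing.
    """
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # HTTPError, a URLError subclass, carries a status code. Retry only
        # on 5xx server errors, which may be temporary.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

With the default `num_retries=2`, a URL that keeps returning a 5xx status is requested at most three times; a 4xx status or a connection failure gives up immediately.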

2. Extracting the data (note the use of cssselect)

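if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once instead of on every loop iteration.
        tree = lxml.html.fromstring(html)
        # Match every <a> that is a direct child of an <li> inside an <ol>.
        # The slice skips the first match (index 0); drop [1:] to include it.
        for a in tree.cssselect('ol > li > a')[1:]:
            book = a.text_content()
            href = a.get('href')
            print book, href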

That completes the scrape.

Reposted from: https://www.cnblogs.com/bigbrother/p/6545883.html
