Searching 51CTO Recommended Blogs with Python and Saving the Results to Excel

news/2024/7/19 9:31:34 Tags: python, web scraping, database

1. Background

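I have recently been learning web scraping. This post uses the Requests module to fetch pages, BeautifulSoup to extract the needed content, and finally the xlsxwriter module to save that content to Excel. I am recording it here so the same approach can later be adapted to scrape other content and persist it to a file, a database, and so on.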

2. Code

I wrote two modules, geturl3 and getexcel3, and call them from main.

The contents of geturl3.py are as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author  : kaliarch

import sys
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

class get_urldic:
    # Read the search keyword and page count, then build one URL per result page
    def get_url(self):
        urlList = []
        first_url = 'http://blog.51cto.com/search/result?q='
        after_url = '&type=&page='
        try:
            search = input("Please input search name:")
            page = int(input("Please input page:"))
        except Exception as e:
            print('Input error:',e)
            sys.exit(1)
        for num in range(1,page+1):
            # Percent-encode the keyword so characters such as '&', '#', or spaces
            # do not break the query string
            url = first_url + quote(search) + after_url + str(num)
            urlList.append(url)
        print("Please wait....")
        return urlList,search

    # Download the raw HTML of every result page
    def get_html(self,urlList):
        response_list = []
        for url in urlList:
            # A timeout keeps one unresponsive page from hanging the whole run
            response = requests.get(url, timeout=10)
            response_list.append(response.content)
        return response_list

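    # Extract blog_name and blog_url from the downloaded pages
    def get_soup(self,html_doc):
        result = {}
        for page_html in html_doc:
            soup = BeautifulSoup(page_html,'html.parser')
            # Search hits are the <a> tags whose class attribute is exactly 'm-1-4 fl'
            links = soup.find_all('a',class_='m-1-4 fl')
            for link in links:
                title = link.get_text()
                result[title.strip()] = link['href']
        return result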

if __name__ == '__main__':
    blog = get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    for k,v in result.items():
        print('search blog_name is:%s,blog_url is:%s' % (k,v))
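
The search URLs in geturl3.py are assembled by string concatenation. As a minimal sketch of the same step, the standard library's urllib.parse.urlencode can build the query string and percent-encode the keyword at the same time. The helper name build_search_urls is hypothetical; the base address and the q, type, and page parameters are the ones that appear in geturl3.py.

```python
from urllib.parse import urlencode

# Base address taken from geturl3.py; the helper name is hypothetical.
SEARCH_BASE = 'http://blog.51cto.com/search/result'

def build_search_urls(keyword, pages):
    """Return one search URL per result page, with the keyword percent-encoded."""
    urls = []
    for page in range(1, pages + 1):
        # urlencode escapes reserved characters and joins the pairs with '&'
        query = urlencode({'q': keyword, 'type': '', 'page': page})
        urls.append(SEARCH_BASE + '?' + query)
    return urls

if __name__ == '__main__':
    for url in build_search_urls('python excel', 2):
        print(url)
```

Keeping the URL construction in a function with no input() calls also makes it easy to check without touching the network.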

The contents of getexcel3.py are as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author  : kaliarch

import xlsxwriter

class create_excle:
    def __init__(self):
        self.tag_list = ["blog_name", "blog_url"]

    def create_workbook(self,search=" "):
        # Name the Excel file after the search keyword
        excle_name = search + '.xlsx'
        workbook = xlsxwriter.Workbook(excle_name)
        worksheet_M = workbook.add_worksheet(search)
        print('create %s....' % excle_name)
        return workbook,worksheet_M

    def col_row(self,worksheet):
        # Row 0 holds the merged title; columns A and B hold blog_name and blog_url
        worksheet.set_row(0, 17)
        worksheet.set_column('A:B', 58)

    def shell_format(self,workbook):
        # Format for the merged title row
        merge_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#FAEBD7'
        })
        # Format for the column-name row
        name_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#E0FFFF'
        })
        # Format for body cells
        normal_format = workbook.add_format({
            'align': 'center',
        })
        return merge_format,name_format,normal_format

    # Write the title and the column names
    def write_title(self,worksheet,search,merge_format):
        title = search + "搜索结果"
        worksheet.merge_range('A1:B1', title, merge_format)
        print('write title success')

    def write_tag(self,worksheet,name_format):
        tag_row = 1
        tag_col = 0
        for num in self.tag_list:
            worksheet.write(tag_row,tag_col,num,name_format)
            tag_col += 1
        print('write tag success')

    # Write one row per blog, starting below the title row (0) and the column-name row (1)
    def write_context(self,worksheet,con_dic,normal_format):
        for row,(k,v) in enumerate(con_dic.items(), start=2):
            worksheet.write(row,0,k,normal_format)
            worksheet.write(row,1,v,normal_format)
        print('write context success')

    # Close the workbook, which writes the Excel file to disk
    def workbook_close(self,workbook):
        workbook.close()

if __name__ == '__main__':
    print('This is create excel mode')
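
The worksheet produced by getexcel3.py has three bands: a merged title in row 0, the column names in row 1, and one row per blog from row 2 onward (xlsxwriter rows and columns are zero-indexed). The sketch below only computes that layout as (row, col, value) tuples, without calling xlsxwriter, so the row arithmetic can be checked on its own. plan_cells is a hypothetical helper and is not part of the original modules.

```python
def plan_cells(title, con_dic, tag_list=("blog_name", "blog_url")):
    """Return (row, col, value) tuples in the layout used by getexcel3.py."""
    cells = [(0, 0, title)]                  # row 0: merged title (A1:B1)
    for col, tag in enumerate(tag_list):     # row 1: column names
        cells.append((1, col, tag))
    # rows 2 and below: one blog per row, name in column 0 and URL in column 1
    for row, (name, url) in enumerate(con_dic.items(), start=2):
        cells.append((row, 0, name))
        cells.append((row, 1, url))
    return cells

if __name__ == '__main__':
    demo = {
        "First post": "http://example.com/1",
        "Second post": "http://example.com/2",
    }
    for cell in plan_cells("demo search results", demo):
        print(cell)
```

With n entries in the dictionary, the last data row is n + 1, so every entry gets its own row.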

The contents of main.py are as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author  : kaliarch

import geturl3
import getexcel3

# Build the {blog_name: blog_url} dictionary
def get_dic():
    blog = geturl3.get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    return result,search

# Write the dictionary to an Excel file
def write_excle(urldic,search):
    excle = getexcel3.create_excle()
    workbook, worksheet = excle.create_workbook(search)
    excle.col_row(worksheet)
    merge_format, name_format, normal_format = excle.shell_format(workbook)
    excle.write_title(worksheet,search,merge_format)
    excle.write_tag(worksheet,name_format)
    excle.write_context(worksheet,urldic,normal_format)
    excle.workbook_close(workbook)

def main():
    url_dic ,search_name = get_dic()
    write_excle(url_dic,search_name)

if __name__ == '__main__':
    main()
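
main.py reads the keyword and page count interactively through input(). For scripted runs, one possible alternative is to take them from the command line with the standard library's argparse. The option names and the parse_cli helper below are assumptions for illustration and are not part of the original program.

```python
import argparse

def parse_cli(argv=None):
    """Parse the search keyword and page count; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Search 51CTO blogs and save the results to Excel")
    parser.add_argument("keyword",
                        help="search keyword, also used as the Excel file name")
    parser.add_argument("-p", "--pages", type=int, default=1,
                        help="number of result pages to fetch (default: 1)")
    args = parser.parse_args(argv)
    if args.pages < 1:
        parser.error("--pages must be at least 1")
    return args.keyword, args.pages

# Demonstration with an explicit argument list instead of the real command line
keyword, pages = parse_cli(["python", "--pages", "3"])
print("keyword=%s pages=%d" % (keyword, pages))
```

The returned pair could then be used to build the URL list in place of the two input() prompts.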

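3. Results

Run the code, then enter the search keyword and the number of result pages to fetch.

An Excel file named after the search keyword is generated; open it to see the content that was written.

With this you can search for and save the 51CTO recommended blogs you need, and you can run several searches in a row.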
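
The background section notes that the scraped results could also be persisted to a database. A minimal sketch of that idea with the standard library's sqlite3 module, assuming the same {blog_name: blog_url} dictionary that get_soup returns. The table name, column names, and the save_to_sqlite helper are hypothetical.

```python
import sqlite3

def save_to_sqlite(con_dic, db_path=":memory:"):
    """Store {blog_name: blog_url} pairs in a SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blogs ("
        "blog_name TEXT PRIMARY KEY, blog_url TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keeps reruns idempotent: a repeated title overwrites its old row
    conn.executemany(
        "INSERT OR REPLACE INTO blogs (blog_name, blog_url) VALUES (?, ?)",
        con_dic.items(),
    )
    conn.commit()
    return conn

if __name__ == "__main__":
    demo = {
        "First post": "http://example.com/1",
        "Second post": "http://example.com/2",
    }
    conn = save_to_sqlite(demo)
    for row in conn.execute("SELECT blog_name, blog_url FROM blogs ORDER BY blog_name"):
        print(row)
    conn.close()
```

Passing a file path instead of ":memory:" would keep the results between runs, so several searches can accumulate in one table.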

