Scraping Sina Weibo User Profiles with Scrapy (Writing the Results to MongoDB)

Tags: Weibo, Python, web crawler

The fields scraped are:

  1. Weibo user ID
  2. Weibo nickname
  3. Gender
  4. Location
  5. Verification info
  6. Personal signature (bio)
  7. Number of posts
  8. Number of followers
  9. Number of accounts followed

Under the spiders folder, microID_Spider.py is written as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from blogSpider.items import blogIDItem

class MicroidSpiderSpider(scrapy.Spider):
    name = 'microID_Spider'
    allowed_domains = ['weibo.cn']
    start_urls = ['https://weibo.cn/search']
    # Crawl up to 50 pages of search results by default
    max_page = 50
    # Paste your own weibo.cn cookie string here (copied from a logged-in browser session)
    myCookie = 'xxxxxxx'
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'weibo.cn',
        'Origin': 'https://weibo.cn',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    }

    def start_requests(self):
        # Turn the raw cookie string into the dict form Scrapy's `cookies` argument expects
        cookie = {}
        for pair in self.myCookie.split(';'):
            if '=' in pair:
                name, value = pair.split('=', 1)
                cookie[name.strip()] = value.strip()

        # Search keyword: the nickname to look up ('罗志祥' is used as the example here)
        blogID = '罗志祥'
        for i in range(1, self.max_page+1):
            url = '{url}/user/?keyword={blogID}&page={pageNum}'.format(url=self.start_urls[0], blogID=blogID, pageNum=i)
            yield scrapy.FormRequest(
                url,
                headers=self.headers,
                cookies=cookie,
                callback=self.write_BlogID,
            )

    def write_BlogID(self, response):
        pageInfo = Selector(response)
        # print(response.body.decode('utf-8'))
        # Each user in the search results sits in its own HTML <table>, so grab every table on the page
        all_Table = pageInfo.xpath('//table')
        # Rebuild the cookie dict (same parsing as in start_requests)
        cookie = {}
        for pair in self.myCookie.split(';'):
            if '=' in pair:
                name, value = pair.split('=', 1)
                cookie[name.strip()] = value.strip()
        # print(all_Table)
        for table in all_Table:
            # The first link in each result table points to the user's profile page
            ID_href = table.css('a::attr(href)').extract()[0]
            # print(ID_href.split('?'))
            # print(type(ID_href))
            url = 'https://weibo.cn' + ID_href.split('?')[0]
            # print(url)
            yield scrapy.Request(
                url,
                headers=self.headers,
                cookies=cookie,
                callback=self.getBlogIDinfo,
            )


    def getBlogIDinfo(self, response):
        # print(response.body.decode('utf-8'))
        # print(response.url)
        blogID_Info = blogIDItem()
        # The user ID is the last path segment of the profile URL, so take it from there
        blogID_Info['ID'] = response.url.split('/')[-1]
        pageInfo = Selector(response)
        ut_div = pageInfo.xpath('//div[@class="ut"]')
        spans = ut_div.xpath('span[@class="ctt"]/text()').extract()
        # print(len(spans))
        # print(spans)
        # The span.ctt list can hold 1 to 4 elements; handle each case
        if len(spans) == 1:
            # Case 1: only nickname, gender and location, e.g. ['羅誌祥\xa0男/台湾    \xa0    ']
            firstRowInfo = spans[0].split(u'\xa0')
            # Nickname
            blogID_Info['blogName'] = firstRowInfo[0].replace(u'\xa0', u' ')
            # Gender (text before the '/')
            blogID_Info['sex'] = firstRowInfo[1][:firstRowInfo[1].index('/')]
            # Location (text after the '/')
            blogID_Info['location'] = firstRowInfo[1][firstRowInfo[1].index('/')+1:].strip(' ')
            # No verification info
            blogID_Info['identification'] = ''
            # No personal signature
            blogID_Info['personal_sign'] = ''
        elif len(spans) == 2:
            # Case 2: the first span holds nickname/gender/location; the second span is
            # either the verification info or the personal signature
            firstRowInfo = spans[0].split(u'\xa0')
            blogID_Info['blogName'] = firstRowInfo[0].replace(u'\xa0', u' ')
            blogID_Info['sex'] = firstRowInfo[1][:firstRowInfo[1].index('/')]
            blogID_Info['location'] = firstRowInfo[1][firstRowInfo[1].index('/') + 1:].strip(' ')
            if spans[1].find('认证') == -1:
                # No verification info; the span is the personal signature
                blogID_Info['identification'] = ''
                blogID_Info['personal_sign'] = spans[1].replace(u'\u301c', u' ')
            else:
                # The span is the verification info; no personal signature
                blogID_Info['identification'] = spans[1].replace(u'\u301c', u' ')
                blogID_Info['personal_sign'] = ''
        elif len(spans) == 3:
            # Case 3: nickname in its own span, gender/location in the second span;
            # the third span is either the verification info or the personal signature
            blogID_Info['blogName'] = spans[0].replace(u'\xa0', u' ')
            secondRowInfo = spans[1].split(u'\xa0')
            blogID_Info['sex'] = secondRowInfo[1][:secondRowInfo[1].index('/')]
            blogID_Info['location'] = secondRowInfo[1][secondRowInfo[1].index('/') + 1:].strip(' ')
            if spans[2].find('认证') == -1:
                # No verification info; the span is the personal signature
                blogID_Info['identification'] = ''
                blogID_Info['personal_sign'] = spans[2].replace(u'\u301c', u' ')
            else:
                # The span is the verification info; no personal signature
                blogID_Info['identification'] = spans[2].replace(u'\u301c', u' ')
                blogID_Info['personal_sign'] = ''
        elif len(spans) == 4:
            # Case 4: nickname, gender/location, verification info and personal signature are all present
            blogID_Info['blogName'] = spans[0].replace(u'\xa0', u' ')
            secondRowInfo = spans[1].split(u'\xa0')
            blogID_Info['sex'] = secondRowInfo[1][:secondRowInfo[1].index('/')]
            blogID_Info['location'] = secondRowInfo[1][secondRowInfo[1].index('/') + 1:].strip(' ')
            blogID_Info['identification'] = spans[2].replace(u'\u301c', u' ')
            blogID_Info['personal_sign'] = spans[3].replace(u'\u301c', u' ')
        # print(blogID_Info['blogName'])
        # Post / follow / follower counts appear as text like '微博[1234]'; keep only the number inside the brackets
        blogNumInfo = pageInfo.xpath('//span[@class="tc"]/text()').extract()
        # print(blogNumInfo)
        tip2 = pageInfo.xpath('//div[@class="tip2"]')
        focusInfo = tip2.xpath('a[1]/text()').extract()
        # print(focusInfo)
        fansInfo = tip2.xpath('a[2]/text()').extract()
        # print(fansInfo)
        blogID_Info['blog_Num'] = blogNumInfo[0][blogNumInfo[0].index('[')+1:blogNumInfo[0].index(']')]
        blogID_Info['focus_Num'] = focusInfo[0][focusInfo[0].index('[')+1:focusInfo[0].index(']')]
        blogID_Info['fans_Num'] = fansInfo[0][fansInfo[0].index('[')+1:fansInfo[0].index(']')]
        # print(blogID_Info)
        yield blogID_Info

items.py is written as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class blogIDItem(scrapy.Item):
    '''
    Weibo user profile item
    '''
    # Name of the MongoDB collection the pipeline writes to
    collection = 'blogID_Data'
    ID = scrapy.Field()
    blogName = scrapy.Field()
    sex = scrapy.Field()
    location = scrapy.Field()
    identification = scrapy.Field()
    personal_sign = scrapy.Field()
    blog_Num = scrapy.Field()
    fans_Num = scrapy.Field()
    focus_Num = scrapy.Field()
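
The title promises writing the results to MongoDB, but the pipeline itself is not shown above. Below is a minimal sketch of what pipelines.py could look like, assuming pymongo is installed; the class name MongoPipeline and the setting names MONGO_URI and MONGO_DB are illustrative choices for this sketch, not taken from the original project:

# -*- coding: utf-8 -*-
# pipelines.py -- minimal MongoDB pipeline sketch (illustrative, not the original project's code)
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DB are hypothetical setting names used for this sketch
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'weibo'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # blogIDItem carries its target collection name ('blogID_Data') in its `collection` attribute
        self.db[item.collection].insert_one(dict(item))
        return item

For the pipeline to run, it also has to be enabled in settings.py, for example ITEM_PIPELINES = {'blogSpider.pipelines.MongoPipeline': 300} (the priority value 300 is arbitrary).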


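To start the crawl, you can run scrapy crawl microID_Spider from the project root, or use a small runner script. A minimal sketch, assuming it sits next to scrapy.cfg (the file name run.py is arbitrary):

# -*- coding: utf-8 -*-
# run.py -- start the spider from a script, reusing the project's own settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from blogSpider.spiders.microID_Spider import MicroidSpiderSpider

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())
    process.crawl(MicroidSpiderSpider)
    process.start()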
