[Crawler Resources] A roundup of crawler resources: building our own awesome series

news/2024/7/19 9:56:12 Tags: crawler, php, c#

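  The rise of big data has, to some extent, driven the rise of web crawlers. Some companies do not produce data themselves, so they have to crawl it from the web. I have followed this area for some time and have written several series on crawlers. I have now collected the better frameworks, hoping to provide an index for anyone interested in crawlers or who wants to study them further. Stars, forks, and issues are always welcome; let's improve this awesome series together.
GitHub repository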

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages

Python

  • Scrapy - A fast high-level screen scraping and web crawling framework.
  • pyspider - A powerful spider system.
  • cola - A distributed crawling framework.
  • Demiurge - PyQuery-based scraping micro-framework.
  • feedparser - Universal feed parser.
  • Grab - Site scraping framework.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - Visual scraping for Scrapy.
  • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A simple, easy spider using gevent and JS rendering.

Java

  • Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - An easy-to-use lightweight web crawler.
  • WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.

C#

  • ccrawler - Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.
  • SimpleCrawler - Simple spider based on multithreading and regular expressions.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.

JavaScript

  • simplecrawler - Event driven web crawler.
  • node-crawler - Node-crawler has a clean, simple API.
  • js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.

PHP

  • Goutte - A screen scraping and web crawling library for PHP.
    • laravel-goutte - Laravel 5 Facade for Goutte.
  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.

C++

  • open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler (spider) framework, inspired by Scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

This list is still being updated. For the latest resources, see the Git repository: https://github.com/BruceDone/awesome-crawler

Reposted from: https://www.cnblogs.com/codefish/p/5947165.html

