Crawling KuGou Featured Playlists with PHPCrawler

news/2024/7/19 10:21:37 Tags: PHP, crawler, search engine, spider, PHPCrawler

1. PHPCrawler: Introduction and Installation

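First, what is crawling?
Crawling is the work of a web crawler, also commonly called a web spider. A crawler is an important component of a search engine: it fetches and downloads pages and other information from the internet according to defined logic and algorithms. A typical crawler begins at a start URL and crawls according to some strategy, putting each newly discovered URL into a crawl queue and then starting the next round of crawling, until nothing is left to fetch.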
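
The crawl loop described above (start URL, queue of newly found URLs, repeat until the queue is empty) can be sketched in a few lines of PHP. This is only an illustration under stated assumptions, not how PHPCrawler is implemented internally: the `$pages` array is a hypothetical in-memory stand-in for the web, and `crawl()` is a made-up helper name. A real crawler would download each page and extract its links instead of looking them up in an array.

```php
<?php
// Hypothetical in-memory "web": each URL maps to the links found on that page.
$pages = array(
    "/index"  => array("/list/1", "/list/2"),
    "/list/1" => array("/song/a", "/index"),
    "/list/2" => array("/song/a", "/song/b"),
    "/song/a" => array(),
    "/song/b" => array(),
);

// Breadth-first crawl starting at $startUrl; returns URLs in visiting order.
function crawl(array $pages, $startUrl)
{
    $queue   = array($startUrl);          // URLs waiting to be crawled
    $seen    = array($startUrl => true);  // URLs already queued, to avoid loops
    $visited = array();                   // order in which pages were crawled

    while (count($queue) > 0) {
        $url = array_shift($queue);       // take the oldest URL in the queue
        $visited[] = $url;
        $links = isset($pages[$url]) ? $pages[$url] : array();
        foreach ($links as $link) {
            if (!isset($seen[$link])) {   // queue each new URL exactly once
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

$order = crawl($pages, "/index");
echo implode("\n", $order), "\n";
```

The `$seen` set is what keeps the crawl from looping when pages link back to each other, as `/list/1` does to `/index` here.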
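PHPCrawler is an open-source crawler library developed outside China. Its source code is hosted on SourceForge, where it can also be downloaded. Choose the release that matches the PHP version installed on your machine. After the download finishes, extract the archive into the web server's document root, then copy example.php and rename the copy.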

2. Complete Source Code

<?php

// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl-mainclass
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method 
class MyCrawler extends PHPCrawler 
{
  // Parse the page content here
  function handleDocumentInfo($DocInfo) 
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
    
    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;
    
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb; 
    
    // Now do something with the content of the received page ($DocInfo->source):
    // playlist detail pages are handed to parseSonglistDetail()
    //echo $DocInfo->source;
    //echo $lb;
    $url=$DocInfo->url;
    $pat="/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
    if(preg_match($pat,$url)>0){
      $this->parseSonglistDetail($DocInfo);
    }
    flush();
  } 

  // Extract the playlist fields from a playlist detail page
  public function parseSonglistDetail($DocInfo){
    $songlistArr=array();
    $songlistArr['raw_url']=$DocInfo->url;
    $content=$DocInfo->content;
    // Title. [^<]+ captures the text up to the next tag. The class [^(<br)]+
    // used previously excluded the single characters "(", "<", "b", "r" and ")",
    // so any value containing one of them failed to match.
    $matches=array();
    $pat="/<span>名称:<\/span>([^<]+)<br \/>/";
    $res=preg_match($pat, $content,$matches);
    if($res>0){
      $songlistArr['title']=$matches[1];
    }else{
      $songlistArr['title']="";
      print "error:get title fail<br/>";
    }
    // Creator
    $matches=array();
    $pat="/<span>创建人:<\/span>([^<]+)<br \/>/";
    $res=preg_match($pat, $content,$matches);
    if($res>0){
      $songlistArr['creator']=$matches[1];
    }else{
      $songlistArr['creator']="";
      print "error:get creator fail<br/>";
    }
    // Creation date (the page labels this field 更新时间, "update time")
    $matches=array();
    $pat="/<span>更新时间:<\/span>([^<]+)<br \/>/";
    $res=preg_match($pat, $content,$matches);
    if($res>0){
      $songlistArr['create_date']=$matches[1];
    }else{
      $songlistArr['create_date']="";
      print "error:get create_date fail<br/>";
    }
    // Description
    $matches=array();
    $pat="/<span>简介:<\/span>([^<]*)<\/p>/";
    $res=preg_match($pat, $content,$matches);
    if($res>0){
      $songlistArr['info']=$matches[1];
    }else{
      $songlistArr['info']="";
      print "error:get info fail<br/>";
    }
    // Songs. 'songs' is always an array so that saveSonglist() can iterate
    // over it even when nothing matched.
    $matches=array();
    $pat="/<a title=\"([^\"]+)\" hidefocus=\"/";
    $res=preg_match_all($pat, $content,$matches);
    $songlistArr['songs']=array();
    if($res>0){
      foreach($matches[1] as $song_title){
        $songlistArr['songs'][]=array('title'=>$song_title);
      }
    }else{
      print "error:get song fail<br/>";
    }

    echo "<pre>";
    print_r($songlistArr);
    echo "</pre>";
    $this->saveSonglist($songlistArr);
  }


  // Store the playlist and its songs in MySQL
  public function saveSonglist($songlistArr){
    // Connect to the database. The mysql_* functions were removed in PHP 7,
    // so mysqli with prepared statements is used here instead.
    $conn=mysqli_connect("localhost","root","root","songlist");
    if(!$conn){
      print "error:db connect fail<br/>";
      return;
    }
    mysqli_set_charset($conn,"utf8");
    // Insert the playlist row. Bound parameters replace manual escaping.
    // The column name creat_time is kept as written in the original query;
    // adjust it if your table uses a different name.
    $sql="insert into songlist set title=?,creat_time=?,creator=?,raw_url=?,info=?";
    $stmt=mysqli_prepare($conn,$sql);
    mysqli_stmt_bind_param($stmt,"sssss",
      $songlistArr['title'],$songlistArr['create_date'],$songlistArr['creator'],
      $songlistArr['raw_url'],$songlistArr['info']);
    mysqli_stmt_execute($stmt);
    $songlist_id=mysqli_insert_id($conn);
    mysqli_stmt_close($stmt);
    // Insert one row per song, linked to the playlist by songlist_id
    $songs=isset($songlistArr['songs']) ? $songlistArr['songs'] : array();
    $stmt=mysqli_prepare($conn,"insert into song set title=?,songlist_id=?");
    foreach($songs as $song){
      $title=$song['title'];
      mysqli_stmt_bind_param($stmt,"si",$title,$songlist_id);
      mysqli_stmt_execute($stmt);
    }
    mysqli_stmt_close($stmt);
    mysqli_close($conn);
  }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler();

// URL to crawl (the start URL)
$start_url="www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Follow links to the individual playlist detail pages (i = case-insensitive)
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html# i");
// Follow links to the playlist index pages (next page)
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-0-2\.html# i");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site).
// A limit of 0 means unlimited.
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
    
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb; 
?>
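
The field extraction in parseSonglistDetail() can be tried without running a crawl. The sketch below is a minimal illustration, assuming a made-up HTML fragment (`$html`) and a hypothetical helper named `extractField()`; the real KuGou markup may differ. It also shows a pitfall that is easy to hit with this kind of pattern: a character class such as `[^(<br)]` excludes the individual characters `(`, `<`, `b`, `r` and `)`, not the string `<br`, so it fails on ordinary values that contain those letters.

```php
<?php
// Hypothetical fragment, shaped like the markup the patterns in the listing expect.
$html = '<p><span>名称:</span>Best of rock (2014)<br />'
      . '<span>创建人:</span>bob<br /></p>'
      . '<a title="Song A" hidefocus="true">Song A</a>'
      . '<a title="Song B" hidefocus="true">Song B</a>';

// Return the text that follows <span>LABEL</span> up to the next tag,
// or an empty string when the label is not present.
function extractField($html, $label)
{
    $pat = '/<span>' . preg_quote($label, '/') . '<\/span>([^<]+)<br \/>/';
    return preg_match($pat, $html, $m) === 1 ? trim($m[1]) : "";
}

$title   = extractField($html, "名称:");
$creator = extractField($html, "创建人:");
$missing = extractField($html, "简介:");   // not in the fragment

// Collect every song title from the title="..." attribute.
preg_match_all('/<a title="([^"]+)" hidefocus="/', $html, $m);
$songs = $m[1];

// The character-class variant finds nothing for the same title,
// because "rock" contains an excluded "r".
$naive = preg_match('/<span>名称:<\/span>([^(<br)]+)<br \/>/', $html);

var_dump($title, $creator, $missing, $songs, $naive);
```

Regular expressions are fragile against markup changes; for anything beyond a quick script, PHP's DOM extension (`DOMDocument` with `DOMXPath`) is the more robust choice.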



