July 27, 2009

Web scraping PHP class for web developers

By admin in php

Programmers are lazy by nature, and that is exactly what makes them good programmers: they are so lazy that they don’t want to do the same thing manually over and over again.

I’ve had this PHP class for years. It gives me a quick overview of a website’s CSS without having to go through the site and look everything up myself. Combined with a build script that generates a CSS file of empty selectors from the scraper’s findings, it has saved me countless hours of stupid, monotonous work.

Basically, it scrapes the following things from any webpage:

  1. Internal CSS styles (the contents of inline style attributes)
  2. External CSS files (href values ending in .css)
  3. All defined IDs
  4. All defined classes
  5. The contents of all spans

You can use it like this:

require_once 'scraper.php';

// Fetches the page once; every getter below works on that copy
$site = new Scraper('http://www.welt.de');

// Each getter returns array(0 => list of matches, 1 => number of matches)
$excss = $site->getExternalCSS();
$incss = $site->getInternalCSS();
$ids = $site->getIds();
$classes = $site->getClasses();
$spans = $site->getSpans();

print_r($excss);
print_r($incss);
print_r($ids);
print_r($classes);
print_r($spans);
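
The build script that turns these results into a CSS file isn't part of this post. As a rough idea of what such a step could look like, here is a minimal sketch. The function name buildCssSkeleton and the sample data are made up for illustration; the only thing it assumes from the class is that the first element of each getter's result is the list of matches.

```php
<?php
// Hypothetical helper, not part of the Scraper class: turns scraped
// IDs and class names into a stylesheet of empty rules.
function buildCssSkeleton(array $ids, array $classes)
{
	$lines = array();

	// array_filter drops empty strings, array_unique drops repeats
	foreach (array_unique(array_filter($ids, 'strlen')) as $id) {
		$lines[] = '#' . $id . ' {}';
	}
	foreach (array_unique(array_filter($classes, 'strlen')) as $class) {
		$lines[] = '.' . $class . ' {}';
	}

	return implode("\n", $lines) . "\n";
}

// Sample data in the shape the getters return: array(matches, count)
$ids = array(array('header', 'footer', 'header'), 3);
$classes = array(array('nav', '', 'nav', 'teaser'), 4);

echo buildCssSkeleton($ids[0], $classes[0]);
```

With real data you would pass in $site->getIds() and $site->getClasses() instead of the sample arrays, and write the returned string to a file with file_put_contents().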

Here is an excerpt of my scraper class, ready to use as is.

class Scraper
{
	// Raw HTML of the fetched page
	private $html;

	public function __construct($url)
	{
		$html = file_get_contents($url);
		if ($html === false) {
			throw new RuntimeException("Could not fetch $url");
		}
		$this->html = $html;
	}

	// Runs $pattern against the page and returns
	// array(0 => contents of capture group 1, 1 => number of matches)
	private function scrape($pattern)
	{
		preg_match_all($pattern, $this->html, $patterns);
		return array($patterns[1], count($patterns[1]));
	}

	// Contents of every style="..." attribute
	public function getInternalCSS()
	{
		return $this->scrape('/\bstyle="(.*?)"/is');
	}

	// Every href="..." value that ends in .css
	public function getExternalCSS()
	{
		return $this->scrape('/\bhref="([^"]+\.css)"/i');
	}

	// Every id="..." value, hyphens included
	public function getIds()
	{
		return $this->scrape('/\bid="([\w-]*)"/i');
	}

	// Every class name; class="a b" yields both a and b
	public function getClasses()
	{
		list($values) = $this->scrape('/\bclass="([^"]*)"/i');
		$classes = preg_split('/\s+/', implode(' ', $values), -1, PREG_SPLIT_NO_EMPTY);
		return array($classes, count($classes));
	}

	// Contents of every <span>, with or without attributes
	// (non-greedy, so nested spans end at the first </span>)
	public function getSpans()
	{
		return $this->scrape('/<span\b[^>]*>(.*?)<\/span>/is');
	}

}
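
Regular expressions are quick to write but easy to trip up on real-world markup (single-quoted attributes, for example). A different approach, not what the class above does, is to let PHP's DOM extension parse the page and query it with XPath. The following is only a sketch under the assumption that the DOM extension is enabled; the function name scrapeIdsAndClasses and the sample HTML are made up for illustration.

```php
<?php
// Sketch only: collects IDs and class names with the DOM extension
// instead of regular expressions. The function name is hypothetical.
function scrapeIdsAndClasses($html)
{
	$doc = new DOMDocument();

	// Real-world HTML is rarely valid, so keep parser warnings quiet
	$previous = libxml_use_internal_errors(true);
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$xpath = new DOMXPath($doc);
	$ids = array();
	$classes = array();

	// Every element that carries an id attribute
	foreach ($xpath->query('//*[@id]') as $node) {
		$ids[] = $node->getAttribute('id');
	}

	// A class attribute may hold several space-separated names
	foreach ($xpath->query('//*[@class]') as $node) {
		$names = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
		$classes = array_merge($classes, $names);
	}

	return array('ids' => $ids, 'classes' => $classes);
}

$html = '<html><body><div id="main" class="box wide"><span class=\'note\'>hi</span></div></body></html>';
print_r(scrapeIdsAndClasses($html));
```

The parser does not care how an attribute is quoted, so the single-quoted class in the sample is picked up as well.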


2 Responses to “Web scraping PHP class for web developers”

  1. svnlabs says:

    cool article…

    Try some tricks related to DOM…

    SV

  2. keyur patel says:

    please provide such useful articles on regular basis

    thanks for this one
