Monday, August 04, 2014

Simple web spider with PHP Goutte | Z.Rashwani Blog

Last week we received an SEO analysis of one of our portals. The analysis included thorough statistics about website SEO measures, such as missing and duplicate <title>, <h1> and meta tags, broken and invalid links, duplicate content percentage, etc. It appears that the SEO agency that prepared the analysis used some sort of crawler to extract this information.
I liked the crawler idea and wanted to implement it in PHP. After some reading about web scraping and Goutte, I was able to write a similar web spider that extracts the needed information, and I want to share it in this post.

About web scraping and Goutte

Web scraping is a technique for extracting information from websites. It is very close to web indexing: the bots or web crawlers that search engines use perform a kind of scraping on web documents by following links, analyzing keywords, meta tags and URLs, and ranking pages according to relevancy, popularity, engagement, etc.
Goutte is a screen scraping and web crawling library for PHP. It provides an API to crawl websites and extract data from the HTML/XML responses. Goutte is a wrapper around Guzzle and several Symfony components such as BrowserKit, DomCrawler and CssSelector.
Here is a short description of some of the libraries that Goutte wraps:
    1. Guzzle: a framework for building RESTful web services; it provides a simple interface for performing cURL requests, along with other important features like persistent connections and streaming request and response bodies.
    2. BrowserKit: simulates the behaviour of a web browser, providing an abstract HTTP layer (request, response, cookie, etc.).
    3. DomCrawler: provides easy methods for DOM navigation and manipulation.
    4. CssSelector: provides an API to select elements using the same selectors used in CSS (selecting elements becomes extremely easy when combined with DomCrawler).
* These are the main components I am interested in for this post; other components such as Finder and Process are also used in Goutte.

Basic usage

Once you download Goutte (from here), you define a Client object; the client is used to send requests to a website and returns a crawler object, as in the snippet below:
require_once 'goutte.phar';
use Goutte\Client;
$url_to_traverse = 'http://zrashwani.com';
$client = new Client();
$crawler = $client->request('GET', $url_to_traverse);

Here I declared a client object and called request() to simulate a browser requesting the URL “http://zrashwani.com” using the GET HTTP method.
The request() method returns an object of type Symfony\Component\DomCrawler\Crawler, which can be used to select elements from the fetched HTML response.

But before processing the document, let’s ensure that this URL is a valid link, meaning that it returned a response code of 200:
$status_code = $client->getResponse()->getStatus();
if($status_code==200){
    //process the documents
}

The $client->getResponse() method returns a BrowserKit Response object that contains information about the response the client got, such as the headers (including the status code used here) and the response content.
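You can read other headers the same way; for example, here is a small sketch (my own addition) that checks the Content-Type header before parsing, just as the crawler class later in this post does:

$content_type = $client->getResponse()->getHeader('Content-Type');
if (strpos($content_type, 'text/html') !== false) {
    // the response is an HTML document, safe to process with the crawler
}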

In order to extract the document title, you can filter either by XPath or by CSS selector to get the value of your target HTML DOM element:
$crawler->filterXPath('html/head/title')->text()
// $crawler->filter('title')->text()
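Other SEO-relevant tags can be read in the same way; for instance, a small sketch (my own addition, not part of the original spider) for extracting the meta description, if one exists:

$description = '';
$meta = $crawler->filterXPath('//meta[@name="description"]');
if ($meta->count() > 0) {
    $description = trim($meta->attr('content'));
}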
In order to count the <h1> tags and get the contents of those that exist in the page:
$h1_count = $crawler->filter('h1')->count();
$h1_contents = array();
if ($h1_count) {
    $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use (&$h1_contents) {
        $h1_contents[$i] = trim($node->text()); // $h1_contents is captured by reference so the results are kept
    });
}

For SEO purposes there should be exactly one h1 tag in a page, and its content should contain the page's main keywords. The each() function is quite useful here: it can be used to loop over all matching elements, and it takes a closure as a parameter to perform a callback operation on each node.

PHP closures are anonymous functions introduced in PHP 5.3; they are very useful for callback functionality. You can refer to the PHP manual if you are new to closures.
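As a quick illustration (a generic example, not specific to Goutte), a closure can capture variables from the enclosing scope with the use keyword:

$prefix = 'Title: ';
$format = function ($text) use ($prefix) {
    return $prefix . trim($text);
};
echo $format('  My Page  '); // prints "Title: My Page"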

Application goals

After this brief introduction, I can begin explaining the spider's functionality. This crawler detects broken/invalid links in the website and extracts the <h1> and <title> tag values that are important for the SEO issues I have.
My simple crawler implements depth-limited search in order to avoid crawling large amounts of data, and works as follows:
    1. Read the initial URL to crawl, along with the depth of links to be visited.
    2. Crawl the URL and check the response code to determine that the link is not broken, then add it to an array containing the site links.
    3. Extract the <title> and <h1> tag contents in order to use their values later for reporting.
    4. Loop over all <a> tags inside the fetched document to extract their href attributes along with other data.
    5. Check that the depth limit has not been reached, that the current href has not been visited before, and that the link URL does not belong to an external site.
    6. Crawl each child link by repeating steps (2-5).
    7. Stop when the link depth limit is reached.

These steps are implemented in the SimpleCrawler class that I wrote (it is still a basic version and should be optimized further):
<?php
require_once 'goutte.phar';
use Goutte\Client;
class simpleCrawler {
    private $base_url;
    private $site_links;
    private $max_depth;
    public function __construct($base_url, $max_depth = 10) {
        if (strpos($base_url, 'http') === false) { // http protocol not included, prepend it to the base url
            $base_url = 'http://' . $base_url;
        }
        $this->base_url = $base_url;
        $this->site_links = array();
        $this->max_depth = $max_depth;
    }
    /**
     * checks the uri if can be crawled or not
     * in order to prevent links like "javascript:void(0)" or "#something" from being crawled again
     * @param string $uri
     * @return boolean
     */
    protected function checkIfCrawlable($uri) {
        if (empty($uri)) {
            return false;
        }
        $stop_links = array(//returned deadlinks
            '@^javascript\:void\(0\)$@',
            '@^#.*@',
        );
        foreach ($stop_links as $ptrn) {
            if (preg_match($ptrn, $uri)) {
                return false;
            }
        }
        return true;
    }
    /**
     * normalize link before visiting it
     * currently just remove url hash from the string
     * @param string $uri
     * @return string
     */
    protected function normalizeLink($uri) {
        $uri = preg_replace('@#.*$@', '', $uri);
        return $uri;
    }
    /**
     * initiate the crawling mechanism on all links
     * @param string $url_to_traverse
     */
    public function traverse($url_to_traverse = null) {
        if (is_null($url_to_traverse)) {
            $url_to_traverse = $this->base_url;
            $this->site_links[$url_to_traverse] = array( //initialize first element in the site_links 
                'links_text' => array("BASE_URL"),
                'absolute_url' => $url_to_traverse,
                'frequency' => 1,
                'visited' => false,
                'external_link' => false,
                'original_urls' => array($url_to_traverse),
            );
        }
        $this->_traverseSingle($url_to_traverse, $this->max_depth);
    }
    /**
     * crawling single url after checking the depth value
     * @param string $url_to_traverse
     * @param int $depth
     */
    protected function _traverseSingle($url_to_traverse, $depth) {
        //echo $url_to_traverse . chr(10);
        try {
            $client = new Client();
            $crawler = $client->request('GET', $url_to_traverse);
            $status_code = $client->getResponse()->getStatus();
            $this->site_links[$url_to_traverse]['status_code'] = $status_code;
            if ($status_code == 200) { // valid url and not reached depth limit yet            
                $content_type = $client->getResponse()->getHeader('Content-Type');                
                if (strpos($content_type, 'text/html') !== false) { //traverse children in case the response is an HTML document 
                   $this->extractTitleInfo($crawler, $url_to_traverse);
                   $current_links = array();
                   if (@$this->site_links[$url_to_traverse]['external_link'] == false) { // for internal uris, get all links inside
                      $current_links = $this->extractLinksInfo($crawler, $url_to_traverse);
                   }
                   $this->site_links[$url_to_traverse]['visited'] = true; // mark current url as visited
                   $this->traverseChildLinks($current_links, $depth - 1);
                }
            }
            
        } catch (Guzzle\Http\Exception\CurlException $ex) {
            error_log("CURL exception: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        } catch (Exception $ex) {
            error_log("error retrieving data from link: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        }
    }
    /**
     * after checking the depth limit of the links array passed
     * check if the link has not been visited/traversed yet, in order to traverse it
     * @param array $current_links
     * @param int $depth     
     */
    protected function traverseChildLinks($current_links, $depth) {
        if ($depth == 0) {
            return;
        }
        foreach ($current_links as $uri => $info) {
            if (!isset($this->site_links[$uri])) {
                $this->site_links[$uri] = $info;
            } else {
                $this->site_links[$uri]['original_urls'] = isset($this->site_links[$uri]['original_urls']) ? array_merge($this->site_links[$uri]['original_urls'], $info['original_urls']) : $info['original_urls'];
                $this->site_links[$uri]['links_text'] = isset($this->site_links[$uri]['links_text']) ? array_merge($this->site_links[$uri]['links_text'], $info['links_text']) : $info['links_text'];
                if (@$this->site_links[$uri]['visited']) { //already visited link
                    $this->site_links[$uri]['frequency'] = @$this->site_links[$uri]['frequency'] + @$info['frequency'];
                }
            }
            if (!empty($uri) && 
                !$this->site_links[$uri]['visited'] && 
                !isset($this->site_links[$uri]['dont_visit'])
                ) { //traverse those that have not been visited yet                
                $this->_traverseSingle($this->normalizeLink($current_links[$uri]['absolute_url']), $depth);
            }
        }
    }
    /**
     * extracting all <a> tags in the crawled document, 
     * and return an array containing information about links like: uri, absolute_url, frequency in document
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $url_to_traverse
     * @return array
     */
    protected function extractLinksInfo(Symfony\Component\DomCrawler\Crawler &$crawler, $url_to_traverse) {
        $current_links = array();
        $crawler->filter('a')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use (&$current_links) {
                    $node_text = trim($node->text());
                    $node_url = $node->attr('href');
                    $hash = $this->normalizeLink($node_url);
                    if (!isset($this->site_links[$hash])) {  
                        $current_links[$hash]['original_urls'][$node_url] = $node_url;
                        $current_links[$hash]['links_text'][$node_text] = $node_text;
                        
                        if (!$this->checkIfCrawlable($node_url)) {
                            // not a crawlable uri (e.g. javascript:void(0) or a bare hash), leave absolute_url unset
                        } elseif (!preg_match("@^http(s)?@", $node_url)) { //not an absolute link                            
                            $current_links[$hash]['absolute_url'] = $this->base_url . $node_url;
                        } else {
                            $current_links[$hash]['absolute_url'] = $node_url;
                        }
                        if (!$this->checkIfCrawlable($node_url)) {
                            $current_links[$hash]['dont_visit'] = true;
                            $current_links[$hash]['external_link'] = false;
                        } elseif ($this->checkIfExternal($current_links[$hash]['absolute_url'])) { // mark external urls                            
                            $current_links[$hash]['external_link'] = true;
                        } else {
                            $current_links[$hash]['external_link'] = false;
                        }
                        $current_links[$hash]['visited'] = false;
                        
                        $current_links[$hash]['frequency'] = isset($current_links[$hash]['frequency']) ? $current_links[$hash]['frequency'] + 1 : 1; // increase the counter
                    }
                    
                });
        if (isset($current_links[$url_to_traverse])) { // if page is linked to itself, ex. homepage
            $current_links[$url_to_traverse]['visited'] = true; // avoid cyclic loop                
        }
        return $current_links;
    }
    /**
     * extract information about document title, and h1
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $uri
     */
    protected function extractTitleInfo(Symfony\Component\DomCrawler\Crawler &$crawler$url) {
        $this->site_links[$url]['title'] = trim($crawler->filterXPath('html/head/title')->text());
        $h1_count = $crawler->filter('h1')->count();
        $this->site_links[$url]['h1_count'] = $h1_count;
        $this->site_links[$url]['h1_contents'] = array();
        if ($h1_count) {
            $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use ($url) {
                        $this->site_links[$url]['h1_contents'][$i] = trim($node->text());
                    });
        }
    }
    /**
     * getting information about links crawled
     * @return array
     */
    public function getLinksInfo() {
        return $this->site_links;
    }
    /**
     * check if the link leads to external site or not
     * @param string $url
     * @return boolean
     */
    public function checkIfExternal($url) {
        $base_url_trimmed = str_replace(array('http://', 'https://'), '', $this->base_url);
        if (preg_match("@http(s)?\://$base_url_trimmed@", $url)) { // base url is the first portion of the url, so it is internal
            return false;
        } else {
            return true;
        }
    }
}
?>
You can try this class's functionality as follows:
$simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
$simple_crawler->traverse();    
$links_data = $simple_crawler->getLinksInfo();
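For reference, each entry in the $links_data array ends up looking roughly like the sketch below (the keys come from the fields the class populates; the values are purely illustrative):

$example_entry = array(
    'links_text'    => array('Some Page'),
    'absolute_url'  => 'http://zrashwani.com/some-page',
    'frequency'     => 2,
    'visited'       => true,
    'external_link' => false,
    'original_urls' => array('/some-page'),
    'status_code'   => 200,
    'title'         => 'Some Page | Z.Rashwani Blog',
    'h1_count'      => 1,
    'h1_contents'   => array('Some Page'),
);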
The getLinksInfo() method returns an associative array containing information about each page crawled, such as the page URL, the <title> and <h1> tag contents, the status_code, etc. You can store these results any way you like; I prefer MySQL for simplicity, so that I can get the desired results with a query. I created the pages_crawled table as follows:
CREATE TABLE `pages_crawled` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `frequency` int(11) unsigned DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `status_code` int(11) DEFAULT NULL,
  `h1_count` int(11) unsigned DEFAULT NULL,
  `h1_content` text,
  `source_link_text` varchar(255) DEFAULT NULL,
  `original_urls` text,
  `is_external` tinyint(1) DEFAULT '0',
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=37 DEFAULT CHARSET=utf8
And here I store the traversed links into the MySQL table:
<?php 
error_reporting(E_ALL);
set_time_limit(300);
include_once ('../src/SimpleCrawler.php');
$url_to_crawl = $argv[1];
$depth = isset($argv[2])?$argv[2]:3;
if($url_to_crawl){
    
    echo "Begin crawling ".$url_to_crawl.' with links in depth '.$depth.chr(10);
    
    $start_time = time();    
    $simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
    $simple_crawler->traverse();    
    $links_data = $simple_crawler->getLinksInfo();
       
    $end_time = time();
    
    $duration = $end_time - $start_time;
    echo 'crawling approximate duration, '.$duration.' seconds'.chr(10);
    echo count($links_data)." unique links found".chr(10);
    
    mysql_connect('localhost', 'root', 'root');
    mysql_select_db('crawler_database');
    foreach($links_data as $uri=>$info){
        
        if(!isset($info['status_code'])){
            $info['status_code']=000;//tmp
        }
        
        $h1_contents = implode("\n\r", isset($info['h1_contents']) ? $info['h1_contents'] : array());
        $original_urls = implode("\n\r", isset($info['original_urls']) ? $info['original_urls'] : array());
        $links_text = implode("\n\r", isset($info['links_text']) ? $info['links_text'] : array());
        $is_external = $info['external_link']?'1':'0';
        $title = @$info['title'];
        $h1_count = isset($info['h1_count'])?$info['h1_count']:0;
        
        $sql_query = "insert into pages_crawled(url, frequency, status_code, is_external, title, h1_count, h1_content, source_link_text, original_urls)
values('$uri', {$info['frequency']}, {$info['status_code']}, {$is_external}, '{$title}', {$h1_count}, '$h1_contents', '$links_text', '$original_urls')";
        
        mysql_query($sql_query) or die($sql_query);
    }
}
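Note that this script uses the legacy mysql_* functions and interpolates values directly into the SQL string. If you adapt it, a safer equivalent using PDO with a prepared statement might look like this (a sketch assuming the same database, table and variables as above):

$pdo = new PDO('mysql:host=localhost;dbname=crawler_database;charset=utf8', 'root', 'root');
$stmt = $pdo->prepare('insert into pages_crawled
    (url, frequency, status_code, is_external, title, h1_count, h1_content, source_link_text, original_urls)
    values (?, ?, ?, ?, ?, ?, ?, ?, ?)');
$stmt->execute(array($uri, $info['frequency'], $info['status_code'], $is_external,
    $title, $h1_count, $h1_contents, $links_text, $original_urls));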

Running the spider

Now let me try out the spider on my blog URL, with the depth of links to be visited set to 2:
C:\xampp\htdocs\Goutte\web>php -f test.php zrashwani.com 2
Now I can get the important information I need using simple SQL queries on the pages_crawled table, as follows:
mysql> select count(*) from pages_crawled where h1_count >1;
+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) as c, title from pages_crawled group by title having c > 1;
+---+----------------------------------------------------------+
| c | title                                                    |
+---+----------------------------------------------------------+
| 2 | Z.Rashwani Blog | I write here whatever comes to my mind |
+---+----------------------------------------------------------+
1 row in set (0.02 sec)

In the first query I counted the pages with duplicate h1 tags (I found a lot; I will consider changing the HTML structure of my blog a little bit), and in the second one I returned the duplicated page titles.
Now we can derive many other statistics on the traversed pages from the information we collected.
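For example, since the crawler also records the status code of every link, a query like the following (another illustrative example) would list the broken or invalid links:

mysql> select url, status_code from pages_crawled where status_code <> 200;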

Conclusion

In this post I explained how to use Goutte for web scraping, using a real-world example that I encountered in my job. Goutte can easily be used to extract a great amount of information about any webpage through its simple API for requesting pages, analyzing the responses and extracting specific data from the DOM document.
I used Goutte to extract information that can serve as SEO measures for the specified website, and stored it in a MySQL table so that any report or statistic can be derived from it with a query.

Update

Thanks to Josh Lockhart, this code has been modified for Composer and Packagist, and it is now available on GitHub: https://github.com/codeguy/arachnid
