
How to make a spider-bot in PHP?

Last Updated: 22 Nov, 2019

In this article, we’ll see how to make both a simple and a relatively advanced web-crawler (or spider-bot) in PHP. The simple one will just output all the links it finds on a webpage, while the advanced one will also add the titles, keywords and descriptions to a conceptual database (conceptual means no SQL database is actually used in this article). Even our advanced web-crawler is a simple one compared to Google’s, since it doesn’t use any AI agent! We’ll go through a total of three iterations before concluding the article, each one with an explanation.

Note: Throughout this article, I’ll use the words spider-bot and web-crawler interchangeably. Some people may use them in a different context but in this article, both words mean essentially the same thing.

There are lots of things you can do to improve this spider-bot and make it more advanced – add functionality like maintaining a popularity index, or implement anti-spam features like penalizing websites with no content or websites that use “click-bait” strategies such as stuffing in keywords that have nothing to do with the content of the page. You could also try to generate the keywords and description from the page itself, which is something GoogleBot does all the time – a small sketch of that idea follows below.
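
The sketch below is one hedged way to generate keywords from page text: a word-frequency count over the body of an already-loaded DOMDocument. The helper name extractKeywords, the minimum word length and the stop-word list are all my own choices for illustration – nothing like this appears in the crawlers later in the article.

    <?php
    // Hypothetical helper (not used by the crawlers below): derive candidate
    // keywords from the visible text of a page by simple word frequency.
    function extractKeywords($doc, $count = 5) {
        $body = $doc->getElementsByTagName('body');
        if ($body->length == 0) return array();
        $text = strtolower($body->item(0)->textContent);

        // Split the text on anything that is not a letter or a digit
        $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

        // A tiny stop-word list, purely for illustration
        $stopWords = array('the', 'and', 'for', 'that', 'with', 'this', 'from');

        $freq = array();
        foreach ($words as $w) {
            if (strlen($w) < 4 or in_array($w, $stopWords)) continue;
            $freq[$w] = isset($freq[$w]) ? $freq[$w] + 1 : 1;
        }
        arsort($freq); // highest frequency first
        return array_slice(array_keys($freq), 0, $count);
    }
    ?>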

A Simple SpiderBot: The simple version will be non-recursive and will simply print all the links it finds in a web-page. Note that all of our main logic will happen in the followLink function!

  • Program:




    <?php
    function followLink($url) {
      
        // We need these options when creating the context
        $options = array(
            'http' => array(
                'method' => "GET",
                'user_agent' => "gfgBot/0.1"
            )
        );
      
        // Create context for communication
        $context = stream_context_create($options);
      
        // Create a new HTML DomDocument for web-scraping
        $doc = new DomDocument();
      
        @$doc->loadHTML(file_get_contents($url, false, $context));
      
        // Get all the anchor nodes in the DOM
        $links = $doc->getElementsByTagName('a');
      
        // Iterate through all the anchor nodes
        // found in the document
        foreach ($links as $i)
            echo $i->getAttribute('href') . '<br/>';
    }
      
    followLink("http://example.com");
    ?>

    
    

  • Output: Now, this was no good – we get only one link – that’s because we only have one link on the site example.com and, since we are not recursing, we don’t follow the link that we got. You could run followLink("http://apple.com") if you want to see it in complete action. If, however, you use geeksforgeeks.org then you may get an error, since GeeksforGeeks will block our request (for security reasons, of course).
    https://www.iana.org/domains/example

Explanation:

  • Line 3: We are creating an $options array. You don’t have to understand much about it other than that it is required for context-creation. Note that the user-agent name is gfgBot – you can change this to whatever you like. You could even use GoogleBot to fool a website into thinking that your crawler is Google’s spider-bot, as long as that website identifies bots only by their user-agent string. (A sketch of a few more context options follows after this list.)
  • Line 10: We create the context for communication. For almost anything you need a context – to tell a story you need context, to create a window in OpenGL you need a context – same for the HTML5 Canvas, and the same for PHP network communication! Sorry if I went out of “context” there, but I had to do it.
  • Line 13: Create a DomDocument which is basically a data structure for DOM handling used generally for HTML and XML files.
  • Line 14: We load the HTML by providing the contents of the document. This step can raise a lot of warnings when the markup is not well-formed (which is very common on the web), so we suppress them with the @ operator.
  • Line 17: We get a list (a DOMNodeList) of all the anchor nodes found in the DOM.
  • Line 21: We print all the links that those anchor nodes reference.
  • Line 24: We get all the links in the website example.com. It has only one link, which is printed.
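
The same $options array can carry a few more HTTP context settings if you want slightly more control over the request. The sketch below is only a suggestion – the timeout value and the extra Accept-Language header are arbitrary choices, not something the article relies on; all of the option names used here are documented HTTP context options in PHP.

    <?php
    // A slightly richer stream context - a sketch only; the timeout and the
    // extra header are arbitrary choices, not part of the article's crawler.
    $options = array(
        'http' => array(
            'method'          => "GET",
            'user_agent'      => "gfgBot/0.1",
            'header'          => "Accept-Language: en\r\n",
            'timeout'         => 10,    // give up after 10 seconds
            'follow_location' => 1,     // follow HTTP redirects
            'ignore_errors'   => true   // still read the body on 4xx/5xx
        )
    );
    $context = stream_context_create($options);
    $html = @file_get_contents("http://example.com", false, $context);
    ?>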

A slightly more complicated Spider-Bot: In the previous code we had a basic spider-bot and it was good, but it was more of a scraper than a crawler (for the difference between a scraper and a crawler, see this article). We weren’t recursing – we weren’t “following” the links that we got. So in this iteration we’ll do just that, and we’ll also assume we have a database in which we’d insert the links (for indexing). Any link will be inserted into the database via the insertIntoDatabase function!

  • Program:




    <?php
      
        // List of all the links we have crawled!
        $crawledLinks = array();
          
        function followLink($url, $depth = 0){
      
            global $crawledLinks;
            $crawling = array();
      
            // Give up to prevent any seemingly infinite loop
            if ($depth>5){
                echo "<div style='color:red;'>The Crawler is giving up!</div>";
                return;
            }
              
            $options = array(
                'http' => array(
                    'method' => "GET",
                    'user_agent' => "gfgBot/0.1"
                )
            );
      
            $context = stream_context_create($options);
            $doc = new DomDocument();
            @$doc->loadHTML(file_get_contents($url, false, $context));
            $links = $doc->getElementsByTagName('a');
      
            foreach ($links as $i){
      
                $link = $i->getAttribute('href');
      
                if (ignoreLink($link)) continue;
      
                $link = convertLink($url, $link);
                  
                if (!in_array($link, $crawledLinks)){
                    $crawledLinks[] = $link;
                    $crawling[] = $link;
                    insertIntoDatabase($link, $depth);
                }
            }
            foreach ($crawling as $crawlURL){
                echo ("<span style='color:grey;margin-left:".(10*$depth).";'>".
                    "[+] Crawling <u>$crawlURL</u></span><br/>");
                followLink($crawlURL, $depth+1);
            }
      
            if (count($crawling)==0)
                echo ("<span style='color:red;margin-left:".(10*$depth).";'>".
                    "[!] Didn't Find any Links in <u>$url!</u></span><br/>");
        }
      
        // Converts Relative URL to Absolute URL
        // No conversion is done if it is already in Absolute URL
        function convertLink($site, $path){
            if (substr_compare($path, "//", 0, 2) == 0)
                return parse_url($site)['scheme'].':'.$path;
            elseif (substr_compare($path, "http://", 0, 7) == 0 or
                substr_compare($path, "https://", 0, 8) == 0 or 
                substr_compare($path, "www.", 0, 4) == 0)
      
                return $path; // Absolutely an Absolute URL!!
            else
                return $site.'/'.$path;
        }
      
        // Whether or not we want to ignore the link
        function ignoreLink($url){
            return $url == "" or $url[0] == "#" or substr($url, 0, 11) == "javascript:";
        }
      
        // Print a message and insert into the array/database!
        function insertIntoDatabase($link, $depth){
            echo (
                "<span style='margin-left:".(10*$depth)."px'>".
                "Inserting new Link:- <span style='color:green'>$link".
                "</span></span><br/>"
            );
            // No real database here - the caller has already recorded this
            // link in the global $crawledLinks array
        }
      
        followLink("http://guimp.com/")
    ?>

    
    

  • Output:
    Inserting new Link:- http://guimp.com//home.html
    [+] Crawling http://guimp.com//home.html
      Inserting new Link:- http://www.guimp.com
        Inserting new Link:- http://guimp.com//home.html/pong.html
        Inserting new Link:- http://guimp.com//home.html/blog.html
          [+] Crawling http://www.guimp.com
            Inserting new Link:- http://www.guimp.com/home.html
            [+] Crawling http://www.guimp.com/home.html
              Inserting new Link:- http://www.guimp.com/home.html/pong.html
              Inserting new Link:- http://www.guimp.com/home.html/blog.html
              [+] Crawling http://www.guimp.com/home.html/pong.html
                [!] Didn't Find any Links in http://www.guimp.com/home.html/pong.html!
              [+] Crawling http://www.guimp.com/home.html/blog.html
                [!] Didn't Find any Links in http://www.guimp.com/home.html/blog.html!
          [+] Crawling http://guimp.com//home.html/pong.html
            [!] Didn't Find any Links in http://guimp.com//home.html/pong.html!
          [+] Crawling http://guimp.com//home.html/blog.html
            [!] Didn't Find any Links in http://guimp.com//home.html/blog.html!

Explanation:

  • Line 3: Create a global array – $crawledLinks – which contains all the links that we have captured in the session. We use it to look up whether a link is already in the database. Note that in_array() does a linear scan of the array; since PHP arrays are hash tables under the hood, storing each URL as an array key and checking it with isset() would be a constant-time lookup and is usually faster for large crawls (see the sketch after this list).
  • Line 8: We tell the interpreter that we are using the global array $crawledLinks that we just created! And in the next line we create a new array $crawling which would simply contain all the links that we are currently crawling over.
  • Line 31: We ignore links that don’t point to another page, such as in-page fragment links (starting with “#”) and javascript: pseudo-links. The function doesn’t check every possible case (that would make it very long), only these two most common ones.
  • Line 33: We convert a relative link into an absolute link, and also handle protocol-relative URLs (//wikipedia.org becomes http://wikipedia.org or https://wikipedia.org depending on the scheme of the original URL).
  • Line 35: We simply check if the $link that we are iterating is not already in the database. If it is then we ignore it – if not then we add it to the database as well as in the $crawling array so that we could follow the links in that URL as well.
  • Line 43: Here the crawler recurses. It follows all the links that it has to follow (links that were added in the $crawling array).
  • Line 83: We call followLink("http://guimp.com/"); we use http://guimp.com/ as the starting point simply because it happens to be (or claims to be) the smallest website in the world.
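
As mentioned in the first point above, a keyed lookup is the usual alternative to in_array(). A minimal sketch of that idea follows – the helper names alreadyCrawled and markCrawled are mine, not the article’s.

    <?php
    // Sketch: store the URL itself as the array key, so the "have we seen
    // this link before?" check is a constant-time isset() instead of a
    // linear in_array() scan over the whole list.
    $crawledLinks = array();
      
    function alreadyCrawled($url) {
        global $crawledLinks;
        return isset($crawledLinks[$url]);
    }
      
    function markCrawled($url) {
        global $crawledLinks;
        $crawledLinks[$url] = true;
    }
    ?>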

More advanced spider-bot: In the previous iteration, we recursively followed all the links that we got on a page and added them to a database (which was just an array). But we added only the URL to the database; search-engines, however, keep a lot of fields for each page – the thumbnail, the author information, the date and time, and most importantly the title of the page and the keywords. Some even keep a cached copy of the page for faster search. For the sake of simplicity, we will only scrape out the title, description and keywords of the page.

Note: It’s left to you which database you use – PostgreSQL, MariaDB, etc, we’ll only output Inserting URL/Text, etc since handling with external databases are out of this article’s scope!

The description and keywords are present in the meta tags. Some search-engines base their search (almost) entirely on the meta-data, while others don’t give it much weight. Google doesn’t even take it into consideration; its search is based on the popularity and relevance of a page (using the PageRank algorithm), and the keywords and description are generated rather than extracted from the meta tags. Google doesn’t penalize a website without any description or keywords, but it does penalize websites without titles. Our conceptual search engine (which would be built on top of this “advanced” spider-bot) will do the opposite: it will penalize websites without a description and keywords (it will still add them to the database, but give them a lower ranking), and it will not penalize websites without titles – it will simply use the URL of the page as the title. A tiny sketch of that penalty rule follows below.
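
As an illustration of that policy only: the weights in this scoring sketch are made up, and nothing like it appears in the program below, which does no ranking at all.

    <?php
    // Sketch: score a stored record, penalising a missing description or
    // missing keywords. The weights are arbitrary; the crawler below only
    // stores records and never ranks them.
    function scoreRecord($record) {
        $score = 1.0;
        if ($record['description'] == 'No Description Available') $score -= 0.3;
        if (trim($record['keywords']) == '') $score -= 0.2;
        // No penalty for a missing title: the URL was stored as the title instead.
        return $score;
    }
    ?>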

  • Program:




    <?php
        $crawledLinks=array();
        const MAX_DEPTH=5;
          
        function followLink($url, $depth=0){
            global $crawledLinks;
            $crawling=array();
            if ($depth>MAX_DEPTH){
                echo "<div style='color:red;'>The Crawler is giving up!</div>";
                return;
            }
            $options=array(
                'http'=>array(
                    'method'=>"GET",
                    'user_agent'=>"gfgBot/0.1"
                )
            );
            $context=stream_context_create($options);
            $doc=new DomDocument();
            @$doc->loadHTML(file_get_contents($url, false, $context));
            $links=$doc->getElementsByTagName('a');
            $pageTitle=getDocTitle($doc, $url);
            $metaData=getDocMetaData($doc);
            foreach ($links as $i){
                $link=$i->getAttribute('href');
                if (ignoreLink($link)) continue;
                $link=convertLink($url, $link);
                if (!in_array($link, $crawledLinks)){
                    $crawledLinks[]=$link;
                    $crawling[]=$link;
                    insertIntoDatabase($link, $pageTitle, $metaData, $depth);
                }
            }
            foreach ($crawling as $crawlURL)
                followLink($crawlURL, $depth+1);
        }
          
        function convertLink($site, $path){
            if (substr_compare($path, "//", 0, 2)==0)
                return parse_url($site)['scheme'].':'.$path;
            elseif (substr_compare($path, "http://", 0, 7)==0 or
                substr_compare($path, "https://", 0, 8)==0 or 
                substr_compare($path, "www.", 0, 4)==0)
                return $path;
            else
                return $site.'/'.$path;
        }
      
        function ignoreLink($url){
            return $url=="" or $url[0]=="#" or substr($url, 0, 11) == "javascript:";
        }
      
        function insertIntoDatabase($link, $title, &$metaData, $depth){
            echo (
                "Inserting new record {URL= $link".
                ", Title = '$title'".
                ", Description = '".$metaData['description'].
                "', Keywords = ' ".$metaData['keywords'].
                "'}<br/><br/><br/>"
            );
            // No real database here - the caller has already recorded this
            // link in the global $crawledLinks array
        }
      
        function getDocTitle(&$doc, $url){
            $titleNodes=$doc->getElementsByTagName('title');
            if ($titleNodes->length==0)
                return $url;
            // Collapse newlines in the title and trim surrounding whitespace
            $title=trim(str_replace("\n", ' ', $titleNodes->item(0)->nodeValue));
            return (strlen($title)<1)?$url:$title;
        }
      
        function getDocMetaData(&$doc){
            $metaData=array();
            $metaNodes=$doc->getElementsByTagName('meta');
            foreach ($metaNodes as $node)
                $metaData[$node->getAttribute("name")]
                         = $node->getAttribute("content");
            if (!isset($metaData['description']))
                $metaData['description']='No Description Available';
            if (!isset($metaData['keywords'])) $metaData['keywords']='';
            return array(
                'keywords'=>trim(str_replace("\n", ' ', $metaData['keywords'])),
                'description'=>trim(str_replace("\n", ' ', $metaData['description']))
            );
        }
      
        followLink("http://example.com/")
    ?>

    
    

  • Output:
    Inserting new record {URL= https://www.iana.org/domains/example, 
    Title = 'Example Domain', Description = 'No Description Available',
     Keywords = ' '}

Explanation: There is no ground-breaking change here, but I’d still like to explain a few things:

  • Line 3: We are creating a new global constant MAX_DEPTH. Previously we simply used 5 as the maximum depth, but this time we use the MAX_DEPTH constant instead.
  • Line 22 & Line 23: We get the title of the page into $pageTitle, and the description and keywords into the $metaData variable (an associative array). Refer to the getDocTitle and getDocMetaData functions to see how that information is extracted (a small usage sketch follows after this list).
  • Line 31: We pass in some extra parameters to the insertIntoDatabase function.
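
To see getDocTitle() and getDocMetaData() in isolation, here is a small usage sketch that runs them on a hand-written HTML snippet instead of a fetched page. The markup is made up, and the two helpers are assumed to be the ones defined in the program above.

    <?php
    // Sketch: exercise the two helpers from the program above on an inline
    // HTML string. Requires getDocTitle() and getDocMetaData() to be defined.
    $html = '<html><head>
               <title>Demo Page</title>
               <meta name="description" content="A tiny demo document">
               <meta name="keywords" content="demo, php, crawler">
             </head><body></body></html>';
      
    $doc = new DomDocument();
    @$doc->loadHTML($html);
      
    echo getDocTitle($doc, "http://example.com/demo") . "<br/>"; // Demo Page
    $meta = getDocMetaData($doc);
    echo $meta['description'] . "<br/>"; // A tiny demo document
    echo $meta['keywords'] . "<br/>";    // demo, php, crawler
    ?>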

Issues with our web-crawler: We have created this web-crawler only for learning. Deploying it into production code (like making a search-engine out of it) can create some serious problems. The following are some issues with our web-crawler:

  1. It isn’t scalable Our web-crawler cannot crawl billions of web-pages like GoogleBot.
  2. It doesn’t quite obey the standard of crawler communication with websites. It doesn’t follow the robots.txt for a site and will crawl a site even if the site administrator requests not to do so.
  3. It is not automatic. Sure it will “automatically” get all the URLs of the current-page and crawl each one of them but it’s not exactly automatic. It doesn’t have any concept of Crawl Frequency.
  4. It is not distributed. If two spider-bots are running then there’s currently no way that they could communicate with each other (to see if the other spider-bot is not crawling the same page)
  5. Parsing is way too simple. Our spider-bot will not handle encoded markup (or even encoded URLs for that matter)
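
For the robots.txt point above, a very rough check might look like the sketch below. It only refuses to crawl a host whose robots.txt disallows everything for all user-agents (“Disallow: /” under “User-agent: *”); a real crawler needs to parse per-path and per-agent rules as well. The helper name isCrawlingAllowed is my own.

    <?php
    // Very rough sketch of a robots.txt check. It only detects a blanket
    // "Disallow: /" under "User-agent: *"; real robots.txt parsing is richer.
    function isCrawlingAllowed($url) {
        $parts = parse_url($url);
        if (!isset($parts['scheme'], $parts['host'])) return false;
      
        $robotsUrl = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';
        $robots = @file_get_contents($robotsUrl);
        if ($robots === false) return true; // no robots.txt: assume allowed
      
        $appliesToUs = false;
        foreach (explode("\n", $robots) as $line) {
            $line = trim($line);
            if (stripos($line, 'User-agent:') === 0)
                $appliesToUs = (trim(substr($line, 11)) == '*');
            elseif ($appliesToUs and stripos($line, 'Disallow:') === 0
                    and trim(substr($line, 9)) == '/')
                return false;
        }
        return true;
    }
    ?>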

