Build your own search engine
In this article I will show you how to build your own search engine. OK, not really a search engine like Bing or Google. What we actually want to write is a little program called a web crawler. But this small piece of code generates all the data that search engines work with.
What do we want to do?
What do all search engines need as badly as humans need oxygen? Yes, data. You can never return a search result without data to search in. The question is: where can I get data? The answer is easy: from the internet.
Search engines like Google use so-called web crawlers. These are simple computer programs that jump from link to link and analyze the data of websites (the HTML code). They read the data and store it. Good web crawlers and good search engines store this data in an intelligent way, so that every search string gets a perfect answer, and as fast as possible. For our example we only want to store all links of an HTML page.
My first PHP web crawler
We need a few things to get our crawler running:
- a database where we can store our search results
- a function that returns the HTML code of a URL
- time…
<?php
$url = "http://www.developer-blog.net";
print_r(find_all_links($url));

// returns all unique link targets found in the HTML of the given URL
function find_all_links($url) {
    $htmlData = file_get_contents($url);
    if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
        return array_unique($links[1]);
    }
    return array();
}
?>
This code example is the core of our web crawler. The function find_all_links returns an array with all links of a given URL, or an empty array if there are no links. For this we use two important PHP functions:
- file_get_contents
This function returns the content of a file as a string. It can also be used for URLs.
- preg_match_all
This function allows us to test a string against a regular expression (regex for short). For more information about regular expressions please use a search engine. For our needs we define a regular expression that finds all <a href> tags in the HTML code. The matches are stored in the third parameter.
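To make it easier to see what the pattern captures, here is a minimal sketch that runs the same regular expression against a hard-coded HTML snippet (the snippet itself is made up for this example):

<?php
// Hypothetical sample HTML, only to show what the pattern captures
$htmlData = '<p><a href="https://example.com/a">A</a> and <a href=\'/relative/b\'>B</a></p>';

if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
    // $links[0] holds the full <a href=...> matches, $links[1] only the captured URLs
    print_r($links[1]); // Array ( [0] => https://example.com/a [1] => /relative/b )
}
?>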
Now we have the links. The next thing is to store them somewhere. For this we create a new MySQL database with a simple table:
CREATE TABLE `links` (
  `link` varchar(255) NOT NULL,
  `id` int(10) unsigned NOT NULL auto_increment,
  PRIMARY KEY (`id`),
  UNIQUE KEY `link` (`link`)
) ENGINE=MyISAM CHARSET=utf8;
With our new database we can extend our web crawler so that it stores its data in this table. The following code demonstrates how every single link is stored in the table.
<?php
$url = "http://www.developer-blog.net";

$mysqli = new mysqli("127.0.0.1", "root", "", "test");
if ($mysqli->connect_error) {
    echo "unable to connect to Database: " . mysqli_connect_error() . "\n";
    exit();
}

$links = find_all_links($url);
links_to_db($links, $mysqli);

// returns all unique link targets found in the HTML of the given URL
function find_all_links($url) {
    $htmlData = file_get_contents($url);
    if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
        return array_unique($links[1]);
    }
    return array();
}

// writes every link into the database, skipping links that are already stored
function links_to_db($links, $mysql) {
    foreach ($links as $link) {
        $allowed = true;

        // check if the link already exists
        $statement = "SELECT * FROM `links` WHERE link = ?";
        if ($stmt = $mysql->prepare($statement)) {
            $stmt->bind_param("s", $link);
            $stmt->execute();
            $stmt->store_result();
            if ($stmt->num_rows > 0) {
                $allowed = false;
            }
            $stmt->close();
        }

        // insert new links into the table
        if ($allowed) {
            $statement = "INSERT INTO `links` (link) VALUES (?)";
            if ($stmt = $mysql->prepare($statement)) {
                $stmt->bind_param("s", $link);
                $stmt->execute();
                $stmt->close();
            }
        }
    }
}
?>
The function links_to_db writes the links to the database. For each link we check whether it is already stored. That is important, because otherwise we would end up with many duplicate links. If we want to track how often a link occurs, we can add a column that counts it, as the sketch below shows.
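One possible way to do that counting is sketched below. It assumes you add an extra integer column to the table (called hits here, the name is only an example) and lets MySQL increment it via the UNIQUE key on link instead of the SELECT/INSERT pair used above:

<?php
// Sketch only: assumes a column `hits` was added first, e.g. with
// ALTER TABLE `links` ADD `hits` int unsigned NOT NULL DEFAULT 1;
function links_to_db($links, $mysql) {
    $statement = "INSERT INTO `links` (link, hits) VALUES (?, 1)
                  ON DUPLICATE KEY UPDATE hits = hits + 1";
    if ($stmt = $mysql->prepare($statement)) {
        foreach ($links as $link) {
            $stmt->bind_param("s", $link);
            $stmt->execute();
        }
        $stmt->close();
    }
}
?>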
That's all. Our web crawler is ready to grab all links from an HTML page and store them for search requests. That was only the first step. A real web crawler takes all links of the analyzed page and then checks every page that is linked. There it finds new links, and so on… If you modify our web crawler to do that, you will see that it runs until you stop it or your memory is full.
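If you want to try that jump-from-link-to-link behaviour yourself, here is a minimal sketch of such a loop. It reuses find_all_links from above, keeps a simple queue plus a list of visited URLs, and stops after a fixed number of pages so it does not actually run until your memory is full (the $maxPages limit and the filtering of non-absolute links are my own simplifications):

<?php
// Sketch of a breadth-first crawl; assumes find_all_links() from above is available.
$queue    = array("http://www.developer-blog.net");
$visited  = array();
$maxPages = 50; // safety limit, otherwise the loop only stops when you stop it

while (!empty($queue) && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    foreach (find_all_links($url) as $link) {
        // very naive: only follow absolute http(s) links we have not seen yet
        if (preg_match('/^https?:\/\//i', $link) && !isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}

print_r(array_keys($visited));
?>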
So what is missing for a real search engine?
As said before, our crawler only stores links. Search engines store much more data. They combine links, content and domains into a context. For this the text of a page is important (you may want to find it again). From a technical point of view we have to consider many things, for example meta information. If you want to read more about how search engines work, read about SEO. In the last weeks I have learned a lot about search engines and why some sites can be found better than others.
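As a small taste of what this "more data" could look like, here is a sketch that pulls the page title and the meta description out of the HTML with PHP's DOMDocument. The helper find_page_info is my own example; which elements are worth storing is a design decision of your search engine:

<?php
// Sketch: extract title and meta description of a page, in addition to the links.
function find_page_info($url) {
    $htmlData = file_get_contents($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($htmlData); // suppress warnings for sloppy real-world HTML

    $info = array("title" => "", "description" => "");

    $titles = $doc->getElementsByTagName("title");
    if ($titles->length > 0) {
        $info["title"] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("name")) === "description") {
            $info["description"] = $meta->getAttribute("content");
        }
    }
    return $info;
}

print_r(find_page_info("http://www.developer-blog.net"));
?>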
Alternatives
You can find many libraries and projects that deal with this topic. A really good library is called Caterpillar. I have never used it in my own projects, but I have learned from its source code.