Build your own search engine

In this article I will show you how to build your own search engine. OK, not really a search engine like Bing or Google. Instead, we will write a small program called a web crawler. But this small piece of code gathers the data that search engines rely on.

What do we want to do?

What do all search engines need as much as humans need oxygen? Yes, data. You can never return a search result without data to search in. The question is: where can I get the data? The solution is easy: from the internet.

Search engines like Google use so-called web crawlers. These are simple computer programs that jump from link to link and analyze the data of websites (the HTML code). They read the data and store it. Good web crawlers and good search engines store this data in an intelligent way, so that every search string gets the best possible answer, as fast as possible. For our example we only want to store all links of an HTML page.

My first PHP web crawler

We need some things to get our crawler running:

  • a database where we can store our search results
  • a function that returns the HTML code of a URL
  • time…

The core of our web crawler is a function called find_all_links. It returns an array with all links of a given URL, or an empty array if there are no links. For this we use two important PHP functions:

  1. file_get_contents
    This function returns the content of a file as a string. It can also be used for URLs (as long as the allow_url_fopen setting is enabled).
  2. preg_match_all
    This function searches a string for all matches of a regular expression (regex for short). For more information about regular expressions please use a search engine. For our needs we define a regular expression that finds all <a href> tags in the HTML code. The result is stored in the third parameter.
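The two functions above can be combined roughly as follows. This is only a minimal sketch, not the original code of this article: the helper extract_links and the regular expression are assumptions made for illustration, and a regex will miss unusual markup such as unquoted href values.

```php
<?php
// Hypothetical helper (not from the article): pulls href values out of an HTML string.
function extract_links(string $html): array
{
    // Matches <a ... href="..."> and <a ... href='...'>; group 2 holds the URL.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';

    // preg_match_all returns the number of matches (0 if none, false on error)
    // and fills the third parameter with the captured groups.
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }

    // Drop duplicates and re-index the array from 0.
    return array_values(array_unique($matches[2]));
}

// Returns all links found at $url, or an empty array if the page
// cannot be read or contains no links.
function find_all_links(string $url): array
{
    // file_get_contents returns false on failure; @ hides the warning.
    $html = @file_get_contents($url);
    if ($html === false) {
        return [];
    }
    return extract_links($html);
}
```

Splitting the work into two functions is a design choice of this sketch: extract_links can be tried on any HTML string without touching the network.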

Now we have the links. The next thing is to store them somewhere. For this we create a new MySQL database with a simple table.

With our new database we can extend our web crawler so that it stores its data in this table, one link at a time.

The function links_to_db writes the links to the database. For each link we check whether it is already stored. That is important, because otherwise we get many duplicate links. If we want to track how often a link occurs, we can add a column to count it.
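A storage step like this could look as follows. It is a sketch under several assumptions: the table name links, the columns url and hits, the helper create_links_table, the use of PDO, and the signature of links_to_db are all chosen for illustration and may differ from the original code.

```php
<?php
// Hypothetical schema: one row per URL plus a counter column.
function create_links_table(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS links (
            url  VARCHAR(255) NOT NULL PRIMARY KEY,
            hits INT NOT NULL DEFAULT 1
        )'
    );
}

// Stores each link once. A link that is already known only gets
// its counter increased instead of a second row.
function links_to_db(PDO $db, array $links): void
{
    $exists = $db->prepare('SELECT COUNT(*) FROM links WHERE url = ?');
    $insert = $db->prepare('INSERT INTO links (url) VALUES (?)');
    $update = $db->prepare('UPDATE links SET hits = hits + 1 WHERE url = ?');

    foreach ($links as $link) {
        // Check whether the link is already stored.
        $exists->execute([$link]);
        $known = (int) $exists->fetchColumn() > 0;
        $exists->closeCursor();

        if ($known) {
            $update->execute([$link]);
        } else {
            $insert->execute([$link]);
        }
    }
}
```

With MySQL the connection would be opened with something like new PDO('mysql:host=localhost;dbname=crawler', $user, $password), where the host, database name and credentials are placeholders. Prepared statements keep odd characters in URLs from breaking the queries.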

That's all. Our web crawler is ready to grab all links from an HTML page and to store them for search requests. That was only the first step. A real web crawler takes all links of the analyzed page and then checks every page that is linked. There it finds new links, and so on… If you modify our web crawler to do that, you will see that it runs until you stop it or your memory is full.
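One possible way to follow links from page to page without running forever is sketched below. The function crawl, its parameters and the page limit are assumptions for illustration. The sketch also assumes that the callable passed in returns absolute URLs; relative links would have to be resolved against the page URL first.

```php
<?php
// Hypothetical crawl loop. $fetchLinks is any callable that maps a URL to an
// array of absolute URLs found on that page, for example 'find_all_links'.
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 50): array
{
    $queue   = [$startUrl]; // pages still to visit, in the order they were found
    $visited = [];          // URL => true for every page already fetched

    // Stop when there is nothing left to visit or the page limit is reached.
    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already fetched, do not fetch it again
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($visited);
}
```

The list of visited pages keeps the crawler from going in circles, and the page limit keeps it from filling your memory. Passing the fetch function as a parameter makes it possible to try the loop with a small fake link graph before pointing it at real websites.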

So what is missing for a real search engine?

As said before, it only stores links. Search engines store more data. They combine links, content and domains into a context. For this the text of a page is important (you may want to find it again). From a technical point of view we have to consider many things, for example meta information. If you want to read more about how search engines work, read about SEO. In the last weeks I learned a lot about search engines and why some sites can be found more easily than others.
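As a small illustration of what meta information means here, the following sketch reads the title and the named meta tags of a page. The function extract_page_info and the shape of its result are assumptions for illustration, and the regular expressions only cover simple, well-formed tags.

```php
<?php
// Hypothetical helper: collects the <title> and the named <meta> tags of a page.
function extract_page_info(string $html): array
{
    $info = ['title' => null, 'meta' => []];

    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $info['title'] = trim(html_entity_decode($m[1]));
    }

    // Look at every <meta ...> tag and read its name and content attributes,
    // whatever order they appear in. Tags without a name are skipped.
    preg_match_all('/<meta\s[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (preg_match('/\bname\s*=\s*(["\'])(.*?)\1/is', $tag, $name)
            && preg_match('/\bcontent\s*=\s*(["\'])(.*?)\1/is', $tag, $content)) {
            $info['meta'][strtolower($name[2])] = html_entity_decode($content[2]);
        }
    }

    return $info;
}
```

PHP also ships a built-in get_meta_tags function that reads the meta tags of a file or URL directly. The sketch works on a string instead, so it can reuse HTML that the crawler has already downloaded.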

Alternatives

You can find many libraries and projects that deal with that topic. A really good library is called Caterpillar. I have never used this library for my projects, but I have learned from the source code.

1 Response

  1. christian says:

    hi, i am searching for a solution to crawl and find for special things (beer). would you help me with this tool?