Build your own search engine
In this article I will show you how to build your own search engine. OK, not really a search engine like Bing or Google. What we actually want to write is a little program called a web crawler. But this small piece of code generates all the data that search engines work with.
What do we want to do?
What do all search engines need as badly as humans need oxygen? Yes, data. You can never return a search result without data to search in. The question is: where can I get data? The answer is easy: from the internet.
Search engines like Google use so-called web crawlers. These are simple computer programs that jump from link to link and analyze the data of websites (the HTML code). They read the data and store it. Good web crawlers and good search engines store this data in an intelligent way, so that every search string gets a perfect answer, and as fast as possible. For our example we only want to store all links of an HTML page.
My first PHP web crawler
We need a few things to get our crawler running:
- a database where we can store our search results
- a function that returns the HTML code of a URL
- time…
<?php
$url = "http://www.developer-blog.net";
print_r(find_all_links($url));

// returns all unique link targets found in the HTML of the given URL
function find_all_links($url) {
    $htmlData = file_get_contents($url);
    if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
        return array_unique($links[1]);
    }
    return array();
}
?>
This code example is the core of our web crawler. The function find_all_links returns an array with all links of a given URL, or an empty array if there are no links. For this we use two important PHP functions:
- file_get_contents
This function returns the content of a file as a string. It can also be used for URLs.
- preg_match_all
This function allows us to test a string against a regular expression (regex for short). For more information about regular expressions please use a search engine. For our needs we define a regular expression that finds all <a href> tags in the HTML code. The matches are stored in the third parameter.
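To make it easier to see what the pattern captures, here is a minimal sketch that runs the same regular expression against a hard-coded HTML snippet (the snippet itself is made up for this example):

<?php
// Hypothetical sample HTML, only to show what the pattern captures
$htmlData = '<p><a href="https://example.com/a">A</a> and <a href=\'/relative/b\'>B</a></p>';

if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
    // $links[0] holds the full <a href=...> matches, $links[1] only the captured URLs
    print_r($links[1]); // Array ( [0] => https://example.com/a [1] => /relative/b )
}
?>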
Now we have the links. The next thing is to store them somewhere. For this we create a new MySQL database with a simple table:
CREATE TABLE `links` (
  `link` varchar(255) NOT NULL,
  `id` int(10) unsigned NOT NULL auto_increment,
  PRIMARY KEY (`id`),
  UNIQUE KEY `link` (`link`)
) ENGINE=MyISAM CHARSET=utf8;
With our new database we can extend our web crawler so that it stores its data in this table. The following code demonstrates how every single link is stored in the table.
<?php
$url = "http://www.developer-blog.net";

$mysqli = new mysqli("127.0.0.1", "root", "", "test");
if ($mysqli->connect_error) {
    echo "unable to connect to Database: " . mysqli_connect_error() . "\n";
    exit();
}

$links = find_all_links($url);
links_to_db($links, $mysqli);

// returns all unique link targets found in the HTML of the given URL
function find_all_links($url) {
    $htmlData = file_get_contents($url);
    if (preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $htmlData, $links, PREG_PATTERN_ORDER)) {
        return array_unique($links[1]);
    }
    return array();
}

// writes every link into the database, skipping links that are already stored
function links_to_db($links, $mysql) {
    foreach ($links as $link) {
        $allowed = true;

        // check if the link already exists
        $statement = "SELECT * FROM `links` WHERE link = ?";
        if ($stmt = $mysql->prepare($statement)) {
            $stmt->bind_param("s", $link);
            $stmt->execute();
            $stmt->store_result();
            if ($stmt->num_rows > 0) {
                $allowed = false;
            }
            $stmt->close();
        }

        // insert new links into the table
        if ($allowed) {
            $statement = "INSERT INTO `links` (link) VALUES (?)";
            if ($stmt = $mysql->prepare($statement)) {
                $stmt->bind_param("s", $link);
                $stmt->execute();
                $stmt->close();
            }
        }
    }
}
?>
The function links_to_db writes the links to the database. For each link we check whether it is already stored. That is important, because otherwise we would end up with many duplicate links. If we want to track how often a link occurs, we can add a column that counts it, as the sketch below shows.
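One possible way to do that counting is sketched below. It assumes you add an extra integer column to the table (called hits here, the name is only an example) and lets MySQL increment it via the UNIQUE key on link instead of the SELECT/INSERT pair used above:

<?php
// Sketch only: assumes a column `hits` was added first, e.g. with
// ALTER TABLE `links` ADD `hits` int unsigned NOT NULL DEFAULT 1;
function links_to_db($links, $mysql) {
    $statement = "INSERT INTO `links` (link, hits) VALUES (?, 1)
                  ON DUPLICATE KEY UPDATE hits = hits + 1";
    if ($stmt = $mysql->prepare($statement)) {
        foreach ($links as $link) {
            $stmt->bind_param("s", $link);
            $stmt->execute();
        }
        $stmt->close();
    }
}
?>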
That's all. Our web crawler is ready to grab all links from an HTML page and store them for search requests. That was only the first step. A real web crawler takes all links of the analyzed page and then checks every page that is linked. There it finds new links, and so on… If you modify our web crawler to do that, you will see that it runs until you stop it or your memory is full.
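If you want to try that jump-from-link-to-link behaviour yourself, here is a minimal sketch of such a loop. It reuses find_all_links from above, keeps a simple queue plus a list of visited URLs, and stops after a fixed number of pages so it does not actually run until your memory is full (the $maxPages limit and the filtering of non-absolute links are my own simplifications):

<?php
// Sketch of a breadth-first crawl; assumes find_all_links() from above is available.
$queue    = array("http://www.developer-blog.net");
$visited  = array();
$maxPages = 50; // safety limit, otherwise the loop only stops when you stop it

while (!empty($queue) && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    foreach (find_all_links($url) as $link) {
        // very naive: only follow absolute http(s) links we have not seen yet
        if (preg_match('/^https?:\/\//i', $link) && !isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}

print_r(array_keys($visited));
?>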
So what is missing for a real search engine?
As said before, our crawler only stores links. Search engines store much more data. They combine links, content and domains into a context. For this the text of a page is important (you may want to find it again). From a technical point of view we have to consider many things, for example meta information. If you want to read more about how search engines work, read about SEO. In the last weeks I have learned a lot about search engines and why some sites can be found better than others.
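As a small taste of what this "more data" could look like, here is a sketch that pulls the page title and the meta description out of the HTML with PHP's DOMDocument. The helper find_page_info is my own example; which elements are worth storing is a design decision of your search engine:

<?php
// Sketch: extract title and meta description of a page, in addition to the links.
function find_page_info($url) {
    $htmlData = file_get_contents($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($htmlData); // suppress warnings for sloppy real-world HTML

    $info = array("title" => "", "description" => "");

    $titles = $doc->getElementsByTagName("title");
    if ($titles->length > 0) {
        $info["title"] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("name")) === "description") {
            $info["description"] = $meta->getAttribute("content");
        }
    }
    return $info;
}

print_r(find_page_info("http://www.developer-blog.net"));
?>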
Alternatives
You can find many libraries and projects that deal with this topic. A really good library is called Caterpillar. I have never used it in my own projects, but I have learned from its source code.