Jul 29, 2014. Scrape, Index, and Search HTML

Scraping stuff is always fun. This quick post will get you started with a really simple example. We’re going to scrape the first page of the Fulltext Diary and then play around with the resulting HTML documents. Take a look.

First steps

Download Sphinx and Simple HTML Dom. Simple HTML Dom makes traversing (and transforming) HTML documents… simple.

Set up a realtime index

I’m just going to grab the url and content from each of the blog posts on the first page of the Fulltext Diary. So, in the end, there will only be 10 documents (and they’ll be the most recent).

Your index might have a content field and a string attribute for the url, like this:

index rt
{
    type            = rt
    path            = /var/lib/sphinx/data/rt
    rt_attr_string  = url
    rt_field        = content
}
 
searchd
{
    listen		= 9306:mysql41
    log			= /var/log/sphinx/searchd.log
    query_log		= /var/log/sphinx/query.log
    query_log_format	= sphinxql
    read_timeout	= 5
    max_children	= 30
    pid_file		= /var/run/sphinx/searchd.pid
    workers		= threads 
}

The URLs are stored as string attributes so that they'll be visible in the result set.
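To see why that matters: a string attribute comes back in SELECT results just like a column, so the matched post's URL is right there. A quick sketch of what such a query looks like (the search term is arbitrary):

```sql
mysql> SELECT id, url FROM rt WHERE MATCH('sphinx');
```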

Perfect.

PHP

Now, you might write some PHP. And, it might look something like this:

<?php
error_reporting(E_ALL); 
ini_set('display_errors', 1);
include 'simple_html_dom.php';
 
//for the id 
$i = 0;
 
//get the html 
$html = file_get_html('http://sphinxsearch.com/blog');
 
//connect to Sphinx 
$mysqli = new mysqli('127.0.0.1', '', '', '', 9306);
 
if ($mysqli->connect_error) {
    die('Connect Error (' . $mysqli->connect_errno . ') ' . $mysqli->connect_error);
}
 
//get links to the 10 blog posts on this page
foreach ($html->find('h3 a') as $link) {
    //increment id for each document
    $i++;
 
    //get that url
    $url = $link->href;
 
    //create content variable
    $content = "";
 
    //now, get html for those blog posts
    $blog = file_get_html("$url");
 
    //get everything from div.entry
    foreach ($blog->find('div#content div.entry') as $p) {
 
        $content = $mysqli->real_escape_string($p);
 
        //insert!
        $insert = $mysqli->query("INSERT INTO rt (id, url, content) VALUES ($i, '$url', '$content')");
 
        //was it successful?
        if ($insert) {
            echo "Successful Insert";
        }
 
        else {
            printf("Error message: %s\n", $mysqli->error);
        }
    }
}
$mysqli->close();
?>

We use mysqli to open a SphinxQL connection to searchd, and Simple HTML DOM to dig through the HTML. Then we send what we want to Sphinx with an INSERT statement, looping until we’ve retrieved everything we’re looking for.

Very straightforward.

Other things to consider…

In actual practice, you would probably do a few things differently. You might use prepared statements. You might dump all the scraped text into a TSV file and have Sphinx index that. You might also specify a whole list of URLs to scrape and traverse them all. There are many things you might want to consider…
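For instance, the insert above could be done with a prepared statement instead of real_escape_string. A minimal sketch, assuming the same rt index and searchd settings as before; note that searchd speaks the MySQL text protocol, so this uses PDO with client-side (emulated) prepares rather than mysqli's server-side ones, and the sample URL and content are purely illustrative:

```php
<?php
// Sketch: the scrape-loop insert, rewritten with a prepared statement.
// Assumes searchd is listening on 127.0.0.1:9306 (mysql41 protocol).
$pdo = new PDO('mysql:host=127.0.0.1;port=9306');

// Emulated prepares do the quoting client-side, which is what we
// want when talking to searchd rather than a real MySQL server.
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, true);

$stmt = $pdo->prepare('INSERT INTO rt (id, url, content) VALUES (?, ?, ?)');

// Illustrative values; in the real loop these come from Simple HTML DOM.
$stmt->execute([1, 'http://sphinxsearch.com/blog/example-post/', 'Example post body']);
```

The binding takes care of escaping quotes in the scraped content (and the URL, which the earlier snippet left unescaped).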

Search!

For my simple purposes, I just want these blog posts in a realtime index. Now, I can just start searching.

mysql> select id from rt where match('mysql');
+------+
| id   |
+------+
|    7 |
|    4 |
|   10 |
|    3 |
|    9 |
|    1 |
+------+
6 rows in set (0.01 sec)
 
mysql> select id from rt where match('mysql MAYBE docker');
+------+
| id   |
+------+
|    1 |
|    7 |
|    4 |
|   10 |
|    3 |
|    9 |
+------+
6 rows in set (0.00 sec)

You can see that 6 of our last 10 blog posts mention MySQL. When I threw “MAYBE docker” into the mix, document 1 jumped to the top of the result set: it contains ‘docker’, so it gets boosted. You could also add weight() to the SELECT list to see the relevance computed for each doc.

Zone Search

ZONE searching can also be useful in this context. Our documents are filled with HTML tags. We can narrow our search by choosing which zones (tags) to search. To do this, we just need to change a couple of things in our configuration. Notice that I added 2 lines at the bottom (html_strip and index_zones):

index rt
{
    type            = rt
    path            = /var/lib/sphinx/data/rt
    rt_attr_string  = url
    rt_field        = content
    html_strip      = 1
    index_zones     = p, a
}

I restarted searchd so the rt index would pick up the configuration changes. Now, searching only through specific tags is as easy as:

mysql> select id from rt where match('ZONE:(p) my*');
+------+
| id   |
+------+
|    3 |
|   10 |
|    7 |
|    4 |
|    1 |
+------+
5 rows in set (0.03 sec)
 
mysql> select id from rt where match('ZONE:(a) my* MAYBE doc*');
+------+
| id   |
+------+
|    1 |
|    3 |
|    4 |
+------+
3 rows in set (0.01 sec)

Nice.

The first search found documents containing words starting with ‘my’ in their ‘p’ ZONEs. By searching between ‘p’ tags, we’ve just searched the basic text of the blog posts. The second found documents that had words starting with ‘my’ and MAYBE words starting with ‘doc’ between ‘a’ tags. By the way, when searching between ‘a’ tags, we’re searching the post’s link text.

Other useful HTML options

  • html_index_attrs could also be useful in this context. With this option, you can get Sphinx to index the contents of certain attributes (such as a link’s href) even though the markup itself is being stripped.
  • And html_remove_elements will also strip the element’s contents, so that nothing between the opening and closing tags gets indexed. Use this to remove embedded scripts, CSS, etc. The short tag form for empty elements (e.g. <br />) is properly supported: the text that follows such tags will not be removed.
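As a sketch, both options slot into the index definition alongside html_strip (the attribute and element choices here are illustrative, not from the post):

```
index rt
{
    # ... type, path, fields and attributes as above ...
    html_strip           = 1
    # keep href values from links even though the tags are stripped
    html_index_attrs     = a=href
    # drop scripts and inline styles entirely, contents included
    html_remove_elements = script, style
}
```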

Conclusion

You can imagine the possibilities. Hopefully this simple example has sparked some ideas.

Happy Sphinxing!



2 Responses to “Scrape, Index, and Search HTML”

  1. Dorin says:

    Why is “html_strip” applied to the whole index / all string attributes?

    Normally you would set this for a specific attribute/field, like “content”. You may at the same time have attributes that shouldn’t be parsed.

    Ideally I’d like to have the Sphinx config in JSON/XML format, which would allow hierarchical definitions and customization.

  2. adrian says:

    @Dorin: because that’s how it works right now. You can file a feature request for an option that selects which fields the stripping should be applied to.
