Sybase iAnywhere Sybase iAnywhere

Rows and Columns


SQL Anywhere perspective on technology

header image

Full Text Searching a Wikipedia Database

May 12th, 2008 · 2 Comments

In my previous article, I explained how I downloaded an XML dump of the wikipedia articles and built a SQL Anywhere database with the data.
In order to test version 11, I created a full text index on the title and article data and populated the index:

CREATE TEXT INDEX "wikidex" ON "DBA"."wikipedia"
    ( "title","details" ) CONFIGURATION "SYS"."default_char" IMMEDIATE REFRESH

It took about 4 hours to build the index. Of note, the SQL Anywhere server temporary file grew to about 35G in size during the refresh process for the text index. This space is required in order to sort and generate the index contents. The database itself increased in size by about 4G after the index was created.

Once finished, I could start running queries. I ran some simple queries at first, like the following:

SELECT title, details FROM wikipedia WHERE contains(  details, "Canada" );

These queries were pretty fast (< 1sec) to return reults. As a comparison, a similar query using like to find the same results took several minutes to return (basically every row had to be scanned sequentially for the text):

SELECT title, details FROM wikipedia WHERE details LIKE "%Canada%";

One interesting thing to note in the result sets is that by changing the query slightly, you can get score data - basically this measures how closely the row matched the query string. I always sort on the column so I can see the most closely matched rows first:

SELECT ct.score, title, details FROM wikipedia contains(  details, "Canada" ) ct ORDER BY ct.score DESC;

Tags: Development

2 responses so far ↓

  • 1 Larry S // May 30, 2008 at 2:21 pm

    This sounds cool. Do the full text indexes impact dml on the indexed columns more than regular indexes?

  • 2 Biff // Jun 2, 2008 at 10:58 am

    The performance impact depends on the refresh setting for the index. SQL Anywhere has 3 options for refresh
    1) Immediate refresh (the default) - the index is updated whenever a DML statement changes an underlying value. Obviously, this will impact performance, and is generally recommended when the indexed columns are short or there is a requirement that they be kept up to date.
    2) Automatic refresh - the text index is refreshed automatically on a schedule that you can specify. This allows you to control when the refresh occurs, at the expensive of some potential staleness in the text index.
    3) Manual refresh - the index is only refreshed when you explicitly issue the refresh statement. This is useful when you rarely change the data, or you want to tie the index refresh to a specific event (eg. a data load process).

    In my case, I use manual, since I update the article data rarely, but in a large batch. I load the new article data, then issue the refresh afterwards.

Leave a Comment

Note that all comments are currently being moderated until I have a better handle on spam, so your comment may not appear for a couple of hours