I was surfing the internet and found that you can download a dump of the wikipedia article data from Wikimedia
I was learning about the XML capabilities of SQL Anywhere, so I decided to download the XML dump of the article data and load it into a database. From there I could use the database to test the SQL Anywhere v11 (currently in beta) full text search functionality.
The XML file was about 12G of data. The format was pretty straight forward, so I decided I would use the OpenXML(…) function to simply read the data into a table in the database.
My table looked like this:
CREATE TABLE "DBA"."wikipedia" ( "article_id" bigint NULL ,"title" varchar(1000) NULL INLINE 256 PREFIX 8 ,"contributor" varchar(1000) NULL INLINE 256 PREFIX 8 ,"comments" long varchar NULL ,"details" long varchar NULL )
and my load script looked like this:
CREATE VARIABLE content long varchar; SELECT xp_read_file( '\wikipediadata\data\file1.xml' ) INTO content; INSERT INTO wikipedia( article_id, title, contributor, "comments", details ) SELECT * FROM OPENXML( content, '/mediawiki/page', 1, '<root xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance/"/>' ) WITH( article_id bigint 'id', title varchar(1000) 'title', contributor varchar(1000) './revision/contributor/username', comments long varchar './revision/comment', details long varchar './revision/text' );
The first problem I encountered was a limitation of SQL Anywhere. In SQL Anywhere, the maximum size of a string variable (including function/producedure arguments and character columns) is 2G. This meant before I could load the data I would have to break the 12G XML file into 2G pieces. I used a file split utility to break create a bunch of smaller XML files.
I then had to edit the beginning and end of each file to ensure each individual file was well formatted XML.
After that, I changed my load script above into several read/load statements and ran them. a couple of hours later I had all of the data loaded into the database. If the SQL Anywhere LOAD TABLE statement supported a clause that allowed me to specify the same information as
OpenXML, I could have used it instead, avoiding the requirement to split the large file into smaller components and presumably providing a much faster load than individual insert statements. I have added this to the list of future enhancements for SQL Anywhere.
In any case, now that I had the data loaded, I could view and query the data. However, the real goal is to play with the full text searching functionality in SQL Anywhere 11. I’ll talk about that in a future article.


1 response so far ↓
1 Full Text Searching a Wikipedia Database // May 12, 2008 at 11:31 am
[…] my previous article, I explained how I downloaded an XML dump of the wikipedia articles and built a SQL Anywhere […]
Leave a Comment
Note that all comments are currently being moderated until I have a better handle on spam, so your comment may not appear for a couple of hours