Sybase iAnywhere SQL AAnywhere Mobile and Embedded Database

I'd rather play golf


Thoughts on data management, autonomic computing, and self-managing database systems.

header image

Using regular expressions with SQL Anywhere

June 26th, 2009 · 3 Comments

SQL Anywhere version 11.0.0 introduced support for search conditions that include regular expression searching. There are two variants of regular expression search predicates that one can use, each with their own semantics: SIMILAR TO and REGEXP.

SIMILAR TO

The SIMILAR TO predicate is part of the 2008 ANSI/ISO SQL standard. However, the draft of the next SQL standard, due in the 2011 timeframe, is currently under development – in fact, a WG3 editing meeting is underway in Korea this week – and SIMILAR TO will likely be eliminated from subsequent versions of the SQL standard as its functionality is being replaced by the REGEXP_LIKE predicate (see below).

The syntax of the SIMILAR TO predicate is straightforward:

1
expression [ NOT ] SIMILAR TO pattern [ ESCAPE escape-expression ]

but as usual the devil is in the details. For starters, here’s an example, using the SQL Anywhere demo database:

2
3
4
SELECT *
FROM Customers
WHERE PostalCode NOT SIMILAR TO '([0-9]{5})|([0-9]{5}-[0-9]{4})|([A-Z][0-9][A-Z][[:whitespace:]]{1}[0-9][A-Z][0-9])'

which finds all those addresses with invalidly-formatted postal codes (either US or Canadian); the accepted codes have the formats of (a) five numbers (US), (b) five numbers, a dash, and four numbers (US), and (c) the six-character alphanumeric Canadian postal codes with a single embedded blank.

similarto

The regular expression patterns supported by SIMILAR TO, however, differ from those supported by regular expression conditions in other software packages (such as Perl). Here is a small, non-exhaustive list of differences (a more exhaustive list is contained in the SQL Anywhere documentation):

  • As with LIKE and REGEXP, SIMILAR TO matches entire values, not portions of values.
  • SIMILAR TO uses “%” (percent) and “_” (underscore) as wildcard characters, in the same way as LIKE. One uses “%” instead of “.*”.
  • SIMILAR TO doesn’t support a variety of sub-character classes, such as [:ascii:], [:blank:], or [:punct:].
  • Perhaps most importantly, SIMILAR TO uses collation-based comparisons when comparing string values. This can be useful. For example, with SQL Anywhere’s default case-insensitive string matching, the pattern [A]{1} is equivalent to [a]{1}, and these equivalences may also apply to accented characters with specific collations. However, a significant drawback is that range patterns don’t work properly; the range pattern [A-C] does not, in fact, match only the upper case characters A, B, and C. Rather, in the default case-insensitive collation [A-C] matches any of the characters A, B, b, c and C; it does not match “a” because the character “a” precedes “A” in the collation sequence.


    This means, then, that the example above fails to properly validate Canadian postal codes; the query would accept Canadian postal codes containing lower-case letters.

REGEXP

With the SQL Anywhere 11.0.1 release, the REGEXP predicate supports regular expression patterns in a manner similar to Perl and other UNIX-based tools that support regular expression searching. Once again, the syntax is straightforward:

5
expression [ NOT ] REGEXP pattern [ ESCAPE escape-expression ]

In the SQL Standard, the syntax is virtually identical except that the predicate uses the keyword LIKE_REGEXP. Supported patterns are those from the XQuery portion of the standard. In SQL Anywhere, we’ve adopted pattern syntax from a variety of sources, primarily Perl. REGEXP does not use collation-based matching; matching is based on code point values in the database character set. For example, the comparison X REGEXP '[A-C]', for the single character X, is equivalent to CAST(X AS BINARY) >= CAST(A AS BINARY) AND CAST(X AS BINARY) <= CAST(C AS BINARY).

REGEXP supports the common meta-characters and sub-classes familiar to programmers, and also supports special escaped characters such as "\s" for a space, or "\r" for carriage return, and look-ahead and look-behind assertions. Here is the same example for validating postal codes, but this time using REGEXP:

6
7
8
SELECT *
FROM Customers
WHERE PostalCode NOT REGEXP '([0-9]{5})|([0-9]{5}-[0-9]{4})|([A-Z][0-9][A-Z]\s[0-9][A-Z][0-9])'

Finally, note that the SQL Anywhere query optimizer will automatically optimize REGEXP and SIMILAR TO predicates - as it does for LIKE predicates - to be used as sargable predicates for index scans, depending on the specific pattern.

Tags: Query optimization · SQL Anywhere · SQL Standard

3 responses so far ↓

  • 1 Loading data in more flexible ways - part un // Jul 6, 2009 at 8:43 pm

    [...] to a DATE, I used the REGEXP_SUBSTR function, another new feature of Version 11 that accompanies SQL Anywhere’s regular expression support. Here, the first usage of REGEXP_SUBSTR returns the truncated, numeric, day of the month, using a [...]

  • 2 Michelle McGaughey // Mar 11, 2010 at 6:36 am

    I’ve tried to use REGEXP within a CHECK statement on a domain to ensure that a barcode conforms to the correct format, but I keep getting a syntax error. Can it not be used here?

    CHECK ( REGEXP '([0-9]{8}) | ([0-9]{7} X)' );

  • 3 Glenn Paulley, Director, Engineering // Mar 11, 2010 at 8:44 am

    Hi Michelle,

    When using a CHECK constraint with a DOMAIN definition, you need to specify the complete search condition even though the column name isn’t known (since it’s within a DOMAIN). In SQL Anywhere, you do this by using a placeholder variable, which begins with an ‘@’ sign, which is substituted with the name of the column when the DOMAIN is subsequently referenced in a CREATE TABLE statement:

    1
    2
    
    CREATE DOMAIN barcode CHAR( 35 ) NULL 
      CHECK( @barcode REGEXP '([0-9]{8}) | ([0-9]{7} X)' );

    Here, I’ve used the placeholder variable ‘@barcode’ which will be replaced by corresponding column name when used in a table definition.