SQL Server CLR Functions using Regular Expressions

I think it’s a well known fact that string handling within SQL Server is a little, lacking, to say the least.  However, SQL Server does allow you to create CLR procedures and functions, which allow you to use the power of programming languages such as VB, C# and so on.

.NET includes the RegEx class, which allows you to use regular expressions for string matching, replacing and so on.  I’ve previously used this in powershell scripts to facilitate replacing text in script files en-masse.  Now, I’m not very au fait with C#, the usual extent of my coding is for script tasks in SSIS.  However, I felt it was about time that I start learning a little more and was recently presented with an excellent opportunity to do so: I needed to clean up a table with a notes column, where the notes had previously allowed all kinds of control characters to be inserted.

The way to do this in native T-SQL is using the REPLACE function.  However, REPLACE does not deal with wildcards; its a literal replace.  I also had a requirement to strip down any set of consecutive spaces to a single space (e.g. changing a string like “this     string   has    too    many        arbitrary    spaces”, to “this string has too many arbitrary spaces”. I needed to be able to replace any instance of a set of characters within my column, so the only way to do this is with a horrible nested REPLACE statement, and I still had to deal with the multiple consecutive spaces in a slightly creative way; by replacing all spaces with a pair of characters such as “<>”, then replacing the inverse (“><“) with nothing, and finally replacing the remaining “<>” string with a space.  This seems like a lot of hoop jumping.

The final REPLACE statement looks something like this:-

LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(ISNULL(Notes,''),CHAR(1), ''), CHAR(2), ''), CHAR(3), ''), CHAR(4), ''),CHAR(5), ''), CHAR(6), ''), CHAR(7), ''), CHAR(8), ''), CHAR(9), ''), CHAR(10), ''), CHAR(11), ''),CHAR(12), ''), CHAR(13), ''), CHAR(14), ''), CHAR(15), ''), CHAR(16), ''), CHAR(17), ''),CHAR(18), ''), CHAR(19), ''), CHAR(20), ''), CHAR(21), ''), CHAR(22), ''), CHAR(23), ''),CHAR(24), ''), CHAR(25), ''), CHAR(26), ''), CHAR(27), ''), CHAR(28), ''),CHAR(29), ''), CHAR(30), ''), CHAR(31), ''),' ','<>'),'><',''),'<>',' ')))

Let’s face it, that is not pretty, or very readable.  Or, for that matter, particularly quick.

.NET RegEx makes a mockery of the above statement.  I’m using two methods, the IsMatch method and the Replace method.  The syntax is so simple, I don’t really think it even needs much explaining:-

using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

namespace SQLServerRegExFunctions
{
    public class RegExStringFunctions
    {
        [SqlFunction(IsDeterministic=true,IsPrecise=true,DataAccess=DataAccessKind.None)]
        public static bool RegExIsMatch(string StringToParse, string RegEx)
        {
             return Regex.IsMatch(StringToParse, RegEx);
        }

        [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
        public static string RegExReplace(string StringToParse, string RegEx, string ReplacementText)
        {
            return Regex.Replace(StringToParse, RegEx, ReplacementText);
        }
    }
}

I’ve included the portions of code to declare this as a SqlFunction to allow me to create my CLR function in SQL. Basically, the two functions will pass in the string we’re checking, the regular expression to match against, and in the case of the replace function, the string we want to replace our matches with.

The code to create the CLR function is below (assuming you’ve compiled your dll to your local C drive):-

/* Load the assembly */
/* drop object if it already exists */
IF EXISTS (SELECT 1
			FROM sys.objects AS ob
			INNER JOIN sys.schemas AS sc
			ON ob.schema_id = sc.schema_id
			WHERE ob.name = 'udfRegExIsMatch'
			AND sc.name = 'dbo')
	DROP FUNCTION dbo.udfRegExIsMatch;

/* drop object if it already exists */
IF EXISTS (SELECT 1
			FROM sys.objects AS ob
			INNER JOIN sys.schemas AS sc
			ON ob.schema_id = sc.schema_id
			WHERE ob.name = 'udfRegExReplace'
			AND sc.name = 'dbo')
	DROP FUNCTION dbo.udfRegExReplace;

/* Drop the assembly */
/* drop object if it already exists */
IF EXISTS (SELECT 1
			FROM sys.assemblies AS A
			WHERE a.name = 'SQLServerRegExFunctions')
	DROP ASSEMBLY SQLServerRegExFunctions;
GO

/* create the assembly */
CREATE ASSEMBLY SQLServerRegExFunctions
FROM 'C:\SQLServerRegExFunctions.dll'
WITH PERMISSION_SET = SAFE

/* Create the functions from the assembly */
GO

CREATE FUNCTION udfRegExIsMatch(@StringToParse nvarchar(4000), @RegEx nvarchar(255))
RETURNS bit
AS
External name SQLServerRegExFunctions.[SQLServerRegExFunctions.RegExStringFunctions].RegExIsMatch
GO

CREATE FUNCTION udfRegExReplace(@StringToParse nvarchar(4000), @RegEx nvarchar(255), @ReplacementText NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
External name SQLServerRegExFunctions.[SQLServerRegExFunctions.RegExStringFunctions].RegExReplace
GO

To run the function, the code would look something like this to replace all those spaces, along with the comparison using the T-SQL method:-

DECLARE @RegEx VARCHAR(10) = '[ ]+'

DECLARE @TableOfStrings TABLE
(Notes VARCHAR(MAX))

INSERT INTO @TableOfStrings
(Notes)
SELECT 'This string has too many spaces sometimes.'
FROM sys.columns

SET STATISTICS TIME ON
/* New version of the replace many spaces using CLR*/
SELECT dbo.udfRegExReplace(Notes,@RegEx,' ')
FROM @TableOfStrings AS TOS

/* old version of the repace using native T-SQL*/
SELECT REPLACE(REPLACE(REPLACE(notes, ' ', '<>'),'><',''),'<>',' ')
FROM @TableOfStrings AS TOS

The results from the STATISTICS TIME call show how much more efficient the CLR method is in comparison to the T-SQL method:-

SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 40 ms.

SQL Server Execution Times:
CPU time = 203 ms, elapsed time = 210 ms.

40ms as opposed to 210ms; think that’s a clear win! This also gets progressively worse on the SQL side for each additional REPLACE added; for the data set I was working with, which contained some 12 million rows, the difference resulted in a query that ran in a couple of minutes as opposed to a couple of hours!

Advertisements

3 thoughts on “SQL Server CLR Functions using Regular Expressions

  1. Good Post. Very Useful indeed. However, couldn’t you have done the replace for consecutive spaces with only two replaces, nested, and both of them replacing two spaces ‘ ‘ with one ‘ ‘.

    Liked by 1 person

    • Hi Mandar,

      Unfortunately not, try the following code and you’ll see why:-

      DECLARE @SpacesString VARCHAR(1000) = ‘string with two’ + REPLICATE(CHAR(32),2) + ‘spaces and’ + REPLICATE(CHAR(32),3) + ‘three spaces and’ + REPLICATE(CHAR(32),4) + ‘four spaces and’ + REPLICATE(CHAR(32),5) + ‘five spaces’ + REPLICATE(CHAR(32),6) + ‘and six spaces’

      SELECT @SpacesString
      , REPLACE(REPLACE(@SpacesString,CHAR(32) + CHAR(32) , CHAR(32)), CHAR(32) + CHAR(32) , CHAR(32)) SQLReplaceStatements

      The replace statement will only replace occurrences of two spaces together with one: so if you have three spaces together, it will replace the first two with one space and do nothing with the final space.

      As you get more and more consecutive spaces, then this will carry on occurring, regardless of the number of nested replace statements you do.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s