Software Vendor Mistakes With SQL Server: String Splitting With Functions

Advent


Since SQL Server 2016, at least for databases in a similarly current compatibility level, the STRING_SPLIT function has been available. There were some problems with it initially that have recently been solved, too, like a lack of an “ordinal position” indicator, to show you when items occurred in a string.

But there’s still a bigger problem, since it’s sort of a Jack-Of-All-Splitters, you might not be getting the datatype back that you want. From the docs:

If the ordinal output column is not enabled, STRING_SPLIT returns a single-column table whose rows are the substrings. The name of the column is value. It returns nvarchar if any of the input arguments are either nvarchar or nchar. Otherwise, it returns varchar. The length of the return type is the same as the length of the string argument.

If you’re comparing the output to something that isn’t stringy, you might hit some weird implicit conversion issues.

Another significant problem is that SQL Server’s optimizer can’t see inside the string you pass in to generate any meaningful statistics information about the values in there.

This is generally true of many of the unpleasant T-SQL User Defined Functions I’ve seen written to split string prior to this, too.

Playtime


The big blaring headline here should be that if you’re going to do split any string and attempt to filter or join to it, you should put the parsed results into a #temp table first, with whatever indexing is appropriate. If you don’t do that, you’re gonna see all the screwy stuff that I mention below.

First, you only get a 50 row estimate no matter how many separate values are in the string. That might be fine for you, but it might not be fine for everyone.

Let’s start by throwing a real knuckleball to the function.

DECLARE 
    @x nvarchar(MAX) = N'';

SELECT 
    @x += 
        (
               SELECT DISTINCT
                   [text()] = 
                       ',' + RTRIM(P.OwnerUserId)
               FROM dbo.Posts AS P
               FOR XML 
                   PATH(''),
                   TYPE
        ).value
            (
                './text()[1]', 
                'nvarchar(max)'
            );

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT(@x, ',') AS SS
)
GROUP BY P.OwnerUserId
OPTION(RECOMPILE);

The code above takes every unique value in the OwnerUseId column in the Posts table and adds them to a comma separated list. Why didn’t I use the STRING_AGG function? I forgot. And after so many years, I have the XML equivalent memorized.

Even with a recompile hint on board, this is the query plan we get:

SQL Server Query Plan

 

Spending 8 seconds inside the splitter function feels pretty bad on its own, but then the entire query plan turns out crappy because of the 50 row estimate, and takes 11 minutes to finish in total. I spent 11 minutes of my life waiting for that to finish. Just for you.

Think about that.

Regularly


Under more normal circumstances, the lack of introspection (of the statistical variety) that you get from directly querying the splitter function can lead to a bad time.

I’m intentionally including a value with a fair amount of skew in the list-to-be-parsed, that leads to what might be a bad plan choice, based on that lack of introspection:

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT('1138,22656', ',') AS SS
)
GROUP BY P.OwnerUserId;
SQL Server Query Plan
zee problemmen!

We get a 50 row guess for two values, but one of those values has ~28k rows associated with it in the Posts table. Since there’s no stats on the internal elements of the string, we don’t get a good estimate for it at all

Big Awkward


If you, or one of your users, throws a malformed string into the mix, you might get some unexpected results.

On top of the values that we just passed in, I’m also going to pass in an empty string at the end after the final comma:

SELECT
    p.OwnerUserId,
    TotalScore = 
        SUM(p.Score * 1.)
FROM dbo.Posts AS P
WHERE P.OwnerUserId IN
(
    SELECT
        SS.value
    FROM STRING_SPLIT('1138,22656, ', ',') AS SS
)
GROUP BY P.OwnerUserId;

This time around, the results are a little different from above. The space at the end is implicitly converted from an empty string to the number 0, and things look like this:

SQL Server Query Plan
played

Most all of the bad guesses are retained as the previous plan, but since the number 0 accounts for a whole mess of rows too, things get a whole lot worse.

We go from 4.5 seconds to 30 seconds, and blah blah blah everything is horrible.

If you want to get around this, sure, you can add a filter to remove empty strings from the results of the splitter, but if you’re reading this, chances are you might not have the most defensive T-SQL coding practices in place. to begin with.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.