Composite Index Keys and searching on secondary key values

In my time as a DBA, one of the most frequent misunderstandings I’ve found is how composite keys work within SQL Server, particularly relating to searches on the secondary key columns.

Consider the following table and index, a fairly standard link table between a customer and a product (in reality these would have foreign keys, but for the purposes of this post I’ve left these off):-

CREATE TABLE dbo.CustomerProduct
(CustomerID int NOT NULL
, ProductID int NOT NULL
CONSTRAINT pk_dbo_CustomerProduct PRIMARY KEY CLUSTERED (CustomerID, ProductID))

The ordering of the two key columns in the index is important: we first sort by CustomerID, and then, for each customer, we sort the ProductID column.  This means that if we want to search by ProductID alone, the only way in which we can be sure that we have found every ProductID is to perform a full table scan, as each individual CustomerID may have a row with that ProductID and there is no way of knowing this until we have searched every CustomerID.

In order to search effectively by ProductID, we need to build an index with the ProductID as the first column in the index key:-

CREATE UNIQUE NONCLUSTERED INDEX uidx_ProductID_CustomerID
ON dbo.CustomerProduct (ProductID, CustomerID)

Notice that in this case, I’ve included the CustomerID in the index key, and made the index unique.  This is because even with a non unique, nonclustered index we need to be able to uniquely identify each row of data so that we can get back to the actual row from the clustered index. So, to enable this, SQL Server will add the clustered index key to the non unique, nonclustered index’s key.  In this case, we already have one half of the clustered index key, the ProductID, so all we need to do is to add the CustomerID.  If we didn’t actually specify this, we would end up with exactly the same index being built, but the DDL for the index would not indicate this, potentially leading to duplicate indexes when considering what further indexes the table might need at a later point in time.  In the event that the clustered index itself is non unique, then that is made unique by adding a hidden 4 byte column called the “uniquifier” to the index key; this in turn is added to the nonclustered index key as mentioned above.  So, ultimately, all indexes are unique, whether we specify them to be so or not.

Using this new index, we can now successfully search by ProductID and perform a seek or a range scan against the index.  SQL Server will know that it has returned all rows that contain a specified ProductID value when it reaches the end of that range.

The second most common mistake I’ve seen with composite indexes, is including further columns in the index key after the final unique key column has been specified.  This applies since SQL Server 2005 and INCLUDED columns, as prior to this, we had to include these columns in the index key as the only way to create a covering index.

Imagine we need to add some more detail to the above table to show who created the record and when it was created, we would add the following columns:-

CREATE TABLE dbo.CustomerProduct
(CustomerID int NOT NULL
, ProductID int NOT NULL
, CreatedBy varchar(50) NOT NULL DEFAULT ORIGINAL_LOGIN()
, CreatedDate datetime2(3) NOT NULL DEFAULT GETUTCDATE()
CONSTRAINT pk_dbo_CustomerProduct PRIMARY KEY CLUSTERED (CustomerID, ProductID))

You might think that if we needed to search on ProductID, CustomerID and CreatedBy, we would add the CreatedBy column to the NonClustered index key we created earlier. However, as we know that CustomerID and ProductID as a pair are unique (because they are the primary key), there can only ever be a single row returned for each combination of that pair.  Adding the CreatedBy column to the index key therefore gives us no actual search benefit at all, as it is effectively unsorted data. Instead, adding this column bloats the root and intermediate pages in the index by up to 50 bytes per row.  With two integers in our key, we can store over 1000 rows in our root and intermediate pages.  If the CreatedBy column was added, and if that column happened to be filled to the 50 character limit, this would reduce to 138: we have just made the root and intermediate levels seven times larger than they actually need to be!  With small row counts this may not seem significant, but over large volumes this could have a serious effect to the number of pages required to store, and more important read from, the index.

Ultimately, we shouldn’t add any further columns to an index key once the key defined is unique.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s