When to create Indexes on Temp Tables

Creating indexes on Temp Tables sometimes appears to be a contentious issue. Among other things, creating them prevents Temporary Table caching, as described by Paul White in his blog post on Temp Table Caching Explained.
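
As a quick illustration of the caching point, here’s a minimal sketch of my own (the procedure and table names are invented, and this is just one of the conditions Paul describes): a temp table created inside a stored procedure can normally be cached and reused between executions, but DDL issued against it after the CREATE TABLE, such as a separate CREATE INDEX, prevents that caching:-

--a minimal sketch: procedure and table names are invented for illustration
CREATE PROCEDURE dbo.CachingExample
AS
BEGIN
	CREATE TABLE #Work
	(WorkID int IDENTITY(1,1) NOT NULL
	, WorkValue varchar(255))

	--DDL issued after the table is created (like this CREATE INDEX)
	--stops the temp table being cached for reuse between executions
	CREATE NONCLUSTERED INDEX ix_Work_WorkValue
	ON #Work (WorkValue)

	--...do some work with #Work...
END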

However, if we have a Temp Table with a large volume of data, and are then going to be manipulating subsets of that data, indexes are beneficial in the same way as they are on a “normal” table: they allow SQL Server to perform seek operations instead of just table scans, which can drastically improve the performance of your queries.

A complaint that I’ve often heard is “surely the processing required to create an index on a temp table outweighs the performance gains from any subsequent queries using that temp table?” So, I wanted to look at the cost of creating indexes on Temp Tables, and whether it is better to create the index prior to population, or after. To demonstrate, I’m going to create a couple of tables: one to hold three books that I’m going to send out to some customers, the other to do the mapping between the customer and the book (personally, I’d be hoping for the panda fact book….) Here’s the DDL to create the tables and populate them with some data:-

--Temp tables need index love too!

--clean up tables

IF OBJECT_ID('Tempdb..#Book')
	IS NOT NULL
DROP TABLE #Book

IF OBJECT_ID('Tempdb..#CustomerBook')
	IS NOT NULL
DROP TABLE #CustomerBook

--create a temp table to hold our books
CREATE TABLE #Book
(BookID int IDENTITY(1,1) NOT NULL
, BookTitle varchar(255)
, BookAuthor varchar(255))

--insert a few books
INSERT INTO #Book
        (BookTitle,
		BookAuthor)
VALUES
('100 Facts About Pandas', 'David O''Doherty, Claudia O''Doherty & Mike Ahern' )
, ('1984', 'George Orwell')
, ('An Astronaut''s Guide to Life on Earth', 'Chris Hadfield')

--create a table to hold the ids of the customers who are going to be sent a particular book
CREATE TABLE #CustomerBook
(CustomerID int IDENTITY(1,1)
, BookID int)

INSERT INTO #CustomerBook
        (BookID
        )
SELECT CustomerID % 3 + 1
FROM	(
		SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CustomerID
		FROM syscolumns c1
		CROSS JOIN syscolumns c2
		) c

We should now have two tables populated: one with 3 rows (our book table) and one with 100,000 customers, evenly split between the three books. Note that I haven’t created any indexes on these tables yet. I’m going to run a few queries to see how this performs.

--turn our metrics gathering on
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS XML ON

--Get the Book Title that each customer will receive
SELECT b.BookID
, b.BookTitle
, cb.CustomerID
FROM #Book AS b
INNER JOIN  #CustomerBook AS cb
	ON b.BookID = cb.BookID

The query plan unsurprisingly shows us two table scans, with the inputs being Hash joined:-

The little warning triangle is also suggesting a missing index:-

/*
Missing Index Details from 
The Query Processor estimates that implementing the following index could improve the query cost by 82.8996%.

WARNING: This is only an estimate, and the Query Processor is making this recommendation based solely upon analysis
of this specific query. It has not considered the resulting index size, or its workload-wide impact, including its
impact on INSERT, UPDATE, DELETE performance. These factors should be taken into account before creating this index.
*/
USE [tempdb]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[#CustomerBook_______________________________________________________________________________________________________00000001CBA5] ([BookID])
INCLUDE ([CustomerID])
GO

Meanwhile, the results from the STATISTICS IO/TIME give us the following output (I’m using StatisticsParser to get a nicer reading of the output):-

At this point, I’m not that fussed about the table scans; after all, we’re returning all rows from each table. The Hash Match might be overkill though. Let’s try some more queries before we go ahead and stick the suggested index on:-

--let's try a query that returns a subset of that data:-
SELECT b.BookID
, b.BookTitle
, cb.CustomerID
FROM #Book AS b
INNER JOIN  #CustomerBook AS cb
	ON b.BookID = cb.BookID
WHERE b.BookID = 1

Again, unsurprisingly, everything is being table scanned (there are no indexes, so there’s no option to do anything but this) and, despite only returning a third of the rows, our I/O is exactly the same as for the previous query. The query plan is also recommending the same index. If we were to run more selective queries, the result would be the same: a table scan of the CustomerBook table, and the full 212 pages being read every single time we queried it. The table is behaving in exactly the same way as a “normal” table.
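
For instance, a query along these lines (the CustomerID range is made up purely for illustration) would still produce a full scan of #CustomerBook and the same 212 page reads:-

--a much more selective query; with no index, this still results in a
--full table scan of #CustomerBook (the CustomerID range is illustrative)
SELECT cb.CustomerID
, cb.BookID
FROM #CustomerBook AS cb
WHERE cb.CustomerID BETWEEN 1 AND 100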

So: I think it’s fair to say we could do with an index on the BookID column of the CustomerBook table, to prevent it from being table scanned every time we read anything from it. I pretty much agree with the index suggested, so I’m going to give it a proper name and go ahead and create it. I’m also going to add a primary key on CustomerID, as I know this is a unique value. At this point, I’m going to look at whether we do better creating the indexes before populating the table, or afterwards. Let’s try with all the DDL done first:-

IF OBJECT_ID('Tempdb..#CustomerBook')
	IS NOT NULL
DROP TABLE #CustomerBook  

--create the table with a primary key defined
CREATE TABLE #CustomerBook
(CustomerID int IDENTITY(1,1) PRIMARY KEY
, BookID int)

--create the nonclustered index on an empty table
CREATE NONCLUSTERED INDEX clix_BookID
ON #CustomerBook (BookID)
INCLUDE (CustomerID)

INSERT INTO #CustomerBook
        (BookID
        )
SELECT CustomerID % 3 + 1
FROM	(
		SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CustomerID
		FROM syscolumns c1
		CROSS JOIN syscolumns c2
		) c

The plan is much more complex than before because it now includes the SELECT that generates the data; you’ll need to click the image below to zoom in:-

[Image: TempTables_Insert1Plan]

The first section of the plan, up to the insert, covers the selection of data from syscolumns, the production of the row numbers, and the computation to split the newly created CustomerIDs into three buckets to apply the BookID. The expensive part looks like it’s the sort prior to the nonclustered index insert:-

[Image: TempTables_Insert1PlanSnippet]

If we look at the statistics output, this also confirms it – we end up doing a load of reads (401,915) from the temp table:-

[Image: TempTables_Insert1Statistics]

Ok, so that seems quite expensive; I’m not quite sure why this is such an inefficient sort, so if anyone is able to shed some light on this, I’d very much appreciate it – something for some further research! Anyway, let’s try creating the index and primary key afterwards:-

IF OBJECT_ID('Tempdb..#CustomerBook')
	IS NOT NULL
DROP TABLE #CustomerBook  

CREATE TABLE #CustomerBook
(CustomerID int IDENTITY(1,1)
, BookID int)

INSERT INTO #CustomerBook
        (BookID
        )
SELECT CustomerID % 3 + 1
FROM	(
		SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CustomerID
		FROM syscolumns c1
		CROSS JOIN syscolumns c2
		) c
  
ALTER TABLE #CustomerBook ADD CONSTRAINT pk_#CustomerBook PRIMARY KEY (CustomerID)      

CREATE NONCLUSTERED INDEX clix_BookID
ON #CustomerBook (BookID)
INCLUDE (CustomerID)

This runs a little quicker, and obviously generates three query plans for the three statements. The first plan looks pretty much the same as the plan we had previously, except the insert operator is a Table Insert instead of a Clustered Index Insert.

[Image: TempTables_Insert2Plan1]

The two index builds have pretty much exactly the same plan, with slight nuances showing that the first involves a table scan and sort to populate the PK, and the second a clustered index scan and sort:-

[Image: TempTables_Insert2Plan2]

[Image: TempTables_Insert2Plan3]

Finally, the statistics look much better; instead of the crazy number of page reads for the nonclustered index sort, we just have the 212 reads for each scan. Here are the cumulative stats for the entire operation:-

[Image: TempTables_Insert2Statistics]

This looks much better. However, I’m still not convinced that we need to add the PK afterwards, so let’s try creating it before population, and the index after:-

IF OBJECT_ID('Tempdb..#CustomerBook')
	IS NOT NULL
DROP TABLE #CustomerBook  

CREATE TABLE #CustomerBook
(CustomerID int IDENTITY(1,1) PRIMARY KEY
, BookID int)

INSERT INTO #CustomerBook
        (BookID
        )
SELECT CustomerID % 3 + 1
FROM	(
		SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CustomerID
		FROM syscolumns c1
		CROSS JOIN syscolumns c2
		) c

CREATE NONCLUSTERED INDEX clix_BookID
ON #CustomerBook (BookID)
INCLUDE (CustomerID)

This looks to be the winner: there are just two elements to the plan, the PK insert below and the NCI insert (not shown, as it’s identical to the previous one). And the statistics show that the number of reads is the lowest of all three attempts:-

[Image: TempTables_Insert3Plan]

[Image: TempTables_Insert3Statistics]

OK, so adding the PK first definitely looks to be “free”. But is adding the nonclustered index afterwards always the best plan? That depends on the relative cost of the sort during the combined insert versus the cost of re-sorting the temp table afterwards. Let’s change things up and make our temp table much bigger than it currently is, populating it from our pre-existing temp table:-

IF OBJECT_ID('Tempdb..#BigIntCustomerBook') IS NOT NULL
	DROP TABLE #BigIntCustomerBook

CREATE TABLE #BigIntCustomerBook
(CustomerID bigint NOT NULL
, BookID bigint
, BigDefaultedColumn char(2000) NOT NULL DEFAULT ('lets waste some space'))

INSERT INTO #BigIntCustomerBook
        (CustomerID,
		BookID
        )
SELECT CustomerID
, BookID FROM #CustomerBook

ALTER TABLE #BigIntCustomerBook
ADD CONSTRAINT pk_#BigIntCustomerBook PRIMARY KEY (CustomerID)

CREATE NONCLUSTERED INDEX clix_BookID
ON #BigIntCustomerBook (BookID)
INCLUDE (CustomerID)

Oh dear, now the reads for the sort from the #BigIntCustomerBook table are much bigger, and the elapsed time is much greater:-

[Image: TempTables_BigInsert1Stats]

If we create the index first, how does it look?

IF OBJECT_ID('Tempdb..#BigIntCustomerBook') IS NOT NULL
	DROP TABLE #BigIntCustomerBook

CREATE TABLE #BigIntCustomerBook
(CustomerID bigint NOT NULL PRIMARY KEY
, BookID bigint
, BigDefaultedColumn char(2000) NOT NULL DEFAULT ('lets waste some space'))

CREATE NONCLUSTERED INDEX clix_BookID
ON #BigIntCustomerBook (BookID)
INCLUDE (CustomerID)

INSERT INTO #BigIntCustomerBook
        (CustomerID,
		BookID
        )
SELECT CustomerID
, BookID FROM #CustomerBook

This time, the reads are still really high for the sort operation, but the elapsed time is much less, which would suggest that the larger the target table, the more likely it is that indexing first will be beneficial.

[Image: TempTables_BigInsert2Stats]
