SQL Server – Another reason why using SELECT * is bad for performance

I think it’s fair to say that most database professionals will assert that SELECT * is bad practice, for lots of reasons: it hurts code maintainability, it’s lazy, and it prevents the optimizer from making proper use of covering indexes.  However, a recent LinkedIn post by Mike West made me look into whether there are any performance concerns when using SELECT * versus selecting an explicit column list when there aren’t any supporting covering indexes.

Obviously, if you genuinely need every column, there’s no performance difference between listing them all explicitly and using the SELECT * syntax.  However, what if we only had a clustered index, meaning that the number of page reads and our buffer resource requirements would be identical, but we selected only a subset of those columns?  My instincts tell me that the latter would win, for several reasons:-

  • Less data to return to the client.
  • Less actual processing of the page data on the SQL Server.
  • Less actual processing of the received data by the client.

To prove this, I’m using the following table; it’s a bit ridiculous but serves as a good example in this case.

/* Create a wide table */

CREATE TABLE #WideTable
(ID INT IDENTITY(1,1) PRIMARY KEY
, SmallColumn CHAR(1) NOT NULL
, LargeColumn CHAR(8000) NOT NULL) -- fixed 8000 bytes per row, so each row fills most of a page
GO
/* populate the wide table */

INSERT INTO #WideTable
		(SmallColumn, LargeColumn)
SELECT TOP 10000 'X', 'Y' -- 'Y' gets padded out to the full 8000 characters
FROM sys.columns AS C
CROSS JOIN sys.columns AS C2 -- self cross join just to generate plenty of rows
GO

This gives us a ten-thousand-row table, with one very large column to add a substantial amount of bloat. Note that we only have a clustered primary key and no covering indexes, so there’s no cheating going on here. I’ve then run a SELECT * on the table to ensure everything’s in buffer, so nothing will be skewing the results.
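If you want to sanity-check the setup at this point, this is all that warm-up amounts to, plus an optional size check via sp_spaceused; we’d expect roughly 80 MB of data pages from 10,000 rows of about 8 KB each:-

/* warm the buffer pool so neither test pays a physical read cost */

SELECT *
FROM #WideTable AS WT
GO

/* optional: confirm the row count and on-disk size of the table */

EXEC tempdb..sp_spaceused '#WideTable'
GO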

Also, to confirm that there’s nothing funky going on with row-overflow data, and that everything is held in-row at the page level, you can run the following statement to see that we just have IN_ROW_DATA in our table:-

USE tempdb
GO

SELECT T.name
, P.partition_id
, P.hobt_id
, AU.type
, AU.type_desc
FROM sys.partitions AS P
INNER JOIN sys.tables AS T
	ON T.object_id = P.object_id
INNER JOIN sys.allocation_units AS AU
	ON AU.container_id = P.hobt_id
WHERE T.name LIKE '#WideTable%'
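As a cross-check (just a sketch – the DETAILED scan mode is overkill for a table this size), sys.dm_db_index_physical_stats will also report the allocation unit type, along with the average record size, which should come out at a little over 8,000 bytes per row here:-

/* alternative check: allocation unit type and average record size */

SELECT alloc_unit_type_desc
, page_count
, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats
	(DB_ID('tempdb'), OBJECT_ID('tempdb..#WideTable'), NULL, NULL, 'DETAILED')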

As well as STATISTICS IO and STATISTICS TIME, I’m also going to make use of the “Include Client Statistics” feature of SSMS, which will show me some additional information such as the amount of data transmitted to the client.  To turn it on, you need to click this little button on the SSMS toolbar:-

[Screenshot: the Include Client Statistics button on the SSMS toolbar]

Then I’m going to run the two queries side by side to see how they perform.  I’m expecting the logical reads to be identical, but I’m curious about the CPU time required, and about what the client statistics show for the amount of data transmitted over the network to the client.  My gut says there’s going to be a big difference between the two!

/* turn on our statistics gathering */

SET STATISTICS IO ON
SET STATISTICS TIME ON

/* make sure to turn on show client side statistics at this point */

/* query 1 – SELECT * drags back the 8000-byte column for every row */
SELECT *
FROM #WideTable AS WT

/* query 2 – just the two narrow columns we actually need */
SELECT ID
, SmallColumn
FROM #WideTable AS WT

The results from STATISTICS IO and STATISTICS TIME are pretty interesting:-

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(10000 row(s) affected)
Table '#WideTable__________________________________________________________________________________________________________0000000003FE'. Scan count 1, logical reads 10018, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 6975 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(10000 row(s) affected)
Table '#WideTable__________________________________________________________________________________________________________0000000003FE'. Scan count 1, logical reads 10018, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 10 ms.

As expected, the number of page reads is identical, but the CPU and elapsed times are miles apart! The first query, returning some eighty million more bytes, adds almost seven seconds to our query time!
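That eighty million bytes is just back-of-the-envelope maths rather than a measured figure: the SELECT * query drags the CHAR(8000) column back for every row, so the extra payload works out roughly as follows:-

/* rough estimate of the extra payload carried by SELECT * */

SELECT COUNT(*) * 8000 AS ApproxExtraBytes -- 10,000 rows x 8,000 bytes = ~80,000,000 bytes
FROM #WideTable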

I’m now going to run the two queries separately, to compare the client-side statistics. The first query, with SELECT *, will be Trial 1, and the second will be Trial 2.

[Screenshot: SSMS Client Statistics pane, comparing Trial 1 (SELECT *) against Trial 2 (explicit column list)]

Again, you can see the difference in the amount of data sent to the client, and the massive increase in client processing time that this causes. Any network issues would also hit the SELECT * case harder, due to the significant increase in the number of TDS packets required for the first query.

So, I think it’s fair to draw another conclusion that using SELECT * is bad practice, not just for code maintainability but in a performance sense too – if you don’t need all those columns, don’t select them!
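As a final aside, going back to the covering index point from the introduction: an explicit column list also gives the optimizer the chance to use a narrow covering index, which SELECT * never can. Something like the hypothetical index below would let the second query be satisfied without reading the wide data pages at all (untested here, just a sketch):-

/* hypothetical covering index – the clustering key (ID) is carried in every
   nonclustered index, so this covers SELECT ID, SmallColumn completely,
   while SELECT * would still be forced to scan the wide clustered index */

CREATE NONCLUSTERED INDEX IX_WideTable_SmallColumn
ON #WideTable (SmallColumn)
GO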
