4. Estimating Rows Using Index Statistics
How does the Query Optimizer use the index statistics to estimate the number of rows that match the SARGs in a query?
SQL Server uses the histogram information when
searching for a known value being compared to the leading column of the
index key column, especially when the search spans a range or when there
are duplicate values in the key. Consider this query on the sales table in the bigpubs2008 database:
select * from sales
where title_id = 'BI3976'
Because there are duplicates of title_id in the table, SQL Server uses the histogram on title_id (refer to Listing 34.5) to estimate the number of matching rows. For the value of BI3976, it would look at the EQ_ROWS value, which is 260.778. This indicates that there are approximately 261 rows in the table that have a title_id value of BI3976.
When an exact match for the search argument is not found as a step in the histogram, SQL Server uses the AVG_RANGE_ROWS value for the next step greater than the search value. For example, SQL Server would estimate that for a search value of 'BI4184', on average, it would >match approximately 300.0652 rows because that is the AVG_RANGE_ROWS value for the step value of 'BI8448', which is the next step greater than 'BI3976'.
When the query is a range retrieval that spans multiple steps, SQL Server sums the RANGE_ROWS and EQ_ROWS values between the endpoints of the range retrieval. For example, when we use the histogram in Listing 34.5, if the search argument were where title_id <= 'BI3976', the row estimate would be 274.8199+639.6047+312.9337+893.1208+ 271.811+637.2789+260.778, or 3290.3470 rows.
As mentioned previously, when the histogram cannot be
used, SQL Server uses just the index density to estimate the number of
matching rows. The formula is straightforward for an equality search; it
looks like this:
Row Estimate = Number of Rows in Table × Index Density |
For example, to estimate the number of matching rows for any given title_id in the sales table, multiply the number of rows in the sales table by the index density for the title_id key (0.001862197), as follows:
select count(*) * 0.001862197 as 'Row Estimate'
from sales
go
Row Estimate
-------------------
314.199188825
If a query specifies both the title_id and stor_id as SARGs, and if the SARG for title_id is a constant expression that can be evaluated at optimization time, SQL Server uses both the index density on title_id and stor_id as well as the histogram on title_id to estimate the number of matching rows. For some data values, the estimated number of matching rows for title_id and stor_id calculated using the index density could be greater than the estimated number of rows that match the specific title_id, as determined by the histogram. SQL Server uses whichever is the smaller of the two to calculate the row estimate.
Multiplying the number of rows in the sales table by the index density for title_id, stor_id (5.997505E-06), you can see that it is nearly unique, essentially matching only a single row:
select count(*) * 5.997505E-06 as 'Row Estimate'
from sales
Row Estimate
--------------
1.011929031125
In this example, SQL Server would use the index density on title_id and stor_id
to estimate the number of matching rows. In this case, it is estimated
that the query will return, on average, one matching row.
5. Generating and Maintaining Index and Column Statistics
At
this point, you might ask, “How do the index statistics get created?”
and “How are they maintained?” The index statistics are first created
when you create the index on a table that already contains data rows or
when you run the UPDATE STATISTICS command. Index statistics
can also be automatically updated by SQL Server. SQL Server can be
configured to constantly monitor the update activity on the indexed key
values in a database and update the statistics through an internal
process, when appropriate.
Auto-Update Statistics
To automatically update statistics, an internal SQL
Server process monitors the updates to a table’s columns to determine
when statistics should be updated. SQL Server internally keeps track of
the number of modifications made to a column via column modification
counters (colmodctrs). SQL Server uses information about the table and the colmodctrs
to determine whether statistics are out of date and need to be updated.
Statistics are considered out of date in the following situations:
When the table size has gone from 0 to >0 rows
When the number of rows in the table at the time the statistics were gathered was 500 or fewer and the colmodctr of the leading column of the statistics object has changed by more than 500
When the table had more than 500 rows at the time the statistics were gathered and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table
If the statistics are defined on a temporary table,
there is an additional threshold for updating statistics every six
column modifications if the table contains fewer than 500 rows.
The colmodctrs are incremented in the following situations:
When a row is inserted into the table
When a row is deleted from the table
When an indexed column is updated
Whenever the index statistics have been updated for a column, the colmodctr for that column is reset to 0.
When SQL Server generates an update of the column
statistics, it generates the new statistics based on a sampling of the
data values in the table. Sampling helps minimize the overhead of the
AutoStats process. The sampling is random across the data pages, and the
values are taken from the table or the smallest nonclustered index on
the columns needed to generate the statistics. After a data page
containing a sampled row has been read from disk, all the rows on the
data page are used to update the statistical information.
Caution
Having up-to-date statistics on tables helps ensure
that optimum execution plans are being generated for queries at all
times. In most cases, you would want SQL Server to automatically keep
the statistics updated. However, it is possible that Auto-Update
Statistics can cause an update of the index statistics to run at
inappropriate times in a production environment or in a high-volume
environment to run too often. If this problem is occurring, you might
want to turn off the AutoStats feature and set up a scheduled job to
update statistics during off-peak periods. Do not forget to update
statistics periodically; otherwise, the resulting performance problems
might end up being much worse than the momentary ones caused by the
AutoStats process.
To determine how often the AutoStats process is being
run, you can use SQL Server Profiler to determine when an automatic
update of index statistics is occurring by monitoring the Auto Stats event in the Performance event class.
If necessary, it is possible to turn off the AutoStats behavior by using the sp_autostats
system stored procedure. This stored procedure allows you to turn the
automatic updating of statistics on or off for a specific index or all
the indexes of a table. The following command turns off the automatic
update of statistics for an index named aunmind on the authors table:
Exec sp_autostats 'authors', 'OFF', 'aunmind'
When you run sp_autostats and simply supply
the table name, it displays the current setting for the table as well as
the database. Following are the settings for the authors table:
Exec sp_autostats 'authors'
go
Global statistics settings for [bigpubs2008]:
Automatic update statistics: ON
Automatic create statistics: ON
settings for table [authors]
Index Name AUTOSTATS Last Updated
------------------------ --------- ------------------------
[UPKCL_auidind] ON 2009-10-19 01:23:47.263
[aunmind] OFF 2010-03-14 22:20:52.177
[_WA_Sys_state_4AB81AF0] ON 2009-10-19 01:23:47.263
[au_fname] ON 2009-10-19 01:23:47.280
[phone] ON 2009-10-19 01:23:47.293
[address] ON 2009-10-19 01:23:47.310
[city] ON 2009-10-19 01:23:47.310
[zip] ON 2009-10-19 01:23:47.310
There are three other ways to disable auto-updating of statistics for an index:
Specify the STATISTICS_NORECOMPUTE clause when creating the index.
Specify the NORECOMPUTE option when running the UPDATE STATISTICS command.
Specify the NORECOMPUTE option when creating statistics with the CREATE STATISTICS command.
You can also turn AutoStats on or off for the entire
database by setting the database option in SQL Server Management Studio;
to do this, right-click the database in Object Explorer to bring up the
Database Properties dialog, select the Options page, and set the Auto
Update Statistics option to False. You can also disable or enable the
AutoStats option for a database by using the ALTER DATABASE command:
ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS { ON | OFF }
Note
What actually happens when you execute sp_autostats or use the NORECOMPUTE option in the UPDATE STATISTICS
command to turn off auto-update statistics for a specific index or
table? SQL Server internally sets a flag in the system catalog to inform
the internal SQL Server process not to update the index statistics for
the table or index that has had the option turned off using any of these
commands. To re-enable Auto Update Statistics, you either run UPDATE STATISTICS without the NORECOMPUTE option or execute the sp_autostats system stored procedure and specify the value 'ON' for the second parameter.
Asynchronous Statistics Updating
In versions prior to SQL Server 2005, when SQL Server
determined that the statistics being examined to optimize a query were
out of date, the query would wait for the statistics update to complete
before compilation of the query plan would continue. This is still the
default behavior in SQL Server 2008. However, the database option, AUTO_UPDATE_STATISTICS_ASYNC, can be enabled to support asynchronous statistics updating.
When the AUTO_UPDATE_STATISTICS_ASYNC option
is enabled, queries do not have to wait for the statistics to be
updated before compiling. Instead, SQL Server puts the out-of-date
statistics on a queue to be updated by a worker thread, which runs as a
background process. The query and any other concurrent queries compile
immediately by using the existing out-of-date statistics. Because there
is no delay for updated statistics, query response times are more
predictable, even if the out-of-date statistics may cause the Query
Optimizer to choose a less-efficient query plan. Queries that start
after the updated statistics are ready use the updated statistics.
Manually Updating Statistics
Whether or not you’ve disabled AutoStats, you can still manually update index statistics by using the UPDATE STATISTICS T-SQL command, whose syntax is as follows:
UPDATE STATISTICS table | view
[ { { index | statistics_name }
| ( { index |statistics_name } [ ,...n ] ) } ]
[ WITH [ [ FULLSCAN ]
| SAMPLE number { PERCENT | ROWS } ]
| RESAMPLE
[ [ , ] [ ALL | COLUMNS | INDEX ]
[ [ , ] NORECOMPUTE ] ]
If neither the FULLSCAN nor SAMPLE
option is specified, the default behavior is to perform a sample scan to
calculate the statistics, and SQL Server automatically computes the
appropriate sample size.
The FULLSCAN option forces SQL Server to
perform a full scan of the data in the table or index to calculate the
statistics. This generates more accurate statistics than using sampling
but is also the most time-consuming and I/O-intensive method. When you
use the SAMPLE option, you can specify a fixed number of rows
or a percentage of rows to sample to build or update the index
statistics. If the sampling ratio specified ever results in too few rows
being sampled, SQL Server automatically corrects the sampling, based on
the number of existing rows in the table or view. At a minimum,
approximately 1,000 data pages are sampled.
The RESAMPLE option specifies that the statistics be generated using the previously defined sampling ratio. This RESAMPLE
option is useful for indexes or column statistics created with
different sampling values. For example, if the index statistics were
created using FULLSCAN, and the column statistics were created using a 50% sample, specifying the RESAMPLE option would update the statistics using FULLSCAN on the indexes and using the 50% sample for the others.
Specifying ALL, COLUMNS, or INDEX specifies whether the UPDATE STATISTICS command affects all existing statistics or only column or index statistics. By default, if no option is specified, the UPDATE STATISTICS statement affects all statistics.
As previously discussed, SQL Server automatically updates the index statistics by default. If you specify the NORECOMPUTE option with UPDATE STATISTICS, it disables AutoStats for the table or index.
When the automatic update statistics option is turned off, you should run the UPDATE STATISTICS command periodically, when appropriate. To determine the last time statistics were updated, you run the following command:
select STATS_DATE(tableid, indexid)
Following is an example:
select STATS_DATE(object_id('authors'), 1)
go
-----------------------
2010-03-15 00:04:51.407
Tip
You can get the index ID from sys.indexes for each index on a table by using the following query:
select name, index_id from sys.indexes
Where object_id = object_id('table_name') and index_id > 0
Column-Level Statistics
In addition to statistics on indexes, SQL Server can
also store statistics on individual columns that are not part of any
indexes. Knowing the likelihood of a particular value being found in a
nonindexed column can help the Query Optimizer better estimate the
number of matching rows for SARGs on the nonindexed columns. This helps
it determine the optimal execution plan, whether or not SQL Server is
using an index to actually locate the rows.
For example, consider the following query:
select stor_name
from stores st
join sales s on (st.stor_id = s.stor_id)
where s.qty <= 100
SQL Server knows the density of the stor_id column in both the sales and stores tables because of indexes on the column in those tables. There is no index on qty. However, if the Query Optimizer were to know how many rows in the sales table had a qty less than 100, it would be better able to choose the most efficient query plan for joining between sales and stores.
For example, assume that, on average, there are approximately 500 sales
per store. However, there are only approximately 5 sales per store
where the qty is less than 100. With the statistics on qty, SQL Server has the opportunity to determine this, and knowing there might be only 5 matching rows per store in sales versus 500, it might choose a different, more efficient, join strategy between the two tables.
Being able to keep statistics on the qty column without having to add it to an existing index with stor_id or create a separate index on qty provides SQL Server with the selectivity information it needs for optimization. By not having to create an index on qty
to generate statistics on the column, you avoid incurring the overhead
of having to maintain the index key rows for each insert, update, and
delete that occurs on the table. Only the index statistics on qty need to be maintained, which is required only after many modifications to the data have occurred.
By default, SQL Server generates column statistics
automatically when queries are optimized and the column is specified in a
SARG or join clause. If no column statistics exist and the Query
Optimizer needs to estimate the approximate density or distribution of
column values, SQL Server automatically generates statistics for that
column. This rule has two exceptions:
Statistics are not automatically created for
columns when the cost of creating the statistics exceeds the cost of the
query plan itself.
Statistics are not
automatically created when SQL Server is too busy (that is, when there
are too many outstanding query compilations in progress).
If you want to disable or re-enable the database option to autocreate statistics in the database, you use the ALTER DATABASE command:
ALTER DATABASE dbname SET AUTO_CREATE_STATISTICS { ON | OFF }
You can also turn the Auto Create Statistics option
on or off for the entire database by setting the database option in
SSMS. In Object Explorer, right-click the database to bring up the
Database Properties dialog, select the Options page, and set the Auto
Create Statistics option to True or False.
Column statistics are stored in the system catalogs. General information about them can be viewed in the sys.stats catalog view. Autogenerated statistics have a name in the format _WA_Sys_colname_systemgeneratednumber. You can retrieve a list of autogenerated column statistics with a query similar to the following:
SELECT cast(object_name(object_id) as varchar(30)) as 'table',
cast (name as varchar(30)) as autostats
FROM sys.stats
WHERE auto_created = 1
AND objectproperty (object_id, 'IsUserTable') = 1
go
table autostats
------------------------------ ------------------------------
table autostats
------------------------------ ------------------------------
authors _WA_Sys_state_4AB81AF0
sales _WA_Sys_ord_num_628FA481
sales _WA_Sys_qty_628FA481
stores _WA_Sys_state_6477ECF3
stores _WA_Sys_zip_6477ECF3
titles _WA_Sys_type_6A30C649
Creating Statistics
If you want finer control over how the column statistics are generated, you can use the CREATE STATISTICS command. Its syntax is similar to that of UPDATE STATISTICS, with the exception that you specify a column or list of columns instead of an index on which to create statistics:
CREATE STATISTICS statistics_name ON table (column [,...n])
[ WITH [ [ FULLSCAN | SAMPLE number { PERCENT | ROWS } ] [,] ]
[ NORECOMPUTE] ]
Any column that can be specified as an index key can
also be specified for statistics, except for XML columns or when the
maximum allowable size of the combined column values exceeds the
900-byte limit on an index key. Statistics can also be created on
computed columns if the ARITHABORT and QUOTED_IDENTIFIER database options are set to ON. In addition, statistics can be created on CLR user-defined type columns if the CLR type supports binary ordering.
If you want to create single-column statistics on all eligible columns in a database, you can use the sp_createstats system procedure:
sp_createstats [[@indexonly =] 'indexonly']
[,[@fullscan =] 'fullscan']
[,[@norecompute =] 'norecompute']
The created statistics have the same name as the
column on which they are created. Statistics are not created on columns
that already have statistics on them (for example, the first column of
an index or a column that already has explicitly created statistics).
To display a list of all column statistics, whether
autogenerated or manually created, you use a query similar to the
previous one, but you include user-created statistics as well:
SELECT cast(object_name(object_id) as varchar(30)) as 'table',
cast (name as varchar(30)) as name,
stats_id
FROM sys.stats
WHERE objectproperty (object_id, 'IsUserTable') = 1
and (auto_created = 1 or user_created = 1)
order by 1, 3
go
table name stats_id
------------------------------ ------------------------------ -----------
authors _WA_Sys_state_4AB81AF0 3
authors au_fname 4
authors phone 5
authors address 6
authors city 7
authors zip 8
discounts discounttype 2
discounts stor_id 3
discounts lowqty 4
discounts highqty 5
discounts discount 6
employee fname 3
employee minit 4
employee job_id 5
employee job_lvl 6
employee pub_id 7
employee hire_date 8
jobs job_desc 2
jobs min_lvl 3
jobs max_lvl 4
publishers pub_name 2
publishers city 3
publishers state 4
publishers country 5
roysched lorange 3
roysched hirange 4
roysched royalty 5
sales _WA_Sys_ord_num_628FA481 3
sales ord_date 4
sales _WA_Sys_qty_628FA481 5
sales payterms 6
stores _WA_Sys_state_6477ECF3 3
stores _WA_Sys_zip_6477ECF3 4
stores stor_name 5
stores stor_address 6
titleauthor au_ord 4
titleauthor royaltyper 5
titles _WA_Sys_type_6A30C649 3
titles pub_id 4
titles price 5
titles advance 6
titles royalty 7
titles ytd_sales 8
titles notes 9
titles pubdate 10
To remove a collection of statistics on one or more columns for a table in the current database, you use the DROP STATISTICS command, which has the following syntax:
DROP STATISTICS {table | view}.statistics_name
Be aware that dropping the column statistics could
affect how your queries are optimized, and less efficient query plans
might be chosen. Also, if the Auto Create Statistics option is enabled
for the database, SQL Server is likely to automatically create
statistics on the columns the next time they are referenced in a SARG or
join clause for a query.
String Summary Statistics
SQL Server 2008 supports string summary statistics for estimating the selectivity of LIKE
conditions. String summary statistics are statistical summaries of
substring frequency distribution for character columns. String summary
statistics can be created on columns of type text, ntext, char, varchar, and nvarchar. String summary statistics allow SQL Server to estimate the selectivity of LIKE conditions, where the search string may have any number of wildcards in any combination, including LIKE
conditions where the first character is a wildcard. In previous
versions of SQL Server, row estimates could not be accurately obtained
when the leading character of a search string was a wildcard character.
String summary statistics allow SQL Server to estimate the selectivity
of any of the following predicates:
String summary statistics include additional information beyond what is displayed by DBCC SHOW_STATISTICS
for the histogram. You can determine whether string summary statistics
have been created for a column or an index by examining the String Index column returned by DBCC SHOW_STATISTICS. If the value is YES, the statistics for that column or index also include a string summary. However, DBCC SHOW_STATISTICS does not display the actual contents of the string summary.