SQL Server 2008 R2 : Index Statistics (part 2) - Estimating Rows Using Index Statistics, Generating and Maintaining Index and Column Statistics

8/18/2013 11:08:17 AM

4. Estimating Rows Using Index Statistics

How does the Query Optimizer use the index statistics to estimate the number of rows that match the SARGs in a query?

SQL Server uses the histogram information when searching for a known value being compared to the leading column of the index key column, especially when the search spans a range or when there are duplicate values in the key. Consider this query on the sales table in the bigpubs2008 database:

select * from sales
    where title_id = 'BI3976'

Because there are duplicates of title_id in the table, SQL Server uses the histogram on title_id (refer to Listing 34.5) to estimate the number of matching rows. For the value of BI3976, it would look at the EQ_ROWS value, which is 260.778. This indicates that there are approximately 261 rows in the table that have a title_id value of BI3976.

When an exact match for the search argument is not found as a step in the histogram, SQL Server uses the AVG_RANGE_ROWS value for the next step greater than the search value. For example, SQL Server would estimate that for a search value of 'BI4184', on average, it would >match approximately 300.0652 rows because that is the AVG_RANGE_ROWS value for the step value of 'BI8448', which is the next step greater than 'BI3976'.

When the query is a range retrieval that spans multiple steps, SQL Server sums the RANGE_ROWS and EQ_ROWS values between the endpoints of the range retrieval. For example, when we use the histogram in Listing 34.5, if the search argument were where title_id <= 'BI3976', the row estimate would be 274.8199+639.6047+312.9337+893.1208+ 271.811+637.2789+260.778, or 3290.3470 rows.

As mentioned previously, when the histogram cannot be used, SQL Server uses just the index density to estimate the number of matching rows. The formula is straightforward for an equality search; it looks like this:

Row Estimate = Number of Rows in Table × Index Density

For example, to estimate the number of matching rows for any given title_id in the sales table, multiply the number of rows in the sales table by the index density for the title_id key (0.001862197), as follows:

select count(*) * 0.001862197 as 'Row Estimate'
from sales
go

Row Estimate
-------------------
314.199188825

If a query specifies both the title_id and stor_id as SARGs, and if the SARG for title_id is a constant expression that can be evaluated at optimization time, SQL Server uses both the index density on title_id and stor_id as well as the histogram on title_id to estimate the number of matching rows. For some data values, the estimated number of matching rows for title_id and stor_id calculated using the index density could be greater than the estimated number of rows that match the specific title_id, as determined by the histogram. SQL Server uses whichever is the smaller of the two to calculate the row estimate.

Multiplying the number of rows in the sales table by the index density for title_id, stor_id (5.997505E-06), you can see that it is nearly unique, essentially matching only a single row:

select count(*) * 5.997505E-06 as 'Row Estimate'
from sales

Row Estimate
--------------
1.011929031125

In this example, SQL Server would use the index density on title_id and stor_id to estimate the number of matching rows. In this case, it is estimated that the query will return, on average, one matching row.

5. Generating and Maintaining Index and Column Statistics

At this point, you might ask, “How do the index statistics get created?” and “How are they maintained?” The index statistics are first created when you create the index on a table that already contains data rows or when you run the UPDATE STATISTICS command. Index statistics can also be automatically updated by SQL Server. SQL Server can be configured to constantly monitor the update activity on the indexed key values in a database and update the statistics through an internal process, when appropriate.

Auto-Update Statistics

To automatically update statistics, an internal SQL Server process monitors the updates to a table’s columns to determine when statistics should be updated. SQL Server internally keeps track of the number of modifications made to a column via column modification counters (colmodctrs). SQL Server uses information about the table and the colmodctrs to determine whether statistics are out of date and need to be updated. Statistics are considered out of date in the following situations:

When the table size has gone from 0 to >0 rows
When the number of rows in the table at the time the statistics were gathered was 500 or fewer and the colmodctr of the leading column of the statistics object has changed by more than 500
When the table had more than 500 rows at the time the statistics were gathered and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table

If the statistics are defined on a temporary table, there is an additional threshold for updating statistics every six column modifications if the table contains fewer than 500 rows.

The colmodctrs are incremented in the following situations:

When a row is inserted into the table
When a row is deleted from the table
When an indexed column is updated

Whenever the index statistics have been updated for a column, the colmodctr for that column is reset to 0.

When SQL Server generates an update of the column statistics, it generates the new statistics based on a sampling of the data values in the table. Sampling helps minimize the overhead of the AutoStats process. The sampling is random across the data pages, and the values are taken from the table or the smallest nonclustered index on the columns needed to generate the statistics. After a data page containing a sampled row has been read from disk, all the rows on the data page are used to update the statistical information.

Caution

Having up-to-date statistics on tables helps ensure that optimum execution plans are being generated for queries at all times. In most cases, you would want SQL Server to automatically keep the statistics updated. However, it is possible that Auto-Update Statistics can cause an update of the index statistics to run at inappropriate times in a production environment or in a high-volume environment to run too often. If this problem is occurring, you might want to turn off the AutoStats feature and set up a scheduled job to update statistics during off-peak periods. Do not forget to update statistics periodically; otherwise, the resulting performance problems might end up being much worse than the momentary ones caused by the AutoStats process.

To determine how often the AutoStats process is being run, you can use SQL Server Profiler to determine when an automatic update of index statistics is occurring by monitoring the Auto Stats event in the Performance event class.

If necessary, it is possible to turn off the AutoStats behavior by using the sp_autostats system stored procedure. This stored procedure allows you to turn the automatic updating of statistics on or off for a specific index or all the indexes of a table. The following command turns off the automatic update of statistics for an index named aunmind on the authors table:

Exec sp_autostats 'authors', 'OFF', 'aunmind'

When you run sp_autostats and simply supply the table name, it displays the current setting for the table as well as the database. Following are the settings for the authors table:

Exec sp_autostats 'authors'
go

Global statistics settings for [bigpubs2008]:
  Automatic update statistics: ON
  Automatic create statistics: ON

settings for table [authors]

Index Name                         AUTOSTATS Last Updated
------------------------ --------- ------------------------
[UPKCL_auidind]          ON        2009-10-19 01:23:47.263
[aunmind]                OFF       2010-03-14 22:20:52.177
[_WA_Sys_state_4AB81AF0] ON        2009-10-19 01:23:47.263
[au_fname]               ON        2009-10-19 01:23:47.280
[phone]                  ON        2009-10-19 01:23:47.293
[address]                ON        2009-10-19 01:23:47.310
[city]                   ON        2009-10-19 01:23:47.310
[zip]                    ON        2009-10-19 01:23:47.310

There are three other ways to disable auto-updating of statistics for an index:

Specify the STATISTICS_NORECOMPUTE clause when creating the index.
Specify the NORECOMPUTE option when running the UPDATE STATISTICS command.
Specify the NORECOMPUTE option when creating statistics with the CREATE STATISTICS command.

You can also turn AutoStats on or off for the entire database by setting the database option in SQL Server Management Studio; to do this, right-click the database in Object Explorer to bring up the Database Properties dialog, select the Options page, and set the Auto Update Statistics option to False. You can also disable or enable the AutoStats option for a database by using the ALTER DATABASE command:

ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS { ON | OFF }

Note

What actually happens when you execute sp_autostats or use the NORECOMPUTE option in the UPDATE STATISTICS command to turn off auto-update statistics for a specific index or table? SQL Server internally sets a flag in the system catalog to inform the internal SQL Server process not to update the index statistics for the table or index that has had the option turned off using any of these commands. To re-enable Auto Update Statistics, you either run UPDATE STATISTICS without the NORECOMPUTE option or execute the sp_autostats system stored procedure and specify the value 'ON' for the second parameter.

Asynchronous Statistics Updating

In versions prior to SQL Server 2005, when SQL Server determined that the statistics being examined to optimize a query were out of date, the query would wait for the statistics update to complete before compilation of the query plan would continue. This is still the default behavior in SQL Server 2008. However, the database option, AUTO_UPDATE_STATISTICS_ASYNC, can be enabled to support asynchronous statistics updating.

When the AUTO_UPDATE_STATISTICS_ASYNC option is enabled, queries do not have to wait for the statistics to be updated before compiling. Instead, SQL Server puts the out-of-date statistics on a queue to be updated by a worker thread, which runs as a background process. The query and any other concurrent queries compile immediately by using the existing out-of-date statistics. Because there is no delay for updated statistics, query response times are more predictable, even if the out-of-date statistics may cause the Query Optimizer to choose a less-efficient query plan. Queries that start after the updated statistics are ready use the updated statistics.

Manually Updating Statistics

Whether or not you’ve disabled AutoStats, you can still manually update index statistics by using the UPDATE STATISTICS T-SQL command, whose syntax is as follows:

UPDATE STATISTICS table | view
    [ { { index | statistics_name }
          | ( { index |statistics_name } [ ,...n ] ) } ]
   [ WITH [ [ FULLSCAN ]
            | SAMPLE number { PERCENT | ROWS } ]
            | RESAMPLE
        [ [ , ] [ ALL | COLUMNS | INDEX ]
        [ [ , ] NORECOMPUTE ] ]

If neither the FULLSCAN nor SAMPLE option is specified, the default behavior is to perform a sample scan to calculate the statistics, and SQL Server automatically computes the appropriate sample size.

The FULLSCAN option forces SQL Server to perform a full scan of the data in the table or index to calculate the statistics. This generates more accurate statistics than using sampling but is also the most time-consuming and I/O-intensive method. When you use the SAMPLE option, you can specify a fixed number of rows or a percentage of rows to sample to build or update the index statistics. If the sampling ratio specified ever results in too few rows being sampled, SQL Server automatically corrects the sampling, based on the number of existing rows in the table or view. At a minimum, approximately 1,000 data pages are sampled.

The RESAMPLE option specifies that the statistics be generated using the previously defined sampling ratio. This RESAMPLE option is useful for indexes or column statistics created with different sampling values. For example, if the index statistics were created using FULLSCAN, and the column statistics were created using a 50% sample, specifying the RESAMPLE option would update the statistics using FULLSCAN on the indexes and using the 50% sample for the others.

Specifying ALL, COLUMNS, or INDEX specifies whether the UPDATE STATISTICS command affects all existing statistics or only column or index statistics. By default, if no option is specified, the UPDATE STATISTICS statement affects all statistics.

As previously discussed, SQL Server automatically updates the index statistics by default. If you specify the NORECOMPUTE option with UPDATE STATISTICS, it disables AutoStats for the table or index.

When the automatic update statistics option is turned off, you should run the UPDATE STATISTICS command periodically, when appropriate. To determine the last time statistics were updated, you run the following command:

select STATS_DATE(tableid, indexid)

Following is an example:

select STATS_DATE(object_id('authors'), 1)
go

-----------------------
2010-03-15 00:04:51.407

Tip

You can get the index ID from sys.indexes for each index on a table by using the following query:

select name, index_id from sys.indexes
   Where object_id = object_id('table_name') and index_id > 0

Column-Level Statistics

In addition to statistics on indexes, SQL Server can also store statistics on individual columns that are not part of any indexes. Knowing the likelihood of a particular value being found in a nonindexed column can help the Query Optimizer better estimate the number of matching rows for SARGs on the nonindexed columns. This helps it determine the optimal execution plan, whether or not SQL Server is using an index to actually locate the rows.

For example, consider the following query:

select stor_name
   from stores st
   join sales s on (st.stor_id = s.stor_id)
   where s.qty <= 100

SQL Server knows the density of the stor_id column in both the sales and stores tables because of indexes on the column in those tables. There is no index on qty. However, if the Query Optimizer were to know how many rows in the sales table had a qty less than 100, it would be better able to choose the most efficient query plan for joining between sales and stores. For example, assume that, on average, there are approximately 500 sales per store. However, there are only approximately 5 sales per store where the qty is less than 100. With the statistics on qty, SQL Server has the opportunity to determine this, and knowing there might be only 5 matching rows per store in sales versus 500, it might choose a different, more efficient, join strategy between the two tables.

Being able to keep statistics on the qty column without having to add it to an existing index with stor_id or create a separate index on qty provides SQL Server with the selectivity information it needs for optimization. By not having to create an index on qty to generate statistics on the column, you avoid incurring the overhead of having to maintain the index key rows for each insert, update, and delete that occurs on the table. Only the index statistics on qty need to be maintained, which is required only after many modifications to the data have occurred.

By default, SQL Server generates column statistics automatically when queries are optimized and the column is specified in a SARG or join clause. If no column statistics exist and the Query Optimizer needs to estimate the approximate density or distribution of column values, SQL Server automatically generates statistics for that column. This rule has two exceptions:

Statistics are not automatically created for columns when the cost of creating the statistics exceeds the cost of the query plan itself.
Statistics are not automatically created when SQL Server is too busy (that is, when there are too many outstanding query compilations in progress).

If you want to disable or re-enable the database option to autocreate statistics in the database, you use the ALTER DATABASE command:

ALTER DATABASE dbname SET AUTO_CREATE_STATISTICS { ON | OFF }

You can also turn the Auto Create Statistics option on or off for the entire database by setting the database option in SSMS. In Object Explorer, right-click the database to bring up the Database Properties dialog, select the Options page, and set the Auto Create Statistics option to True or False.

Column statistics are stored in the system catalogs. General information about them can be viewed in the sys.stats catalog view. Autogenerated statistics have a name in the format _WA_Sys_colname_systemgeneratednumber. You can retrieve a list of autogenerated column statistics with a query similar to the following:

SELECT cast(object_name(object_id) as varchar(30)) as 'table',
       cast (name as varchar(30)) as autostats
   FROM sys.stats
   WHERE auto_created = 1
     AND objectproperty (object_id, 'IsUserTable') = 1
go
table                          autostats
------------------------------ ------------------------------
table                          autostats
------------------------------ ------------------------------
authors                        _WA_Sys_state_4AB81AF0
sales                          _WA_Sys_ord_num_628FA481
sales                          _WA_Sys_qty_628FA481
stores                         _WA_Sys_state_6477ECF3
stores                         _WA_Sys_zip_6477ECF3
titles                         _WA_Sys_type_6A30C649

Creating Statistics

If you want finer control over how the column statistics are generated, you can use the CREATE STATISTICS command. Its syntax is similar to that of UPDATE STATISTICS, with the exception that you specify a column or list of columns instead of an index on which to create statistics:

CREATE STATISTICS statistics_name ON table (column [,...n])
    [    WITH  [ [ FULLSCAN | SAMPLE number { PERCENT | ROWS } ] [,] ]
        [ NORECOMPUTE]     ]

Any column that can be specified as an index key can also be specified for statistics, except for XML columns or when the maximum allowable size of the combined column values exceeds the 900-byte limit on an index key. Statistics can also be created on computed columns if the ARITHABORT and QUOTED_IDENTIFIER database options are set to ON. In addition, statistics can be created on CLR user-defined type columns if the CLR type supports binary ordering.

If you want to create single-column statistics on all eligible columns in a database, you can use the sp_createstats system procedure:

sp_createstats [[@indexonly =] 'indexonly']
        [,[@fullscan =] 'fullscan']
        [,[@norecompute =] 'norecompute']

The created statistics have the same name as the column on which they are created. Statistics are not created on columns that already have statistics on them (for example, the first column of an index or a column that already has explicitly created statistics).

To display a list of all column statistics, whether autogenerated or manually created, you use a query similar to the previous one, but you include user-created statistics as well:

Code View: Scroll / Show All

SELECT cast(object_name(object_id) as varchar(30)) as 'table',
       cast (name as varchar(30)) as name,
       stats_id
   FROM sys.stats
   WHERE objectproperty (object_id, 'IsUserTable') = 1
     and (auto_created = 1 or user_created = 1)
order by 1, 3
go

table                          name                           stats_id
------------------------------ ------------------------------ -----------
authors                        _WA_Sys_state_4AB81AF0         3
authors                        au_fname                       4
authors                        phone                          5
authors                        address                        6
authors                        city                           7
authors                        zip                            8
discounts                      discounttype                   2
discounts                      stor_id                        3
discounts                      lowqty                         4
discounts                      highqty                        5
discounts                      discount                       6
employee                       fname                          3
employee                       minit                          4
employee                       job_id                         5
employee                       job_lvl                        6
employee                       pub_id                         7
employee                       hire_date                      8
jobs                           job_desc                       2
jobs                           min_lvl                        3
jobs                           max_lvl                        4
publishers                     pub_name                       2
publishers                     city                           3
publishers                     state                          4
publishers                     country                        5
roysched                       lorange                        3
roysched                       hirange                        4
roysched                       royalty                        5
sales                          _WA_Sys_ord_num_628FA481       3
sales                          ord_date                       4
sales                          _WA_Sys_qty_628FA481           5
sales                          payterms                       6
stores                         _WA_Sys_state_6477ECF3         3
stores                         _WA_Sys_zip_6477ECF3           4
stores                         stor_name                      5
stores                         stor_address                   6
titleauthor                    au_ord                         4
titleauthor                    royaltyper                     5
titles                         _WA_Sys_type_6A30C649          3
titles                         pub_id                         4
titles                         price                          5
titles                         advance                        6
titles                         royalty                        7
titles                         ytd_sales                      8
titles                         notes                          9
titles                         pubdate                        10

To remove a collection of statistics on one or more columns for a table in the current database, you use the DROP STATISTICS command, which has the following syntax:

DROP STATISTICS {table | view}.statistics_name

Be aware that dropping the column statistics could affect how your queries are optimized, and less efficient query plans might be chosen. Also, if the Auto Create Statistics option is enabled for the database, SQL Server is likely to automatically create statistics on the columns the next time they are referenced in a SARG or join clause for a query.

String Summary Statistics

SQL Server 2008 supports string summary statistics for estimating the selectivity of LIKE conditions. String summary statistics are statistical summaries of substring frequency distribution for character columns. String summary statistics can be created on columns of type text, ntext, char, varchar, and nvarchar. String summary statistics allow SQL Server to estimate the selectivity of LIKE conditions, where the search string may have any number of wildcards in any combination, including LIKE conditions where the first character is a wildcard. In previous versions of SQL Server, row estimates could not be accurately obtained when the leading character of a search string was a wildcard character. String summary statistics allow SQL Server to estimate the selectivity of any of the following predicates:

Column LIKE 'string%'
Column LIKE '%string'
Column LIKE '%string%'
Column LIKE 'str[abc]ing'
Column LIKE '%abc%xy'

String summary statistics include additional information beyond what is displayed by DBCC SHOW_STATISTICS for the histogram. You can determine whether string summary statistics have been created for a column or an index by examining the String Index column returned by DBCC SHOW_STATISTICS. If the value is YES, the statistics for that column or index also include a string summary. However, DBCC SHOW_STATISTICS does not display the actual contents of the string summary.

Others

- SQL Server 2008 R2 : Index Statistics (part 1) - The Statistics Histogram, How the Statistics Histogram Is Used, Index Densities

- Keeping Windows 7 and Other Software Up to Date : Installing and Removing Software (part 2)

- Keeping Windows 7 and Other Software Up to Date : Installing and Removing Software (part 1)

- Keeping Windows 7 and Other Software Up to Date : Service Packs

- SharePoint 2010 : Performance Monitoring - Implementing Visual Round Trip Analyzer

- SharePoint 2010 : Performance Monitoring - What and how to monitor with Performance Monitor

- SharePoint 2010 : Performance Monitoring - Using SQL Profiler

- SharePoint 2010 : Performance Monitoring - Enabling HTTP Request Monitoring and Throttling

- SharePoint 2010 : Monitoring and Reporting - Enabling the Developer Dashboard

- SharePoint 2010 : Monitoring and Reporting - Troubleshooting with correlation IDs