SQL Indexing and Performance Part 1: SQL Storage and Indexing
Άρθρο από Lefteris Karafilis Mon, 29/11/2010 - 13:35
Before we begin our journey in SQL Indexing it is important to understand how SQL Server stores data.
Database FileGroups and Files
A MS SQL database is always using an .mdf file as the primary data file, a number of .ldf files as transaction logs, and optionally a number of .ndf files as secondary data files. Data files (.mdf and .ndf) may be organized to FileGroups for management and allocation purposes. By default the first group that is automatically created is called “Primary File Group” and each .mdf and .ndf file that is allocated it belongs to that group. System tables and the primary data file of a database are always belonging to the “Primary File Group”. Secondary data files may be assigned to a user file group by using the FILEGROUP keyword in a CREATE or ALTER DATABASE statement.
Pages and extents
The fundamental unit of storage in SQL server is the page. A page is always 8KB long and it is allocated at the primary and secondary data files. I/O operations are performed at the page level and for SQL server to efficiently manage the pages; it groups them to a total of 8 contiguous data pages in a unit called extent. Note, that log files are not divided into pages since they consist of a series of log records.
Extents are subdivided in mixed and uniform extents. A uniform extent contains pages from a single object while a mixed extent contains pages from various objects. Since an extent is always contain 8 pages, a mixed type extent can contain up to 8 different page object owners.
Pages from a new table or index are usually allocated to a mixed extent unless you create an index on an existing table that has enough rows to generate 8 pages; in that case it is going to be allocated in a uniform extent. As a table or index, which initially allocated to a mixed extent, grows to a point that it has 8 pages; it switches to use uniform extents for subsequent allocations.
Like extents pages are subdivided into categories: data pages, index pages, large object pages and other pages.
Data pages store all data values except those typed as large types such text, xml, varchar(max), etc which are stored in the large object pages. Since SQL server data rows cannot span multiple data pages, large types are stored in a different page, the large object page, and are referred by a small pointer in the original data page. Some data types like varchar and varbinary, although can grow large, may be stored in the original data or index page if the row does not exceed the 8KB page limit. SQL server can dynamically move forth and back such columns between data page and large object page according to the row size limit.
Index pages store index entries like the index keys and some additional information used by SQL server to locate data and manage index process while other pages are special pages that used by SQL server to store configuration information like index application map, bulk changes map, etc.
Data Page Structure
A data page consists of the 96-byte page header, the data rows and the data row offsets. Page header existence is important because it stores necessary information about the page including the page number, page type, the amount of page’s free space, pointers to the previous and next pages (in case pages are linked together) and the allocation unit ID of the page object owner.
Row offsets table contains 2-byte information about each row in the table, meaning the row number and the row offset byte address in the page. Since our first element in a page is the 96-byte header, the first row in the page is at byte offset 96. If a page has 3 rows, 30 byte long each (including overhead), then the second row will be at offset 96+30=126 and the third at 126+30=156.
Non-clustered indexes and data page relationship
This information is particularly important in the case of non-clustered indexes because such indexes may store a pointer to the data rows in their leaf-level index pages. This pointer is called row ID (RID) and includes information about the file ID, the page ID and the row ID. Each time a non-clustered index refers to a row by RID, it is going to ask for a row number, at a specific page (ID) in a specific file (ID). SQL server then is going to refer to the row offset table to retrieve the desired row by using the byte offset of the desired row. Thanks to the offset table, a row can change position in a page without the need to change the RID. For example, if ROW0 were deleted, SQL server may move ROW2 to ROW0 to keep the contiguous of free space in case of a new row insert. The ID of ROW2 would remain the same although the byte offset would change from 156 to 96 to accommodate the row movement to the different slot in the data page.
Indexes and index page performance
As previously mentioned indexes are stored in index pages separately from data pages, which means that they can allocated to different FileGroups and even stored to different physical HDD from the original data pages to improve performance.
Since I/O operations are performed at the page level, it is important for an index to allocate as many rows as possible per index page. To achieve this you need to consider the index column type and the number of columns in a composite index. Thus it is best to use narrow index keys like INT instead of larger ones and to split large composite keys to smaller chunks whenever possible to accommodate more rows in an index page.
Index page splitting can cause index page fragmentation resulting reduced performance. Thus it is important to take into account index defragmentation. Data page splitting can also affect index performance particularly in non-clustered indexes because of the bookmarking at the leaf level of the index. Consider the following example; you want to retrieve column A, B, C and D from Table T1 and you have a non-clustered index on column B. While fetching the data, SQL is looking at your index and decides to use that index to filter returning data. For each matching row in the index, SQL has to jump back to the data page to retrieve column A, C and D. If there is a page split on the data pages, SQL has to do one more jump to the next page to successfully retrieve all rows.