Welcome!

Article

Practical Tips to Reduce SQL Server Database Table Size

Boosting productivity while working with database tables

In this article, I'd like to elaborate on the subject of boosting productivity while working with database tables. Chances are you already know this from multiple resources on database development process.

However, the topic seems to become a front-burner issue when there are continuous data growths - tables become too large that leads to the performance loss.

It happens due to an ill-designed database schema that was not originally designed for handling large volumes of data.

To avoid the performance loss in the context of continuous data growth, you should stick to certain rules when designing a database schema.

Rule # 1 - Minimum Redundancy of Data Types

The fundamental unit of SQL Server data storage is the page. The disk space intended for a data file in a database is logically divided into pages numbered contiguously from 0 to n. In SQL Server, the page size is 8 KB. This means SQL Server databases have 128 pages per megabyte.

Disk I/O operations are performed at the page level. That is, SQL Server reads or writes whole data pages. The more compact data types is used, the less pages for storing that data are required, and as a result, less I/O operations needed.

Introduced in SQL Server, buffer pool significantly improves I/O throughput. The primary purpose of the SQL buffer pool is to reduce database file I/O and improve the response time for data retrieval.

Thus, when compact data types are used, the buffer pool stores larger amount of data on the same amount of pages. As a result, you will not waste the memory and reduce a number of logical operations.

Consider the following example - a table that stores working days of employees.

CREATE TABLE dbo.WorkOut1 (
DateOut DATETIME
, EmployeeID BIGINT
, WorkShiftCD NVARCHAR(10)
, WorkHours DECIMAL(24,2)
, CONSTRAINT PK_WorkOut1 PRIMARY KEY (DateOut, EmployeeID)
)

Are the selected data types correct? The most probable answer is no. It is unlikely that an enterprise has (2^63-1) employees. Therefore, the BIGINT is an unsuitable data type in this case.

We can remove this redundancy and estimate the query execution time.

CREATE TABLE dbo.WorkOut2 (
DateOut SMALLDATETIME
, EmployeeID INT
, WorkShiftCD VARCHAR(10)
, WorkHours DECIMAL(8,2)
, CONSTRAINT PK_WorkOut2 PRIMARY KEY (DateOut, EmployeeID)
)

The following execution plan demonstrates the cost difference which depends on a row size and expected number of rows.

The less data you need to retrieve, the faster query will run.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1919 ms, elapsed time = 33606 ms.

(3492294 row(s) affected)

SQL Server Execution Times:
CPU time = 1420 ms, elapsed time = 29694 ms.

As you can see, the usage of non-redundant data types is a keystone for the best query performance. It also allows reducing the size of problem tables. By the way, you can execute the following query for measuring a table size:

SELECT
table_name = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
, data_size_mb = CAST(do.pages * 8. / 1024 AS DECIMAL(8,4))
FROM sys.objects o
JOIN (
SELECT
p.[object_id]
, total_rows = SUM(p.[rows])
, total_pages = SUM(a.total_pages)
, usedpages = SUM(a.used_pages)
, pages = SUM(
CASE
WHEN it.internal_type IN (202, 204, 207, 211, 212, 213, 214, 215, 216,221, 222) THEN 0
WHEN a.[type]! = 1 AND p.index_id < 2 THEN a.used_pages
WHEN p.index_id < 2 THEN a.data_pages ELSE 0
END
)
FROM sys.partitions p
JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
LEFT JOIN sys.internal_tables it ON p.[object_id] = it.[object_id]
GROUP BY p.[object_id]
) do ON o.[object_id] = do.[object_id]
WHERE o.[type] = 'U'

For the above-considered tables, the query returns the following results:

table_name           data_size_mb
------- -----------
dbo.WorkOut1         167.2578
dbo.WorkOut2         97.1250

Rule # 2 - Use Database Normalization and Avoid Data Duplication

Recently, I have analyzed a database of a free web service that allows formatting T-SQL code. The server part is quite simple over there and consists of a single table:

CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date DATETIME
, format_options XML
)

Every time when formatting SQL code, the following parameters were saved to the database: current session ID, server time, and the settings that were applied while formatting user's SQL code.

This data subsequently were used for determining of most popular formatting styles. There were plans to add these styles to SQL Complete default formatting styles.

However, the service popularity rise led to a significant table rows increase, and profiles processing became slow. The settings had the following XML structure:

&lt;FormatProfile&gt; &lt;FormatOptions&gt; &lt;PropertyValue Name="Select_SelectList_IndentColumnList"&gt;true&lt;/PropertyValue&gt; &lt;PropertyValue Name="Select_SelectList_SingleLineColumns"&gt;false&lt;/PropertyValue&gt; &lt;PropertyValue Name="Select_SelectList_StackColumns"&gt;true&lt;/PropertyValue&gt; &lt;PropertyValue Name="Select_SelectList_StackColumnsMode"&gt;1&lt;/PropertyValue&gt; &lt;PropertyValue Name="Select_Into_LineBreakBeforeInto"&gt;true&lt;/PropertyValue&gt; ... &lt;PropertyValue Name="UnionExceptIntersect_LineBreakBeforeUnion"&gt;true&lt;/PropertyValue&gt; &lt;PropertyValue Name="UnionExceptIntersect_LineBreakAfterUnion"&gt;true&lt;/PropertyValue&gt; &lt;PropertyValue Name="UnionExceptIntersect_IndentKeyword"&gt;true&lt;/PropertyValue&gt; &lt;PropertyValue Name="UnionExceptIntersect_IndentSubquery"&gt;false&lt;/PropertyValue&gt; ... &lt;/FormatOptions&gt;

&lt;/FormatProfile&gt;

450 formatting options in total. Each row takes 33 KB in the table. The daily data growth exceeds 100 MB. As an obvius outcome, the database has been expanding day by day, thus making data analysis yet more complicated .

Surprisingly, the salvation turned out to be quite easy: all unique profiles were placed into a separate table, where a hash was defined for every set of options. As of SQL Server 2008, you can use the sys.fn_repl_hash_binary function for this.

So the DB schema has been normalized:

CREATE TABLE dbo.format_profile (
format_hash BINARY(16) PRIMARY KEY
, format_profile XML NOT NULL
)
CREATE TABLE dbo.format_history (
session_id BIGINT
, format_date SMALLDATETIME
, format_hash BINARY(16) NOT NULL
, CONSTRAINT PK_format_history PRIMARY KEY CLUSTERED (session_id,format_date)
)

And if the previous query looked like this:

SELECT fh.session_id, fh.format_date, fh.format_options
FROM SQLF.dbo.format_history fh

The new schema required the JOIN usage to retrieve the same data:

SELECT fh.session_id, fh.format_date, fp.format_profile
FROM SQLF_v2.dbo.format_history fh
JOIN SQLF_v2.dbo.format_profile fp ON fh.format_hash = fp.format_hash

If we compare the execution time for two queries, we can hardly see the advantages of the schema changes.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 203 ms, elapsed time = 4698 ms.

(3090 row(s) affected)

SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 4479 ms.

But in this case, the goal was to decrease time for analysis. Before we had to write an intricate query for getting the list of popular formatting profiles:

;WITH cte AS (
SELECT
fh.format_options
, hsh = sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX)))
, rn = ROW_NUMBER() OVER (ORDER BY 1/0)
FROM SQLF.dbo.format_history fh
)
SELECT c2.format_options, c1.cnt
FROM (
SELECT TOP (10) hsh, rn = MIN(rn), cnt = COUNT(1)
FROM cte
GROUP BY hsh
ORDER BY cnt DESC
) c1
JOIN cte c2 ON c1.rn = c2.rn
ORDER BY c1.cnt DESC

Now due to the data normalization, we managed to simplify the query:

SELECT
fp.format_profile
, t.cnt
FROM (
SELECT TOP (10)
fh.format_hash
, cnt = COUNT(1)
FROM SQLF_v2.dbo.format_history fh
GROUP BY fh.format_hash
ORDER BY cnt DESC
) t
JOIN SQLF_v2.dbo.format_profile fp ON t.format_hash = fp.format_hash

As well as to decrease the query execution time:

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 2684 ms, elapsed time = 2774 ms.

(10 row(s) affected)

SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 379 ms.

In addition, the database size has decreased:

database_name    row_size_mb
------ -----
SQLF             123.50
SQLF_v2          7.88

To retrieve a file size, the following query can be used:

SELECT
database_name = DB_NAME(database_id)
, row_size_mb = CAST(SUM(CASE WHEN type_desc = 'ROWS' THEN size END) *8. / 1024 AS DECIMAL(8,2))
FROM sys.master_files
WHERE database_id IN (DB_ID('SQLF'), DB_ID('SQLF_v2'))
GROUP BY database_id

Hope I managed to demonstrate the importance of data normalization.

Rule # 3 - Be careful while selecting indexed columns.

An index is an on-disk structure associated with a table or view that speeds retrieval of rows from a table or a view. Indexes are stored on pages, thus, the less pages is required to store indexes, the faster search process is. It is extremely important to be careful while selecting clustered indexed columns, because all the clustered index columns are included in every non-clustered index. Due to this fact, a database size can increase dramatically.

Rule # 4 - Use Consolidated Tables.

You do not need to execute a complex query on a large table. Instead, you can execute a simple query on a small table.

For instance, we have the following consolidation query

SELECT
WorkOutID
, CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
, DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
, RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
, FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
, TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
, FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
, PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
, QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
, FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
, UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
, NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
, CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID

If there is no need to often change the table data, we can create a separate table

SELECT *
FROM dbo.WorkOutFactorCache

The data retrieval from such consolidated table will be much faster:

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 3448 ms, elapsed time = 3116 ms.

(185916 row(s) affected)

SQL Server Execution Times:

CPU time = 1410 ms, elapsed time = 1202 ms.

Rule # 5 - Every rule has an exception

I've shown a couple of examples that demonstrated how to eliminate redundant data types and shorten queries execution time. But it does not always happen.

For instance, the BIT data type has a peculiarity -  SQL Server optimizes the storage of such columns group on a disk. If a table contains 8 (or less) columns of the BIT type, they are stored in the page as 1 byte. And if the table contains 16 columns of the BIT type, they are stored in the page as 2 bytes etc. The good news is that the table will take up little space and reduce disc I/O.

The bad news is that an implicit decoding will take place while retrieving data,  and the process is very demanding in terms of CPU resources.

Here is the example. Assume we have 3 identical tables containing information about employees work schedule (31 + 2 PK columns).  The only difference between tables is the data type for consolidated values (1- presence, 2 - absence)

SELECT * FROM dbo.E_51_INT
SELECT * FROM dbo.E_52_TINYINT
SELECT * FROM dbo.E_53_BIT

When using less redundant data types, the table size decreases considerably (especially the last table)

table_name           data_size_mb
------- -----
dbo.E31_INT          150.2578
dbo.E32_TINYINT      50.4141
dbo.E33_BIT          24.1953

However, there is no significant speed gain from using the BIT type

(1000000 row(s) affected)

Table ‘E31_INT'. Scan count 1, logical reads 19296, physical reads 1, read-ahead reads 19260, ...

SQL Server Execution Times:

CPU time = 1607 ms,  elapsed time = 19962 ms.

(1000000 row(s) affected)

Table ‘E32_TINYINT'. Scan count 1, logical reads 6471, physical reads 1, read-ahead reads 6477, ...

SQL Server Execution Times:

CPU time = 1029 ms,  elapsed time = 16533 ms.

(1000000 row(s) affected)

Table ‘E33_BIT'. Scan count 1, logical reads 3109, physical reads 1, read-ahead reads 3096, ...

SQL Server Execution Times:

CPU time = 1820 ms,  elapsed time = 17121 ms.

But the execution plan will show the opposite.

So the negative effect from the decoding will not appear if a table contains less than 8 BIT columns. One must note that the BIT data type is hardly used in SQL Server metadata. More often the BINARY data type is used, however it requires manual manipulations for obtaining  specific values.

Rule # 6 - Delete data that no longer required.

SQL Server supports a performance optimization mechanism called read-ahead. This mechanizm anticipates the data and index pages needed to fulfill a query execution plan and brings pages into the buffer cache before they are actually used by the query.

So if the table contains a lot of needless data, it may lead to unnecessary disk I/O. Besides, getting rid of needless data allows you to reduce the number of logical operations while reading from the Buffer Pool.

In conclusion, my advice is to be extremely careful while selecting data types for columns and try predicting future data loads.

More Stories By Jordan Sanders

Jordan Sanders is a Software Marketing Manager at Devart Company. He helps DBAs, software developers (C#, .NET, Delphi) from all around the globe to increase their productivity by using new tools, practices and new approaches to database development and management. He has experience in MySQL, SQL Server, Oracle databases consulting and also in Delphi development. He is always trying to share his knowledge and ideas with the community of his interest.

Latest Stories
"ZeroStack is a startup in Silicon Valley. We're solving a very interesting problem around bringing public cloud convenience with private cloud control for enterprises and mid-size companies," explained Kamesh Pemmaraju, VP of Product Management at ZeroStack, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
In his session at 21st Cloud Expo, Carl J. Levine, Senior Technical Evangelist for NS1, will objectively discuss how DNS is used to solve Digital Transformation challenges in large SaaS applications, CDNs, AdTech platforms, and other demanding use cases. Carl J. Levine is the Senior Technical Evangelist for NS1. A veteran of the Internet Infrastructure space, he has over a decade of experience with startups, networking protocols and Internet infrastructure, combined with the unique ability to it...
"Codigm is based on the cloud and we are here to explore marketing opportunities in America. Our mission is to make an ecosystem of the SW environment that anyone can understand, learn, teach, and develop the SW on the cloud," explained Sung Tae Ryu, CEO of Codigm, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
High-velocity engineering teams are applying not only continuous delivery processes, but also lessons in experimentation from established leaders like Amazon, Netflix, and Facebook. These companies have made experimentation a foundation for their release processes, allowing them to try out major feature releases and redesigns within smaller groups before making them broadly available. In his session at 21st Cloud Expo, Brian Lucas, Senior Staff Engineer at Optimizely, discussed how by using ne...
"CA has been doing a lot of things in the area of DevOps. Now we have a complete set of tool sets in order to enable customers to go all the way from planning to development to testing down to release into the operations," explained Aruna Ravichandran, Vice President of Global Marketing and Strategy at CA Technologies, in this SYS-CON.tv interview at DevOps Summit at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"There's plenty of bandwidth out there but it's never in the right place. So what Cedexis does is uses data to work out the best pathways to get data from the origin to the person who wants to get it," explained Simon Jones, Evangelist and Head of Marketing at Cedexis, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"Cloud Academy is an enterprise training platform for the cloud, specifically public clouds. We offer guided learning experiences on AWS, Azure, Google Cloud and all the surrounding methodologies and technologies that you need to know and your teams need to know in order to leverage the full benefits of the cloud," explained Alex Brower, VP of Marketing at Cloud Academy, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clar...
Large industrial manufacturing organizations are adopting the agile principles of cloud software companies. The industrial manufacturing development process has not scaled over time. Now that design CAD teams are geographically distributed, centralizing their work is key. With large multi-gigabyte projects, outdated tools have stifled industrial team agility, time-to-market milestones, and impacted P&L stakeholders.
Gemini is Yahoo’s native and search advertising platform. To ensure the quality of a complex distributed system that spans multiple products and components and across various desktop websites and mobile app and web experiences – both Yahoo owned and operated and third-party syndication (supply), with complex interaction with more than a billion users and numerous advertisers globally (demand) – it becomes imperative to automate a set of end-to-end tests 24x7 to detect bugs and regression. In th...
Enterprises are moving to the cloud faster than most of us in security expected. CIOs are going from 0 to 100 in cloud adoption and leaving security teams in the dust. Once cloud is part of an enterprise stack, it’s unclear who has responsibility for the protection of applications, services, and data. When cloud breaches occur, whether active compromise or a publicly accessible database, the blame must fall on both service providers and users. In his session at 21st Cloud Expo, Ben Johnson, C...
Data scientists must access high-performance computing resources across a wide-area network. To achieve cloud-based HPC visualization, researchers must transfer datasets and visualization results efficiently. HPC clusters now compute GPU-accelerated visualization in the cloud cluster. To efficiently display results remotely, a high-performance, low-latency protocol transfers the display from the cluster to a remote desktop. Further, tools to easily mount remote datasets and efficiently transfer...
"Infoblox does DNS, DHCP and IP address management for not only enterprise networks but cloud networks as well. Customers are looking for a single platform that can extend not only in their private enterprise environment but private cloud, public cloud, tracking all the IP space and everything that is going on in that environment," explained Steve Salo, Principal Systems Engineer at Infoblox, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Conventio...
"Akvelon is a software development company and we also provide consultancy services to folks who are looking to scale or accelerate their engineering roadmaps," explained Jeremiah Mothersell, Marketing Manager at Akvelon, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"MobiDev is a software development company and we do complex, custom software development for everybody from entrepreneurs to large enterprises," explained Alan Winters, U.S. Head of Business Development at MobiDev, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Agile has finally jumped the technology shark, expanding outside the software world. Enterprises are now increasingly adopting Agile practices across their organizations in order to successfully navigate the disruptive waters that threaten to drown them. In our quest for establishing change as a core competency in our organizations, this business-centric notion of Agile is an essential component of Agile Digital Transformation. In the years since the publication of the Agile Manifesto, the conn...