Database Sizing

Introduction

Once a project decides that a database will be needed, there are many questions that must be answered. Some of these are related to the database structure (tables, columns, relationships, rules). But there are also questions related to the ""physical" database, including

  • What types of data needs to be stored?
  • How "large" is the data?
  • How many estimated records will be stored?
  • How big does the database need to be in the beginning?
  • How fast will the database grow?
  • How much activity (and how large is that data) during typical operations?
  • How often will the database be backed up?
  • Are there occasional operations that will require a great deal of storage capacity, even for a limited time? (Imports, for example.)

Setting Up a New Database

Many cloud hosting sites provide a standard set of sizes when creating a new database. SmarterASP, for example, allows a new database to be created in increments of 50MB – with 50MB being the smallest. 50MB is an adequate size database for many small applications, such as a small website store.

Other sites, such as Azure, start databases as 1GB and have starting sizes such as 5 GB and 10GB, with different price points per size.

Many web hosting sites, such as GoDaddy.com and Namecheap.com, will often provide a free database with the purchase of a web hosting account. These are often $5 or less per month, and sometimes there is a choice between a free MSSQL or MySQL database.

( Caveat:

If the application or requirements are for a small number of tables, a set of tables with a limited amount of data, etc., a 50MB database is plenty. Everything for the DBMS 110 class is usually less than 50MB – for the entire class.

Calculating Estimated Database Size

Microsoft provides a very detailed approach to calculate database size requirements:

https://docs.microsoft.com/en-us/sql/relational-databases/databases/estimate-the-size-of-a-heap

This can be VERY complicated. If needing to configure very large databases, it might be necessary to perform such very detailed calculations. In other cases, general estimates are usually enough.

What are the main things to know or estimate?

  • How many tables will there be?
  • What is the general size of each table row, multiplied by the estimated rows in each table?
  • Are there requirements for very broad tables (large rows with large columns)?
  • Are there requirements for very large tables? (thousands or millions of rows?)
  • Are there requirements for LOTS of activity / transactions?
  • Are there requirements for very large columns, to store images, XML, etc.?

The size for string columns can be estimated using 1 byte per character, using the average or maximum size for each column (Unicode columns are twice as large for the same data). For example, assume a column is needed for "CompanyName":

  • the maximum size for the column is 80 characters
  • the average size of each company name is estimated at 40 characters
  • an expected 200 companies will be initially tracked
  • an expected 5 new companies will be added each month during the first year

The initial estimated amount of space needed for this single column is at 1 byte per character, or 80 bytes x 200 records = 16,000 bytes. Since the average estimated size per entry is 40 characters, the needed space may be between 8,000 bytes (40x200) and 16,000 bytes (80x200).

It is also helpful to consider how much data is expected to be added in the near future. In this case, it is estimated that 60 records (x80 bytes each) or 4,800 bytes may be needed in the first year for new companies.

(IF a column will need to hold Unicode characters, such as Asian, Greek, Check and Hindi languages, then the column type needs to NVARCHAR – and the data storage should be multiplied by 2.)

Dates and number columns have fixed number sizes (Dates might be 4 or 8 bytes per column; numbers are typically 1,2,4 or 8 bytes).

The bit datatype is unique. In SQL Server, it requires 1 bit per column. This means it requires 1 byte for each set of 1-8 bits.

Other references on storage sizes by data type:

Strategy

The general approach to calculate required storage needs is to

  1. determine what tables, columns and data types are needed
  2. calculate the estimate size needed for 1 row in each table
  3. estimate the number of records initially needed for each table
  4. estimate the number of records that will be needed for each table in the near future (such as one year)

Example – 1 Table

Let's consider a single table for customers. Assume the initial design for this table is:

CustomerID int
CustomerNamevarchar(80)
Address varchar(100)
City varchar(40)
StateCode char(2)
PostalCode varchar(10) - why varchar?
CreditLimit smallmoney
DateStart date
DateCreated smalldatetime – why?
IsActive bit

How big is a full row?

An int is 4 bytes. Smallmoney is 4 bytes. Date is 3 bytes; smalldatetime is 4 bytes; datetimeoffset required 8-10 bytes depending on the fractional seconds stored.

Up to 8 bits are 1 byte. So, the non-string columns for 1 row would require 16 bytes.

1 int, 1 smallmoney, 1 date, 1 small date, 1 bit

(1 * 4) + (1 * 4) + (1 * 3) + (1 * 4) + (1 * 1) = 4 + 4 + 3 + 4 + 1 = 16

If we assumed every string column will be completely filled, that is another 232 bytes. This is a total of 248 bytes. SQL Server uses some additional storage to track non-null columns, etc., but let's use 250 for a relatively round number.

This means 4 customers would require roughly 1000 bytes. A megabyte (MB) is roughly 1 million bytes. Using the rough numbers is okay and a conservative estimate.

So if we anticipated having 4000 customers, we would need roughly 1MB of storage space.

If we anticipated 50 tables of about the same row size and number of records, we would need 50MB. That is a lot of data.

In reality, if our initial estimated row and table sizes total X, we would want to multiply that by 1.4 or 1.5 to estimate a comfortable starting database size. So, if we estimated our initial data needed 20 MB, we might want a database of at least 30 MB for indexes, transactions, large import or other operations, etc.

We could also use various formulas – such as the AVERAGE size of data in table rows – to tweak the estimated storage needs.

Special things to consider:

  • Indexes use storage. That includes non-clustered, unique and foreign-key indexes.
  • The transaction log retains every data activity. (The transaction log is part of the database). If there are operations that import (insert, delete, update) large sets of data – the database must have room to track those operations.
  • Transaction logs are cleared when a full backup is performed. This reduces the overall storage needed, since the transaction log starts empty again.
  • The database can be manually compressed on a periodic basis – although this slows performance while the compression takes place.

Planning for Growth, Typical and Unusual Activity

Growth Projections

In addition to the beginning stored requirements, one must plan for growth. Will the client add 10 customers a week or 4,000 customers per week? What tables are likely to grow quickly, and what are the data requirements for that growth?

MSSQL databases can be set to grow automatically – but it requires an extension process that can be slow. If a user is submitting a web order at the time the database has to be extended, there might be a 1-2 minute delay. It is best to avoid the AUTOGROW process when possible by periodically checking the database size and enlarging it when needed.

Cloud servers will often NOT automatically extend the database to a larger size. All activity might stop until the approval (and payment) process is completed. This depends on the hosting service, type of account, etc.

Typical and Unusual Activity

It has already been discussed that transactions can add size to the database. Consider a daily import process where an Admin user imports 9 or 10 large spreadsheets of data to analyze and eventually import into the database. Assume an example where the import process for the spreadsheet consume 3-4 MB of DB space. If a full database backup is performed nightly, this is probably a minor size bump. But what if the Admin user sometimes runs the same process 10 times before getting it right? Now the import process might consume 30MB in a single day! If the maximum DB size is set to 50MB, the database might run out of space and be locked until it is manually compressed or extended. The company might incur hours of down time due to a database that is too small to handle this typical or unusual activity.

Storage Planning Strategies

Normally, a new application (or new database) will have multiple versions: development, staging, test, production.

Development and Staging are good places to stress the database (and application) with large amounts of data, large numbers of transactions, etc. Besides making sure the database design is accurate and optimal, this process can assist with optimizing indexes, pinpointing bottlenecks, etc.

If a database has the potential to grow quickly, it makes sense from a planning perspective to start with a larger database and reduce it later, if possible.

Using T-SQL to Gather Statistics

If one has existing tables that are good examples of a database design that needs to be analyzed, there are a couple of statements that can help analyze the data:

Count: counting the rows in each table helps in calculating overall size requirements

LEN(column): the length command measures the actual size of string data in a column. By using MAX(LEN(columnname)) it is possible to determine what is the longest data currently in a column. However, LEN() only counts characters – not data space. If all the data is non-Unicode, this is sufficient.

A better way to determine data space used is to use the DATALENGTH() command. DATALENGTH will calculate the actual bytes used in a column, including both Unicode data.