Grouping and Aggregates

Introduction

Aggregates are SQL Server functions that provide calculations for columns in a dataset. An aggregate is a T-SQL function that performs a calculation on multiple rows of data. The main aggregates are COUNT, SUM, MIN, MAX and AVG. They can be used in a standalone fashion or as part of a query to calculate data by groups.

The GROUP BY function is used to instruct SQL Server to summarize data based on records that share values in one or more columns. For example, the GROUP BY function could be used to group records by state code, and then find the average of another column.

 

Consider a table for a small hardware store with the following records in a Products table. Here is a subset of the data:

There are foreign keys to two other tables (Departments and ProductTypes), but we will work with just this table for the moment. 

Counting Rows

The COUNT() function is used to return a count of rows.

For example, how many products does this store carry? We could simply execute a query using

select * from s06.products

but that returns a table. What we would like is just a single number: the answer to our question. This can be accomplished using COUNT:

Notice that COUNT takes a parameter and that we use an asterisk (*). This is because we are asking for a count of the rows, not any specific column. This is the standard way of using COUNT.

We can also assign the results of a query like this to a variable. For instance:

There are three things happening here. First, we declare a variable (@productscount) to store the answer we are going to receive from the query. Aggregates return numeric values, so our variable needs to be a number (int, decimal, etc.)

Second, we execute the query with the aggregate and assign the value to the variable using SELECT. This is a common requirement.

Lastly, we concatenate some text with the numeric variable, and use the SELECT statement to display the results. Because strings and numbers cannot be directly concatenated in T-SQL, we convert the variable to a string using CONVERT as part of this SELECT statement.

This same approach can be used for other aggregates. For example, what if we wanted to know “What is the most expensive product for this store?”.

Using the MAX and MIN Functions

MAX() and MIN() are used to find the highest and lowest values in a dataset. These functions are similar to COUNT(), except they require a column name as a parameter. Here is an example of using MAX for the price column:

This example illustrates using MAX() by itself and also with a set of statements using a variable and concatenation. MIN() works the same way, but finds the lowest value for a column. Notice that MAX() and MIN() do not “calculate” anything; they simply locate the single value that fits the request. There may actually be 10 rows that all have the same MAX() value, but the function only returns the highest value as a single number – not the number of rows with that value.

Finding an Average

The AVG() function is similar to MAX() and MIN(). The syntax is the same. The difference is that AVG() does do a set of calculations. It counts the number of rows with non-null values for the column requested, sums the total of those columns, and calculates the average.  It returns a single number.

This example again shows the function used by itself and as part of a concatenated string:

You may notice that the variable is defined as a smallmoney data type in this example. A decimal could also be used. Smallmoney contains four decimal places, and the first result illustrates that. When converted to a varchar, the result is rounded by default, which works well for this example.

Remember that AVG() only evaluates non-null values, meaning that if there were many rows where price is NULL, those would not affect the calculation. AVG() does not default values to zero, for example, or include those rows in the number of rows used in the averaging calculation.

Using SUM to Total Values

The SUM() function is used to add up or sum the values for multiple records. That does not make sense for a column like price, but it might be useful for something like qtyonhand. In other words, what is the total count or inventory of hardware products in this store?


 

A More Complicated Example

While the above examples use very simple SELECT statements, any of them can be enhanced with a WHERE clause to product different results. Multiple aggregate functions can also be used in the same statement.

For example, we could run the query illustrated below to retrieve a set of statistics for the products belonging to departmentid=1:

In this case, each column is assigned an alias, so the column headings have more meaning. 

Grouping Data

While aggregate functions can be useful on their own, as illustrated in the above examples, they are also very powerful when data is grouped.

Grouping is accomplished by using the GROUP BY clause. The GROUP BY clause tells SQL Server to group records based on one or more columns having the same values, before using the aggregate functions.

The T-SQL SELECT statement in the previous section isolated the records for a single department using the WHERE clause. GROUP BY will let us do the same thing, except we can do it for all departments.

Consider this query:

We previously used multiple aggregates in the same query. Here, we are requesting multiple aggregates again. We are also using the GROUP BY clause to instruct SQL Server to group rows by departmentid first, before performing the aggregate calculations. In other words, we are asking SQL Server to group each set of rows that have the same departmentid, then perform the aggregate calculations on only those rows.

We include departmentid in the SELECT so we can tell what department each set of data refers to. The GROUP BY clause is placed after the FROM clause (and the WHERE clause, when it exists). The ORDER BY clause is always placed last. We include an ORDER BY clause if we want to see the rows in a specific order; otherwise, SQL Server returns the data in a random order.

More Advanced Examples

The above example is very useful; it provides a lot of information with a relatively simple query. But how do we know the name for each department?

Normally we would perform a JOIN to retrieve data across multiple tables, and it is no different when using GROUP BY:

It would not be absolutely necessary to include the departmentid here, but perhaps we want to see those values. This is important because one of the rules of using GROUP BY is that any non-aggregate columns included in the SELECT statement must also be included in the GROUP BY statement. In this case, we are including two non-aggregate columns (departmentid, departmentname). Those columns basically refer to the same thing, but the Query Manager will throw an error if we do not include both of them in the GROUP BY statement.

As with all other cases, this example could be further defined using a WHERE clause.