Taking the Lag out of analysis

Like any good DBA, I keep metrics on the databases I administer.  One such metric is database growth. By analyzing the size of a database over time, we can plan for future capacity and avoid costly operations such as file growths.  Comparing the space used within a database file to the actual file size itself is simple enough, but deriving growth rates is a different story.  The calculation is simple enough, but obtaining these values with a simple query isn't so straightforward.  This post is going to explore a few ways to obtain these values from a given data set. Let's check it out.
We're going to use SQLFiddle to display some examples here.  We've got a very simple table containing the following fields: RecordID, RecordDate, DatabaseName, FileSizeGB, and SpaceUsedGB.

Here's a quick query against the table with some records.
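The original fiddle isn't reproduced here, but a minimal sketch of the table and query might look like the following, using SQLite (via Python's sqlite3) as a stand-in for SQL Server. The table name MyTable comes from the post; the sample rows and database name are invented for illustration.

```python
import sqlite3

# A minimal sketch of the metrics table; sample values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        RecordID     INTEGER PRIMARY KEY,
        RecordDate   TEXT,
        DatabaseName TEXT,
        FileSizeGB   REAL,
        SpaceUsedGB  REAL
    )""")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)", [
    (1, '2014-01-01', 'AdventureWorks', 20.0, 10.0),
    (2, '2014-02-01', 'AdventureWorks', 20.0, 10.5),
    (3, '2014-03-01', 'AdventureWorks', 20.0, 11.2),
    (4, '2014-04-01', 'AdventureWorks', 20.0, 12.6),
])

# The quick query against the table
for row in conn.execute("SELECT * FROM MyTable ORDER BY RecordID"):
    print(row)
```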

Our formula for growth rate is ((t1 - t0) / t0) * 100, where t1 is the current value and t0 is the previous value.  The crux of the issue here is displaying our growth rate alongside our used space.  It's not exactly straightforward.  You can attempt to apply a number of self joins, but your query is going to get complicated and messy pretty quickly.  Fortunately, there are a few functions and features we can use to simplify this.
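As a quick worked example of the formula, using two invented month-over-month SpaceUsedGB readings:

```python
# Growth rate formula: ((t1 - t0) / t0) * 100
t0 = 10.5   # previous month's used space, GB (invented value)
t1 = 12.6   # current month's used space, GB (invented value)

growth_pct = ((t1 - t0) / t0) * 100
print(round(growth_pct, 2))  # 20.0 -> used space grew 20% month over month
```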

Recursive Common Table Expressions

Without going into too much detail, a recursive common table expression (CTE) is a CTE that can reference itself. It's often used to develop hierarchies; the example in Books Online is the employee-manager hierarchy.  We're going to use it here to develop a straightforward query to obtain growth rates. Syntactically, a recursive CTE contains everything a standard CTE contains; however, its internal query consists of an anchor member (the initial or starting value) and a recursive member (which is applied repeatedly to the remaining results).

Here's our recursive CTE applied to our data.
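The fiddle isn't embedded here; as a sketch of the shape of the query, here is a recursive CTE run against SQLite with invented sample data. Note that SQLite wants the RECURSIVE keyword, which T-SQL omits; the column names and growth calculation follow the post.

```python
import sqlite3

# Set up the sample table (invented data, SQLite standing in for SQL Server)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE MyTable (
    RecordID INTEGER PRIMARY KEY, RecordDate TEXT,
    DatabaseName TEXT, FileSizeGB REAL, SpaceUsedGB REAL)""")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)", [
    (1, '2014-01-01', 'AdventureWorks', 20.0, 10.0),
    (2, '2014-02-01', 'AdventureWorks', 20.0, 10.5),
    (3, '2014-03-01', 'AdventureWorks', 20.0, 11.2),
    (4, '2014-04-01', 'AdventureWorks', 20.0, 12.6),
])

sql = """
WITH RECURSIVE rCTE AS (
    -- Anchor member: the first record, with no previous value to compare
    SELECT RecordID, RecordDate, SpaceUsedGB,
           CAST(NULL AS REAL) AS GrowthPct
    FROM MyTable WHERE RecordID = 1
    UNION ALL
    -- Recursive member: walk to the next sequential RecordID
    SELECT t1.RecordID, t1.RecordDate, t1.SpaceUsedGB,
           ((t1.SpaceUsedGB - rCTE.SpaceUsedGB) / rCTE.SpaceUsedGB) * 100
    FROM MyTable t1
    JOIN rCTE ON t1.RecordID = rCTE.RecordID + 1
)
SELECT RecordID, SpaceUsedGB, ROUND(GrowthPct, 2) AS GrowthPct
FROM rCTE ORDER BY RecordID
"""
for row in conn.execute(sql):
    print(row)
```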

Our rCTE is composed of the initial anchor, followed by a UNION ALL, followed by the recursive member, which is very similar to the anchor query but includes a join on the rCTE itself.  We recursively cycle through the results via the t1.RecordID = rCTE.RecordID + 1 predicate.

This is all well and good. It gets us what we need; however, there are issues.  The most obvious is that our data is perfect: it's in sequential order, and that makes it pretty easy to create the simplest possible rCTE.  However, we all know that data is never that simple.  We're going to have many databases in our table, and our RecordID is almost never going to be sequential like that.  Granted, we could include the ROW_NUMBER() function to create a sequence of values to join against; I kept this purposefully simple to give an example. Another issue is that this is, well, recursive.  It's going through the results row by row in the recursive member.  I've run this locally to get some statistics and the query plan.  On my machine, this query scanned MyTable 2 times, performing 25 reads, and scanned a worktable 2 times, performing 67 reads.  The query plan shows that the recursive member executes 11 times.
This is not very efficient. It might work fine for small data sets, but for large data sets, I'd look for an alternative.

The LAG Function

SQL Server provides a function called LAG. This function accesses the previous record in a data set.  As it inherently requires some type of ordering, it takes an OVER clause which determines the ordering of the data set.  The LAG function requires a few arguments and is relatively simple to use.  We specify LAG(return value, offset, default).  The return value is the column from the previous record, the offset is the number of rows back, and the default is the value to use in case of NULL, followed by our OVER clause. That's it.

Here it is in our example.
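Again the fiddle isn't reproduced here, so here is a sketch of the LAG version against the same invented SQLite data (window functions need SQLite 3.25+). As an assumption, I've defaulted LAG to the current SpaceUsedGB value so the first row reports 0% growth, and aliased the lagged column SpaceUsedAgain.

```python
import sqlite3

# Set up the sample table (invented data, SQLite standing in for SQL Server)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE MyTable (
    RecordID INTEGER PRIMARY KEY, RecordDate TEXT,
    DatabaseName TEXT, FileSizeGB REAL, SpaceUsedGB REAL)""")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)", [
    (1, '2014-01-01', 'AdventureWorks', 20.0, 10.0),
    (2, '2014-02-01', 'AdventureWorks', 20.0, 10.5),
    (3, '2014-03-01', 'AdventureWorks', 20.0, 11.2),
    (4, '2014-04-01', 'AdventureWorks', 20.0, 12.6),
])

sql = """
WITH cte AS (
    SELECT RecordID, RecordDate, SpaceUsedGB,
           -- Previous row's value; defaults to the current value on row 1
           LAG(SpaceUsedGB, 1, SpaceUsedGB)
               OVER (ORDER BY RecordID) AS SpaceUsedAgain
    FROM MyTable
)
SELECT RecordID, SpaceUsedGB,
       ROUND(((SpaceUsedGB - SpaceUsedAgain) / SpaceUsedAgain) * 100, 2)
           AS GrowthPct
FROM cte ORDER BY RecordID
"""
for row in conn.execute(sql):
    print(row)
```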

For the sake of readability, we've maintained the CTE. It's very simple compared to our rCTE.  We apply our LAG function, specifying the SpaceUsedGB column as our return value, our offset as 1, and our default as SpaceUsedAgain. In our OVER clause, we order by RecordID.  Running both the rCTE and our LAG example side by side, the rCTE example costs 67% of the batch while our LAG example costs only 33%.  A few other bonuses for the LAG example: in terms of I/O, it required only 1 scan on MyTable and only 2 reads, with no scans or reads on the worktable. The query plan shows that the clustered index scan executes once.  So no recursion.  Much better. Lastly, we're no longer dependent on sequential RecordIDs.  In a real-world example, we'll have a filter on the results, so RecordID will not be sequential.

Here's the LAG function outside of a CTE.
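A sketch of the CTE-free version against the same invented data shows why readability suffers: the LAG call has to be repeated for each use in the growth expression.

```python
import sqlite3

# Set up the sample table (invented data, SQLite standing in for SQL Server)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE MyTable (
    RecordID INTEGER PRIMARY KEY, RecordDate TEXT,
    DatabaseName TEXT, FileSizeGB REAL, SpaceUsedGB REAL)""")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)", [
    (1, '2014-01-01', 'AdventureWorks', 20.0, 10.0),
    (2, '2014-02-01', 'AdventureWorks', 20.0, 10.5),
    (3, '2014-03-01', 'AdventureWorks', 20.0, 11.2),
    (4, '2014-04-01', 'AdventureWorks', 20.0, 12.6),
])

sql = """
SELECT RecordID, SpaceUsedGB,
       -- Without the CTE, LAG appears twice in the same expression
       ROUND(((SpaceUsedGB
               - LAG(SpaceUsedGB, 1, SpaceUsedGB) OVER (ORDER BY RecordID))
              / LAG(SpaceUsedGB, 1, SpaceUsedGB) OVER (ORDER BY RecordID))
             * 100, 2) AS GrowthPct
FROM MyTable
ORDER BY RecordID
"""
for row in conn.execute(sql):
    print(row)
```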

Compared to our CTE LAG example, we lose the CTE, obviously.  I think we also lose a little readability, as we have to call LAG a few times and that line gets a bit muddled.  Performance-wise, it's almost identical to our CTE LAG example, with only slight differences in the execution plan.  In a case like this, I'd opt for comfort level, readability, and requirements.  CTEs are useful in some scenarios; however, they have their own set of limitations.

There we have it, two quick ways to perform calculations on sequential records in a result set.

Quick edit: I ran these two methods against a million-row result set.  On my laptop, the rCTE query ran in 14 seconds; the LAG function query ran in 4 seconds, a 250% improvement!