
SQL in Data Warehouses: Analytical Data Processing. T-SQL Window Functions

 

Description of window functions

A window function is applied to a set of rows. A window is a standard SQL term for the context in which the function operates. To specify a window in SQL, you use the OVER clause. Here's an example query:

USE TSQL2012;

SELECT orderid, orderdate, val,
  RANK() OVER(ORDER BY val DESC) AS [rank]
FROM Sales.OrderValues
ORDER BY [rank];

 

The OVER clause defines the window, that is, the exact set of rows in relation to the current row, along with ordering (when needed) and other elements. In this example there are no elements restricting the rowset of the window, so the window's rowset is the final rowset of the query.

 

To be more precise, a window is the set of rows, or relation, provided as input to the logical query processing step in which the window function appears. Such a definition, however, probably isn't clear yet. So, for simplicity, we will talk in terms of the final query result set; I will give a more precise explanation later.

Ordering is naturally necessary for ranking purposes. In this example, it is based on the val column and provides a descending ranking.

In the example, we applied the RANK function. This function calculates the rank of the current row within a specific rowset according to the sort order. In descending order, as in this case, the rank of a row is one more than the number of rows in the corresponding set that have a greater value than the current row.

Let's pick one of the rows in the sample query's results, say the one with rank 5. The rank is 5 because, under the specified ordering (val descending), the final result set contains four rows whose val value is greater than the current value (11188.40), and the rank is that count plus one.

 

But the most important nuance is that the OVER clause defines the function's window in relation to the current row, and this is true for every row in the query result set. In other words, for each row, the OVER clause defines the window independently of the other rows. This is a very important concept, and it takes some time to absorb. Once you master it, you will get much closer to a real understanding of windowing and its depth. If it doesn't fully click yet, don't worry; for now I am only planting the seed, and I will return to it later.

Support for window functions in SQL was introduced in the SQL:1999 standard, where they were called "OLAP functions". Since then, each new revision has only strengthened that support; I'm referring to SQL:2003, SQL:2008, and SQL:2011. The latest SQL standard provides very broad support for window functions, which is strong evidence that the standards committee believes in them and is likely to keep extending the standard with more window functions and related functionality.

Standards documents can be purchased from ISO or ANSI. For example, the ANSI foundation document for the SQL:2011 standard, which describes the language constructs, is available for purchase.

The SQL standard supports several types of window functions: aggregate, ranking, distribution (analytic), and offset. But remember that windowing is a general concept, so new types may appear in future revisions of the standard.

 

Among the aggregate window functions you'll find the familiar aggregates such as SUM, COUNT, MIN, and MAX, though you are probably used to applying them in the context of grouped queries. An aggregate function needs to operate on a set of rows, whether that set is defined by a grouped query or by a window definition. SQL Server 2005 introduced partial support for window aggregate functions, and SQL Server 2012 extended this functionality.

The ranking functions are RANK, DENSE_RANK, ROW_NUMBER, and NTILE. In the standard, the first pair and the second pair fall into different categories; I'll explain why later. I prefer to combine the four functions into one category for simplicity, just as the official SQL Server documentation does. SQL Server 2005 already implemented all four ranking functions in full.

Analytical functions include PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, and PERCENTILE_DISC. Support for these features was introduced in SQL Server 2012.

Offset functions include LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTH_VALUE. Support for LAG, LEAD, FIRST_VALUE, and LAST_VALUE was introduced in SQL Server 2012. NTH_VALUE is not supported in SQL Server, and SQL Server 2012 is no exception.
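To give a feel for the offset functions, here is a minimal sketch against the same TSQL2012 sample database used above (the column list follows the Sales.OrderValues view as used elsewhere in this lecture). It returns, alongside each order, the customer's previous, next, and first order values:

SELECT custid, orderid, orderdate, val,
  LAG(val) OVER(PARTITION BY custid ORDER BY orderdate, orderid) AS prevval,
  LEAD(val) OVER(PARTITION BY custid ORDER BY orderdate, orderid) AS nextval,
  FIRST_VALUE(val) OVER(PARTITION BY custid ORDER BY orderdate, orderid
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS firstval
FROM Sales.OrderValues;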

Below is a description of the purpose, goals, and features of the various functions.

When mastering any new idea, device, or tool, you have to overcome a certain barrier, even if the new thing is ultimately easier to use; the new is never effortless. So if you are not yet familiar with window functions and are weighing whether they are worth the time to learn and apply, here are some arguments in favor:

  • Window functions let you solve a wide range of data-retrieval problems, and it is hard to overstate this. As I said, I now use window functions in most situations where I need to query data. After covering how the functions work and how they are optimized, I will show examples of their practical application. But to give you an idea of what they can be used for right away, here are tasks that can be solved with window functions (a paging sketch follows this list):

    1. Paging.

    2. De-duplicating data.

    3. Returning the top n rows in each group.

    4. Calculating running totals.

    5. Performing operations on intervals, such as packing intervals and calculating the maximum number of concurrent sessions.

    6. Identifying gaps and islands.

    7. Calculating percentiles.

    8. Calculating the mode of a distribution.

    9. Sorting hierarchies.

    10. Pivoting.

    11. Calculating recency.

  • I've been writing SQL queries for almost a decade now, and I've used window functions extensively for the past few years. I can say that it takes some time to master them, but in many cases window functions are simpler and more "intuitive" than the conventional alternatives.

  • Window functions lend themselves well to optimization.
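As promised, here is a minimal paging sketch (task 1 in the list above), assuming a page size of 25 rows; the ordering columns are an illustrative choice, not the only possible one:

WITH Numbered AS
(
  SELECT orderid, orderdate, val,
    ROW_NUMBER() OVER(ORDER BY orderdate, orderid) AS rownum
  FROM Sales.OrderValues
)
SELECT orderid, orderdate, val
FROM Numbered
WHERE rownum BETWEEN 26 AND 50; -- page 2, 25 rows per page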

Declarative language and optimization

You might wonder why, in a declarative language like SQL, where you simply state what you want to get rather than describe how to get it, two forms of the same query, one with window functions and one without, produce different performance. How is it that a single SQL implementation, such as SQL Server with its T-SQL dialect, does not always "guess" that the two forms are logically identical and create the same execution plan?

There are several reasons for this. For starters, the SQL Server optimizer isn't perfect. Don't get me wrong: the SQL Server optimizer is a marvel given the complexity of the problems it solves, but it cannot implement every possible transformation rule. That's the first reason; the second is that the optimizer has very limited time for optimization. It could spend more time searching, but that extra time has to be paid back by faster query execution.

 

Sometimes the trade-off is extreme: on the one hand, a plan that does not consider every possible alternative but still executes the query in a few seconds can be produced in milliseconds; on the other hand, finding a plan that considers every option could take a year or more, only to shave a few seconds off the query's execution time.

So, from a practical point of view, the optimizer has very limited time for optimization. Based on certain inputs, including the sizes of the tables involved in the query, SQL Server computes two values: the cost of a plan it would consider good enough, and the maximum time worth spending on optimization. When either threshold is reached, SQL Server uses the best execution plan found by that point.

The structure of window functions is such that they are often better optimized than other methods of solving the same problems.

From all of this you should take away the following: switching to window functions requires a certain conscious effort, because it is a new concept you need to get used to. But after that, using window functions becomes simple and intuitive; remember how difficult the gadgets you now can't imagine life without once seemed.

Two types of programming: set-based, and cursor-based with iterative passes

T-SQL solutions for querying data are often divided into two types: set-based, and iterative, cursor-based ones. T-SQL developers agree that the set-based approach should be preferred, yet cursors are still used in many solutions. This raises some interesting questions. Why is the set-based approach considered preferable? And if it is the recommended one, why do so many developers take the iterative approach? What stops people from adopting the recommended way?

To understand this, you need to understand the foundations of T-SQL and what the set-based approach really means. When you do, you will realize that for most people sets are not intuitive, and iterative logic is easier to grasp. The gap between iterative and set-based thinking is quite wide. It can be narrowed, but not easily. This is where window functions can play an important role: I think they are a great tool for bridging the gap between the two approaches and providing a smoother transition to thinking in sets.

 

Therefore, I will first explain what the set-based approach to solving data-retrieval problems in T-SQL is. T-SQL is a dialect of standard SQL (both the ISO and ANSI standards). SQL is based on (or is an attempt to implement) the relational model, a mathematical model for data management originally formulated and proposed by E. F. Codd in the late 1960s.

The relational model rests on two mathematical foundations: set theory and predicate logic. Many aspects of computing are driven by intuition, and they change very quickly, so quickly that it can sometimes feel like a cat chasing its own tail. The relational model is an island in the computing world because it is based on a much more solid foundation: mathematics. Some consider mathematics the ultimate truth. Its strict mathematical underpinnings give the relational model reliability and stability. It evolves, but not as fast as other areas of computing. For several decades the relational model has remained solid, and it now underlies the leading database platforms, known as relational database management systems (RDBMS).

SQL is an attempt to create a language based on the relational model. It is imperfect and, frankly, deviates from the relational model in a number of ways, but it provides enough tools that a person who understands the relational model can exercise relational capabilities through SQL. It is the undisputed de facto language of modern RDBMSs.

However, as I said, relational thinking is in many ways not intuitive. Part of the difficulty in learning to think in relational terms lies in the fundamental difference between iterative and set-based approaches. This is especially difficult for those who are accustomed to working with procedural programming languages in which interaction with data in files is carried out sequentially, as the following pseudocode demonstrates:

open file
fetch first record
while not end of file
begin
  process record
  fetch next record
end

The data in the files (or, more precisely, in indexed sequential access files, or ISAM files) is stored in a specific order. And you are guaranteed to receive records from the file in this order. Also, records can be received one at a time. Therefore, you get used to the fact that data is accessed in this way: in order and one record at a time. This is similar to working with a cursor in T-SQL. For this reason, for developers with procedural programming skills, using cursors or other iterative mechanisms is consistent with their experience and seems to be a logical way to work with data.
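For comparison, here is the same record-at-a-time pattern as a minimal T-SQL cursor sketch (the processing step is left as a comment; the column types are assumptions):

DECLARE @orderid INT, @val NUMERIC(12, 2);

DECLARE C CURSOR FAST_FORWARD FOR
  SELECT orderid, val
  FROM Sales.OrderValues
  ORDER BY orderid;

OPEN C;
FETCH NEXT FROM C INTO @orderid, @val;

WHILE @@FETCH_STATUS = 0
BEGIN
  -- process the current record here
  FETCH NEXT FROM C INTO @orderid, @val;
END;

CLOSE C;
DEALLOCATE C;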

A relational, set-based approach to working with data is fundamentally different from that experience. To better understand it, let us begin with the definition of a set given by the creator of set theory, Georg Cantor: "By 'set' we mean the joining into some whole M of certain clearly distinguishable objects m of our contemplation or our thinking (which will be called the 'elements' of the set M)."

It's a very succinct definition of a set — I'd have to spend a lot of time to articulate that thought. But for the purposes of our discussion, I will focus on two aspects — one explicitly stated in the definition, and the other implicitly:

Whole

Notice the use of the term whole. The set must be perceived and worked with as a single whole, without distinguishing its individual elements. Iterative processing violates this principle, because file or cursor records are handled sequentially, one at a time.

A table in SQL represents (though not entirely faithfully) a relation from the relational model. When you interact with tables using set-based queries, you work with the table as a whole rather than with its individual rows (the relation's tuples), both in how you phrase the declarative SQL query and in how you think about the operation. This way of thinking does not come easily to many.

Order

Note that nowhere in the definition of the set is the order of the elements mentioned. There is a good reason for this - there is no order among the elements of the set. This is another feature that takes time to get used to. In files and cursors, elements are always arranged in a certain order, and when you receive records one at a time, you can expect to maintain this order.

A table has no row order, because a table is a set. Those who don't understand this often confuse the logical layer of the data model and language with the physical implementation layer. They assume that if a table has a certain index, a query against the table is implicitly guaranteed to access the data in index order, and sometimes they even build a solution's logic on that assumption. SQL Server clearly does not guarantee this. The only way to guarantee a specific order of rows in a query result is to add an ORDER BY clause to the query. And if you do add one, you need to understand that the result is no longer relational, precisely because it has a guaranteed order.

If you need to write SQL queries and you want to understand the language, you need to think in terms of sets. Here window functions can help bridge the gap between the iterative mindset (one row at a time, in a certain order) and the set-based one (treating the set as a whole, with no order). The original design of window functions eases the transition from the old thinking to the new.

By the way, window functions support an ORDER BY clause for the situations where ordering needs to be specified. Specifying order in the function does not violate relational principles, though. The query's input is relational, with no ordering expectations, and so is its output, which likewise carries no ordering guarantee. The ordering is part of the specification of the calculation, used only to compute the result attribute of the output relation; there is no guarantee that the rows will be returned in the order the window function used.

In fact, different window functions in the same query can specify different orderings. This ordering has nothing to do, at least conceptually, with the presentation ordering of the query.
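A quick sketch of that point: the two RANK calls below use different orderings that coexist in one query, and neither of them dictates the order in which the rows are returned:

SELECT orderid, orderdate, val,
  RANK() OVER(ORDER BY val DESC) AS rank_by_val,
  RANK() OVER(ORDER BY orderdate) AS rank_by_date
FROM Sales.OrderValues;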

In the figure below, I try to show that both the input and the output of a window function are relational, even though ordering is specified in the window function's definition. The ovals, and the different order of rows in the input and the output, are meant to express that the order of the rows does not matter.

There is another feature of window functions that helps to gradually move from iterative thinking to thinking in terms of sets. Sometimes when explaining a new topic, teachers have to resort to a certain amount of "lies". Imagine that you are a teacher who understands that the consciousness of students is not yet ready to master a certain idea, if it is immediately stated in full. Sometimes it is better to start explaining the idea in simpler and not quite correct terms - this approach allows you to prepare students for mastering the material. Some time later, when students are "ripe" for understanding the "truth", you can move on to a deeper and more correct presentation of the material.

This is the situation with explaining how window functions are computed. There is a basic version of the explanation; it is not entirely correct, but it gets you to the correct result. It relies on processing one row at a time from an ordered list. Later comes a deeper, fundamentally more correct way of explaining the principle, but the reader has to be prepared before it is presented. That second way takes a set-based approach.

To see what I mean, look at the following query:

SELECT orderid, orderdate, val,
  RANK() OVER(ORDER BY val DESC) AS [rank]
FROM Sales.OrderValues;

The basic, row-at-a-time way to think about how the rank is computed is:

sort the rows by val
iterate over the rows
for each row
  if the current row is the first row in the partition, return 1
  else if val is equal to val in the previous row, return the previous row's rank
  else return the count of rows processed so far

 

Although this way of thinking yields the correct result, it is not entirely correct. To be honest, my task is complicated by the fact that this process closely resembles how rank calculation is physically implemented in SQL Server. But at this stage my concern is not the physical implementation but the conceptual level: the language and the logical model. By "incorrect thinking" I mean that, conceptually, in language terms, the computation is thought of differently: in terms of sets, not iterations. Remember that the language has nothing to do with a specific physical implementation in the database engine. The physical layer's job is to figure out how to execute the logical query and return the correct result as quickly as possible.

Now I will try to explain what I mean by the deeper, correct interpretation of window functions in the language. The function logically defines a separate, independent window for each row of the query result set. If the window definition contains no restrictions, each window initially consists of all rows of the query result set. You can add elements to the window definition (such as partitioning and framing; more on this shortly) that further restrict the rowset of each window.

Conceptually, the OVER clause creates a separate window for each window function and each row in the query result set. In our query we did not restrict the window definition at all; we only specified ordering. So in our case every window consists of all rows of the result set, and all of these windows coexist at the same time. In each of them, the rank is computed as one more than the number of rows whose val value is greater than the current row's val.
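To make the set-based interpretation concrete, here is a sketch that expresses the same definition literally with a subquery, counting the rows of each row's window whose val is greater (this is the conceptual model, not how SQL Server actually executes RANK):

SELECT orderid, orderdate, val,
  (SELECT COUNT(*)
   FROM Sales.OrderValues AS O2
   WHERE O2.val > O1.val) + 1 AS rnk
FROM Sales.OrderValues AS O1;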

You probably see that many people find it easier to think in the basic terms: assume the data is ordered and a process iterates over the rows, one at a time. That is fine to start with, as long as it lets you write correct queries, at least simple ones. Over time you can move toward a deeper understanding of the conceptual design of window functions and begin to think in terms of sets.

Disadvantages of alternatives to window functions

Window functions have a number of advantages over the alternative, more traditional ways of performing similar tasks, such as grouped queries, subqueries, and others. I'll show some simple examples here; there are other differences besides these, but it is too early to discuss them.

I'll start with traditional grouped queries. They give you new information in the form of aggregates, but they also lose something: the detail.

When grouping data, you are forced to apply all calculations in the context of the group. But what if you need calculations that involve both the detail and the aggregates? Imagine that you want to query the Sales.OrderValues view and compute each order's size as a percentage of the customer's total, as well as the difference from the customer's average order size. The current order's size is detail; the total and the average are aggregates. If you group the data by customer, you lose access to the individual order values.

One way to do this with grouped queries is to write a query that groups the data by customer, define a table expression based on it, and then join the table expression to the base table to match the detail with the aggregates. Here's a query implementing this logic:

USE TSQL2012;

WITH Aggregates AS
(
  SELECT custid, SUM(val) AS sumval, AVG(val) AS avgval
  FROM Sales.OrderValues
  GROUP BY custid
)
SELECT O.orderid, O.custid, O.val,
  CAST(100. * O.val / A.sumval AS NUMERIC(5, 2)) AS pctcust,
  O.val - A.avgval AS diffcust
FROM Sales.OrderValues AS O
  JOIN Aggregates AS A
    ON O.custid = A.custid;

Now imagine that you also need to calculate the order size as a percentage of the total amount of orders, as well as a deviation from the average for all customers. To solve this problem, you will have to add another table expression:

USE TSQL2012;

WITH CustAggregates AS
(
  SELECT custid, SUM(val) AS sumval, AVG(val) AS avgval
  FROM Sales.OrderValues
  GROUP BY custid
),
GrandAggregates AS
(
  SELECT SUM(val) AS sumval, AVG(val) AS avgval
  FROM Sales.OrderValues
)
SELECT O.orderid, O.custid, O.val,
  CAST(100. * O.val / CA.sumval AS NUMERIC(5, 2)) AS pctcust,
  O.val - CA.avgval AS diffcust,
  CAST(100. * O.val / GA.sumval AS NUMERIC(5, 2)) AS pctall,
  O.val - GA.avgval AS diffall
FROM Sales.OrderValues AS O
  JOIN CustAggregates AS CA
    ON O.custid = CA.custid
  CROSS JOIN GrandAggregates AS GA;

As you can see, the query is getting more and more complex, including more and more table expressions and join operations.

Alternatively, you can solve the problem with a separate subquery for each calculation. Here's a solution based on subqueries:

USE TSQL2012;

-- Subqueries: detail plus per-customer aggregates
SELECT orderid, custid, val,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Sales.OrderValues AS O2
                     WHERE O2.custid = O1.custid) AS NUMERIC(5, 2)) AS pctcust,
  val - (SELECT AVG(O2.val)
         FROM Sales.OrderValues AS O2
         WHERE O2.custid = O1.custid) AS diffcust
FROM Sales.OrderValues AS O1;

-- Subqueries: detail plus per-customer and grand aggregates
SELECT orderid, custid, val,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Sales.OrderValues AS O2
                     WHERE O2.custid = O1.custid) AS NUMERIC(5, 2)) AS pctcust,
  val - (SELECT AVG(O2.val)
         FROM Sales.OrderValues AS O2
         WHERE O2.custid = O1.custid) AS diffcust,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Sales.OrderValues AS O2) AS NUMERIC(5, 2)) AS pctall,
  val - (SELECT AVG(O2.val)
         FROM Sales.OrderValues AS O2) AS diffall
FROM Sales.OrderValues AS O1;

There are two main problems with this approach. First, the code gets long and complex. Second, the SQL Server optimizer currently doesn't recognize that multiple subqueries need to access the same rowset, so each subquery triggers a separate visit to the data. The more subqueries there are, the more times the same data is read. Unlike the previous issue, this is not a language problem but a limitation of the specific implementation of subqueries in SQL Server.

As you may recall, the idea behind a window function is to define a window, or set, of rows for the function to operate on. Aggregate functions are meant to be applied to sets of rows, so the windowing concept should suit them well as an alternative to grouped queries and subqueries. And after computing an aggregate window function, the detail is not lost. You use the OVER clause to define the function's window.

For example, to calculate the sum of all the values in a query result set, you can simply use the following:

SUM(val) OVER()

If you do not restrict the window in any way (empty parentheses), your starting point is the query result set.

To calculate the sum of all the values in the result set whose customer ID is the same as the current row, use the partitioning capabilities in the window functions (I'll talk about it later) and partition the window by custid as follows:

SUM(val) OVER(PARTITION BY custid)

Note that the term partitioning implies filtering, not grouping.
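To see the difference at a glance, compare the shapes of the two results in this sketch: grouping collapses the rows to one per customer, while partitioning keeps every detail row:

-- One row per customer
SELECT custid, SUM(val) AS sumval
FROM Sales.OrderValues
GROUP BY custid;

-- Every detail row kept, with the customer total attached
SELECT custid, orderid, val,
  SUM(val) OVER(PARTITION BY custid) AS sumval
FROM Sales.OrderValues;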

Here's how to use window functions to write a query that returns both detail and per-customer aggregates: each order's size as a percentage of the customer's total, and the deviation from the customer's average order size:

USE TSQL2012;

SELECT orderid, custid, val,
  CAST(100. * val / SUM(val) OVER(PARTITION BY custid) AS NUMERIC(5, 2)) AS pctcust,
  val - AVG(val) OVER(PARTITION BY custid) AS diffcust
FROM Sales.OrderValues;

And here is a query that additionally returns each order's size as a percentage of the grand total of all orders, along with the deviation from the grand average:

SELECT orderid, custid, val,
  CAST(100. * val / SUM(val) OVER(PARTITION BY custid) AS NUMERIC(5, 2)) AS pctcust,
  val - AVG(val) OVER(PARTITION BY custid) AS diffcust,
  CAST(100. * val / SUM(val) OVER() AS NUMERIC(5, 2)) AS pctall,
  val - AVG(val) OVER() AS diffall
FROM Sales.OrderValues;

Notice how much simpler and more concise the window function query looks. As for optimization, the SQL Server optimizer includes logic for detecting multiple functions with the same window definition. When it finds them, it reads the required data only once (whatever kind of read that is). For example, in the last query SQL Server accesses the data once to compute the first two functions (the sum and average partitioned by custid), and once more to compute the last two (the non-partitioned sum and average). I'll demonstrate this optimization principle later.

Another advantage of window functions over subqueries is that the initial window, before any restrictions, is the query's result set. That means the result set after table operators (such as joins), filters, grouping, and so on have been applied, which is a consequence of the phase of logical query processing in which window functions are evaluated.

A subquery, on the other hand, starts from scratch rather than from the outer query's result set. So if you want the subquery to operate on the same set as the outer query, it has to repeat all the same query constructs the outer query uses. As an example, suppose you want the percentage of the total and the deviation from the average computed only over orders placed in 2007. With window functions it is enough to add a single filter to the query:

USE TSQL2012;

SELECT orderid, custid, val,
  CAST(100. * val / SUM(val) OVER(PARTITION BY custid) AS NUMERIC(5, 2)) AS pctcust,
  val - AVG(val) OVER(PARTITION BY custid) AS diffcust,
  CAST(100. * val / SUM(val) OVER() AS NUMERIC(5, 2)) AS pctall,
  val - AVG(val) OVER() AS diffall
FROM Sales.OrderValues
WHERE orderdate >= '20070101'
  AND orderdate < '20080101';

The starting point for all the window functions is the set remaining after this filter is applied. In the subquery solution, however, you start from scratch, and the filter has to be repeated in every subquery:

USE TSQL2012;

SELECT orderid, custid, val,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Sales.OrderValues AS O2
                     WHERE O2.custid = O1.custid
                       AND orderdate >= '20070101'
                       AND orderdate < '20080101') AS NUMERIC(5, 2)) AS pctcust,
  val - (SELECT AVG(O2.val)
         FROM Sales.OrderValues AS O2
         WHERE O2.custid = O1.custid
           AND orderdate >= '20070101'
           AND orderdate < '20080101') AS diffcust,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Sales.OrderValues AS O2
                     WHERE orderdate >= '20070101'
                       AND orderdate < '20080101') AS NUMERIC(5, 2)) AS pctall,
  val - (SELECT AVG(O2.val)
         FROM Sales.OrderValues AS O2
         WHERE orderdate >= '20070101'
           AND orderdate < '20080101') AS diffall
FROM Sales.OrderValues AS O1
WHERE orderdate >= '20070101'
  AND orderdate < '20080101';
 

Of course there are workarounds, such as defining a common table expression (CTE) based on the query that does the filtering and then referencing the CTE from both the outer query and the subqueries. My point, though, is that window functions need no tricks, because they operate on the query's result set.
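For completeness, here is a sketch of that CTE workaround, showing only the per-customer percentage (the remaining columns follow the same pattern):

WITH Orders2007 AS
(
  SELECT orderid, custid, val
  FROM Sales.OrderValues
  WHERE orderdate >= '20070101' AND orderdate < '20080101'
)
SELECT orderid, custid, val,
  CAST(100. * val / (SELECT SUM(O2.val)
                     FROM Orders2007 AS O2
                     WHERE O2.custid = O1.custid) AS NUMERIC(5, 2)) AS pctcust
FROM Orders2007 AS O1;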

As mentioned earlier, window functions lend themselves well to optimization, while the alternative solutions are often nearly impossible to optimize. Naturally, there are also situations where the opposite is true.

Window functions

Window functions are defined in the ISO SQL standard. The DBMSs of the MS SQL Server family provide ranking and statistical (aggregate) window functions. A window is a user-defined rowset, and a window function computes a value for each row of the result set based on the rows of its window.

Window functions can be used to compute cumulative, sliding, and centered aggregates. They return a value for each row of a table that depends on the other rows in the corresponding window. They can be used only in the SELECT and ORDER BY clauses of a query. Typically, window functions provide access to more than one row of a table without a self-join.
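One consequence of the SELECT/ORDER BY restriction is that a window function cannot appear in the WHERE clause; filtering on its result requires a CTE or derived table. A minimal sketch, reusing the sample view from earlier sections, that returns each customer's largest order:

WITH C AS
(
  SELECT orderid, custid, val,
    ROW_NUMBER() OVER(PARTITION BY custid ORDER BY val DESC) AS rn
  FROM Sales.OrderValues
)
SELECT orderid, custid, val
FROM C
WHERE rn = 1;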

The OVER clause

The OVER clause defines the partitioning and ordering of a rowset before the corresponding window function is applied. Both aggregate (statistical) functions (SUM, AVG, MAX, MIN, COUNT) and ranking functions can be used as window functions. Each of the ranking functions ROW_NUMBER, DENSE_RANK, RANK, and NTILE uses the OVER clause (see the corresponding section of this lecture).

Syntax:

  • For ranking window functions: <OVER_CLAUSE> ::= OVER ( [PARTITION BY value_expression, ... [n]] <ORDER BY clause> )
  • For aggregate window functions: <OVER_CLAUSE> ::= OVER ( [PARTITION BY value_expression, ... [n]] )

PARTITION BY splits the result set into partitions. The window function is applied to each partition separately, and the calculation starts anew for each partition. If this clause is omitted, the function interprets the entire result set as a single group.

value_expression specifies the column by which the rowset produced by the FROM clause is partitioned. It can reference only columns made available by the FROM clause; it cannot reference expressions or aliases from the select list. A value_expression can be a column expression, a scalar subquery, a scalar function, or a user-defined variable.

The ORDER BY clause specifies the ordering for a ranking window function. When used in the context of a ranking window function, it can refer only to columns made available by the FROM clause, and an integer cannot be used to indicate the position of a column name or alias in the select list. The ORDER BY clause cannot be used with aggregate window functions.

A single query with a single FROM clause can use multiple statistical or ranking window functions, and the OVER clause of each function can apply its own partitioning and ordering. The OVER clause cannot be used with the CHECKSUM aggregate function.

The semantics of null values of window functions correspond to the semantics of null values of SQL aggregate functions.

Statistical window functions

Let's show an example of how statistical window functions are used.

Example 16.4. Statistical window functions.

Suppose the data warehouse has a fact table of order line items (OrderDetail), containing the order number (OrderID), product identifier (ProductID), quantity of goods (OrderQty), and price of the goods (Price).

For the line items of the two orders 43659 and 43664, let us calculate, for each item sold, the order's total quantity of goods sold, the average quantity per item, the number of items, and the minimum and maximum quantities sold.

The following query solves the task using window functions.

SELECT OrderID, ProductID, OrderQty,
  SUM(OrderQty) OVER(PARTITION BY OrderID) AS 'Total',
  AVG(OrderQty) OVER(PARTITION BY OrderID) AS 'Average',
  COUNT(OrderQty) OVER(PARTITION BY OrderID) AS 'Quantity',
  MIN(OrderQty) OVER(PARTITION BY OrderID) AS 'Min',
  MAX(OrderQty) OVER(PARTITION BY OrderID) AS 'Max'
FROM OrderDetail
WHERE OrderID IN (43659, 43664);
GO

The result of the query is shown below.

Output 3.

OrderID  ProductID  OrderQty  Total  Average  Quantity  Min  Max
43659    776        1         26     2        12        1    6
43659    777        3         26     2        12        1    6
43659    778        1         26     2        12        1    6
43659    771        1         26     2        12        1    6
43659    772        1         26     2        12        1    6
43664    772        1         14     1        8         1    4
43664    775        4         14     1        8         1    4
43664    714        1         14     1        8         1    4

Now suppose management asks us to calculate the percentage each item contributes to the goods sold on order 43659. The following query uses window functions to solve the task.

SELECT OrderID, ProductID, OrderQty,
  SUM(OrderQty) OVER(PARTITION BY OrderID) AS 'Total',
  CAST(1. * OrderQty / SUM(OrderQty) OVER(PARTITION BY OrderID) * 100
       AS DECIMAL(5, 2)) AS 'Percentage of item sold'
FROM OrderDetail
WHERE OrderID = 43659;

The result of the query is shown below.

Output 4.

OrderID  ProductID  OrderQty  Total  Percentage of item sold
43659    776        1         26     3.85
43659    777        3         26     11.54
43659    778        1         26     3.85
43659    771        1         26     3.85
43659    772        1         26     3.85

As these examples show, using the OVER clause is more efficient than using subqueries. The use of window ranking functions is discussed in the next section.

Window function syntax

The general syntax for calling a window function is as follows:

window_function_name(expression) OVER ( [partition_definition] [order_definition] [frame_definition] )

In this syntax:

  • First, specify the name of the window function, followed by its expression.
  • Second, specify the OVER clause, which has three possible elements: a partition definition, an order definition, and a frame definition.

The opening and closing parentheses after the OVER keyword are mandatory, even when there is nothing inside them, for example:

window_function_name(expression) OVER()

partition_clause syntax

The partition_clause breaks the rows into parts, or partitions. Two partitions are separated by a partition boundary.

The window function operates within each partition and is reinitialized when a partition boundary is crossed.
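A sketch of that reinitialization, reusing the table and columns from the earlier examples: the numbering below restarts at 1 every time a new custid partition begins:

SELECT custid, orderid,
  ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderid) AS rownum_in_cust
FROM Sales.OrderValues;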

 

The partition_clause syntax is as follows:

PARTITION BY <expression> [{, <expression>} ...]

You can specify one or more expressions in a PARTITION BY clause. Multiple expressions are separated by commas.

 

order_by_clause syntax

order_by_clause has the following syntax:

ORDER BY <expression> [ASC | DESC] [{, <expression> [ASC | DESC]} ...]

The ORDER BY clause specifies how the rows within a partition are ordered. You can order the data within a partition by multiple keys, each key defined by an expression; multiple expressions are again separated by commas.

Like the PARTITION BY clause, the ORDER BY clause is supported by all window functions. However, it only makes sense to use ORDER BY with order-sensitive window functions.

 

frame_clause syntax

The frame is a subset of the current partition. To define a subset, you use the frame clause as follows:

frame_unit {<frame_start> | <frame_between>}

A frame is defined relative to the current row, which allows a frame to move within a section based on the position of the current row in its section.

The frame unit defines the type of relationship between the current row and the frame rows. It can be ROWS or RANGE: with ROWS, the offsets between the current row and the frame rows are row numbers; with RANGE, they are row values.

frame_start and frame_between define frame boundaries.

frame_start can be one of the following:

  • UNBOUNDED PRECEDING: the frame starts at the first row of the partition.
  • N PRECEDING: the frame starts N rows physically before the current row. N can be a numeric literal or an expression that evaluates to a number.
  • CURRENT ROW: the row currently being evaluated.

frame_between is as follows:

BETWEEN frame_boundary_1 AND frame_boundary_2

frame_boundary_1 and frame_boundary_2 can each be one of the following:

  • frame_start: as described earlier.
  • UNBOUNDED FOLLOWING: the frame ends at the last row of the partition.
  • N FOLLOWING: the frame ends N rows physically after the current row.

If you do not specify frame_definition in the OVER clause, MySQL uses the following frame by default:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
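As an illustration of an explicit frame, here is a sketch of a three-row centered moving average (the table and ordering columns are again just an illustrative choice):

SELECT orderid, orderdate, val,
  AVG(val) OVER(ORDER BY orderdate, orderid
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS centered_avg
FROM Sales.OrderValues;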

MySQL. List of window functions

The following table shows the window functions in MySQL:

Name          Description
CUME_DIST     Calculates the cumulative distribution of a value within a set of values.
DENSE_RANK    Assigns a rank to each row within its partition based on the ORDER BY clause. Rows with equal values receive the same rank, and there are no gaps in the sequence of ranks when two or more rows tie.
FIRST_VALUE   Returns the value of the specified expression for the first row of the window frame.
LAG           Returns the value from the Nth row before the current row in the partition; returns NULL if no preceding row exists.
LAST_VALUE    Returns the value of the specified expression for the last row of the window frame.
LEAD          Returns the value from the Nth row after the current row in the partition; returns NULL if no subsequent row exists.
NTH_VALUE     Returns the value of the argument from the Nth row of the window frame.
NTILE         Distributes the rows of each window partition into a specified number of ranked groups.
PERCENT_RANK  Calculates the percentile rank of a row within a partition or result set.
RANK          Similar to DENSE_RANK(), except that there are gaps in the sequence of ranks when two or more rows tie.
ROW_NUMBER    Assigns a sequential integer to each row within its partition.
 

Window functions in MariaDB

Window functions were added to the ANSI/ISO SQL:2003 standard and later extended in the ANSI/ISO SQL:2008 standard. DB2, Oracle, Sybase, PostgreSQL, and other products have had full implementations for years; other vendors added support later. For example, when Microsoft introduced window functions in SQL Server 2005, it included only a few, namely ROW_NUMBER, RANK, NTILE, and DENSE_RANK. Only SQL Server 2012 implemented the full range of window functions.

After numerous requests, window functions were finally introduced in MariaDB 10.2.0. MariaDB now includes window functions such as ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, COUNT, SUM, AVG, BIT_OR, BIT_AND, and BIT_XOR.

T-SQL Windowing and Performance

Window functions make many queries easier to write, and they often perform better than older techniques. For example, using the LAG function is significantly better than using a self-join. However, to achieve good performance overall, you need to understand the concepts behind window functions and how they use sorting to produce results.

OVER clause and sorting

There are two constructs in the OVER clause that can lead to sorting: ORDER BY and PARTITION BY. PARTITION BY is supported by all window functions but is optional; ORDER BY is mandatory for most functions. Depending on what you are trying to accomplish, the data will be sorted according to the OVER clause, and that sort can become the performance bottleneck of your query.
For example, say you want to apply the ROW_NUMBER function in SalesOrderID order. The result will differ from applying the function in descending order of TotalDue:
USE AdventureWorks2017; --or another version that you have installed
GO

SELECT SalesOrderID, TotalDue,
  ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS RowNum
FROM Sales.SalesOrderHeader;

SELECT SalesOrderID, TotalDue,
  ROW_NUMBER() OVER(ORDER BY TotalDue DESC) AS RowNum
FROM Sales.SalesOrderHeader;
Because the first query uses the clustered index key in the ORDER BY of the OVER clause, no sorting is required.


The execution plan for the second query contains an expensive Sort operation.

ORDER BY in the OVER clause has nothing to do with the ORDER BY clause of the query as a whole. The following example shows what happens when the two differ:
SELECT SalesOrderID, TotalDue,
  ROW_NUMBER() OVER(ORDER BY TotalDue DESC) AS RowNum
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID;
The clustered index key is SalesOrderID, but the rows must first be sorted by TotalDue in descending order and then again by SalesOrderID. Take a look at the plan:


The PARTITION BY clause can also cause sorting. It is similar, though not entirely equivalent, to the GROUP BY clause in aggregate queries. The following query numbers the rows for each customer.
SELECT CustomerID, SalesOrderID, TotalDue,
  ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY SalesOrderID) AS RowNum
FROM Sales.SalesOrderHeader;


The execution plan shows only one Sort operator, on the column combination CustomerID, SalesOrderID.

One way to avoid the performance hit is to create an index specifically for the OVER clause. In his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions, Itzik Ben-Gan recommends a POC index (POC: (P)ARTITION BY, (O)RDER BY, and (C)overing). He recommends placing any columns used for filtering first in the index key, ahead of the PARTITION BY and ORDER BY columns, and then adding any other columns needed to make the index covering as included columns. As always, you should test how such an index affects your query and your overall workload. Of course, you can't add an index for every query you write, but if a performance problem is caused by a specific query that uses a window function, this advice is worth taking into consideration.
Here is an index that improves the previous query:
CREATE NONCLUSTERED INDEX test ON Sales.SalesOrderHeader
(CustomerID, SalesOrderID)
INCLUDE (TotalDue);

Frames (FRAME)

In my opinion, frames are the most difficult concept to grasp when studying window functions. A frame is required in the following cases:

    • Window aggregates with ORDER BY, used to calculate running totals, moving averages, and the like
    • FIRST_VALUE
    • LAST_VALUE

Fortunately, frames aren't always required; unfortunately, it's easy to overlook them and fall back on the default behavior. The default frame is always RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. You will get correct results if the ORDER BY clause contains a unique column or combination of columns, but you will pay a performance penalty.
Here's an example comparing the default frame with the correct, explicit frame:
SET STATISTICS IO ON;
GO

SELECT CustomerID, SalesOrderID, TotalDue,
  SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID) AS RunningTotal
FROM Sales.SalesOrderHeader;

SELECT CustomerID, SalesOrderID, TotalDue,
  SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
FROM Sales.SalesOrderHeader;
The results are the same, but the performance differs greatly. Unfortunately, the execution plan won't tell you the truth in this case; it attributes 50% of the cost to each query:


If you look at the I/O statistics, you will see the difference:

Using the correct window frame matters even more when the ORDER BY list is not unique, or when you use LAST_VALUE. In this example, ORDER BY uses the OrderDate column, and some customers place more than one order per day. If you do not specify the frame, or you use RANGE, the function treats rows with matching dates as part of the same window.
SELECT CustomerID, SalesOrderID, TotalDue, OrderDate,
  SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY OrderDate) AS RunningTotal,
  SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY OrderDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CorrectRunningTotal
FROM Sales.SalesOrderHeader
WHERE CustomerID IN ('11433', '11078', '18758');



The discrepancy arises because RANGE views the data logically, while ROWS views it positionally. There are two solutions to the problem. The first is to make the ORDER BY list unique. The second, and more important, is to always specify the frame wherever it is supported.
Another area where frames cause logical problems involves LAST_VALUE, which returns an expression from the last row of the frame. Because the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) extends only to the current row, the last row of the frame is the very row being calculated. Here's an example:
SELECT CustomerID, SalesOrderID, TotalDue,
  LAST_VALUE(SalesOrderID) OVER(PARTITION BY CustomerID
    ORDER BY SalesOrderID) AS LastOrderID,
  LAST_VALUE(SalesOrderID) OVER(PARTITION BY CustomerID
    ORDER BY SalesOrderID
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS CorrectLastOrderID
FROM Sales.SalesOrderHeader
ORDER BY CustomerID, SalesOrderID;



Window aggregates

One of the most useful features of window functions is the ability to add aggregate expressions to non-aggregate queries. Unfortunately, this can be bad for performance. To understand the problem, look at the I/O statistics, where you will see a large number of logical reads. If you need to return values at different levels of detail in a single query over a large number of rows, I recommend one of the older methods, such as common table expressions (CTEs), temporary tables, or even variables. Pre-computing the aggregates before applying the window aggregates is another option. Here is an example demonstrating the difference between a window aggregate and an alternative method:
SELECT SalesOrderID, TotalDue,
  SUM(TotalDue) OVER() AS OverallTotal
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2013;

DECLARE @OverallTotal MONEY;

SELECT @OverallTotal = SUM(TotalDue)
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2013;

SELECT SalesOrderID, TotalDue,
  @OverallTotal AS OverallTotal
FROM Sales.SalesOrderHeader AS SOH
WHERE YEAR(OrderDate) = 2013;


The first query scans the table only once, but it performs 28823 logical reads in a worktable. The second method scans the table twice, but it needs no worktable.
The following example uses a window aggregate applied to an aggregate expression:
SELECT YEAR(OrderDate) AS OrderYear,
  SUM(TotalDue) AS YearTotal,
  SUM(TotalDue) / SUM(SUM(TotalDue)) OVER() * 100 AS PercentOfSales
FROM Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;


When you use a window function in an aggregate query, the expression must follow the same rules that apply to the SELECT and ORDER BY clauses. Here the window function is applied to SUM(TotalDue). It looks like a nested aggregate, but it is really the window function applied to the aggregate.
Because the data was aggregated before the window function was applied, the query shows good performance:


There is one more useful thing to know about window aggregates: if you use multiple expressions with the same OVER clause definition, you won't see an additional performance penalty.
I advise you to use this functionality with caution. It's very convenient, but it doesn't scale as well.
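To illustrate the shared window definition point, the four aggregates in this sketch reuse one OVER definition, so SQL Server can satisfy all of them with a single pass over the partitioned data:

SELECT CustomerID, SalesOrderID, TotalDue,
  SUM(TotalDue) OVER(PARTITION BY CustomerID) AS CustTotal,
  AVG(TotalDue) OVER(PARTITION BY CustomerID) AS CustAvg,
  MIN(TotalDue) OVER(PARTITION BY CustomerID) AS CustMin,
  MAX(TotalDue) OVER(PARTITION BY CustomerID) AS CustMax
FROM Sales.SalesOrderHeader;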

Performance Comparison

Until now, the examples have used the small Sales.SalesOrderHeader table from the AdventureWorks database and focused on execution plans and logical reads. In real life, your customers care about neither the execution plan nor the logical reads; they care how fast the query runs. To see the difference in execution time, I used Adam Machanic's script with some modifications.
The script creates a table named bigTransactionHistory containing over 30 million rows. After running Adam's script, I created two more copies of this table, with 15 and 7.5 million rows respectively. I also enabled the Discard results after execution option in the query results settings so that filling the grid would not affect the run time. I ran each test three times and cleared the buffer cache before each run.
The following is a script to create additional test tables:


SELECT TOP(50) PERCENT *
INTO mediumTransactionHistory
FROM bigTransactionHistory;

SELECT TOP(25) PERCENT *
INTO smallTransactionHistory
FROM bigTransactionHistory;
GO
ALTER TABLE mediumTransactionHistory
ALTER COLUMN TransactionID INT NOT NULL;
GO
ALTER TABLE mediumTransactionHistory
ADD CONSTRAINT pk_mediumTransactionHistory PRIMARY KEY (TransactionID);
GO
ALTER TABLE smallTransactionHistory
ALTER COLUMN TransactionID INT NOT NULL;
GO
ALTER TABLE smallTransactionHistory
ADD CONSTRAINT pk_smallTransactionHistory PRIMARY KEY (TransactionID);
GO
CREATE NONCLUSTERED INDEX IX_ProductId_TransactionDate
ON mediumTransactionHistory
(
  ProductId,
  TransactionDate
)
INCLUDE
(
  Quantity,
  ActualCost
);
CREATE NONCLUSTERED INDEX IX_ProductId_TransactionDate
ON smallTransactionHistory
(
  ProductId,
  TransactionDate
)
INCLUDE
(
  Quantity,
  ActualCost
);

I can't stress enough how important it is to use a frame wherever it is supported. To see the difference, I ran a running-total test using four methods:
    • A cursor-based solution
    • A correlated subquery
    • A window function with the default frame
    • A window function with a ROWS frame

I ran the test against the three new tables. Below are the results in the form of graphs:

With the ROWS frame, the query against the 7.5-million-row table ran in less than a second. Against the 30-million-row table it took about a minute.
Here's a query using the ROWS frame that ran on a table of 30 million rows:
SELECT ProductID, SUM(ActualCost) OVER(PARTITION BY ProductID ORDER BY TransactionDate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
FROM bigTransactionHistory;
I also ran tests to see how window aggregates behave compared to traditional methods. In this case I used only the 30-million-row table, but performed one, two, or three calculations at the same level of detail, and therefore with the same OVER clause. I compared the performance of the window aggregates against a CTE and a correlated subquery.


The window aggregates performed worst, taking about 1.75 minutes in each case. The CTE did better as the number of calculations increased, since the table is read only once for all three calculations. The correlated subquery got worse as the number of calculations grew, because each calculation has to be performed separately, resulting in up to four passes over the table.
Here is the winning query:


WITH Calcs AS
(
  SELECT ProductID,
    AVG(ActualCost) AS AvgCost,
    MIN(ActualCost) AS MinCost,
    MAX(ActualCost) AS MaxCost
  FROM bigTransactionHistory
  GROUP BY ProductID
)
SELECT O.ProductID, ActualCost, AvgCost, MinCost, MaxCost
FROM bigTransactionHistory AS O
  JOIN Calcs ON O.ProductID = Calcs.ProductID;

Conclusion

 

T-SQL window functions have been promoted as a way to significantly improve performance. In my opinion, their main benefit is that they make queries easier to write, and you need to understand them to get good performance. Indexing matters, but you can't create an index for every query. Frames are harder to understand, but they become very important on large tables.