# Trees in SQL

A tree is a special kind of directional graph. Graphs are data structures consisting of nodes connected by arcs. Each arc shows unidirectional communication between two nodes. In an organization chart, nodes are employees, and each arc describes subordinations. In the list of materials, the nodes are modules (ultimately shown down to individual parts), and arcs describe the relationship "made of".

The top of the tree is called the root. In the organization chart, this is the biggest boss; in the list of materials, this is the assembled part. A binary tree is a tree in which a node can have no more than two descendants; In general, an n-dimensional tree is one in which a node can have no more than n nodes of descendants.

Tree nodes that do not have subtrees are called leaves. In the list of materials, these are the minimum parts into which the part can be disassembled. The descendants, or children, of the parent node are all nodes in the subtree that has the parent node rooted.

Trees are often depicted as diagrams. (See Figure 1) Another way to represent trees is to show them as nested sets (see Figure 2); This is the basis for my SQL representation of trees as nested sets.

In SQL, any relationship is explicitly described by the data. A typical way to represent trees is to place an adjacency matrix in a table. That is, one column is the parent node, and the other column in the same row is the child node (the pair is an arc in the graph). For example, consider an organizational chart of a company with six employees:

CREATE TABLEPersonnel( empCHAR(20) PRIMARY KEY, bossCHAR(20) REFERENCESPersonnel(emp), salaryDECIMAL(6,2) NOT NULL);Personnel:emp boss salary ========================== 'Jerry' NULL 1000.00 'Bert' 'Jerry' 900.00 'Chuck' 'Jerry' 900.00 'Donna' 'Chuck' 800.00 'Eddie' 'Chuck' 700.00 'Fred' 'Chuck' 600.00

This model has advantages and disadvantages. The PRIMARY KEY is emp, but the boss column is functionally dependent on it, hence we have problems with normalization. REFERENCES will not give you the opportunity to indicate to the boss who is not an employee. However, what happens when 'Jerry' changes the name to 'Geraldo' to get a TV talk show? You should also make cascading changes to the 'Bert' and 'Chuck' lines.

Another disadvantage of this model is that it is difficult to deduce the path. To find the boss name for each employee, use a self-connecting query, such as:

SELECTB1.emp, 'bosses', E1.empFROMPersonnelASB1, PersonnelASE1WHEREB1.emp = E1.boss;

But something is missing here. This request only gives you immediate supervisors. Your boss's boss also has power over you, and so on up the tree. To display two levels in the tree, you need to write a more complex self-association query, such as:

SELECTB1.emp, 'bosses', E2.empFROMPersonnelASB1, PersonnelASE1, PersonnelASE2WHEREB1.emp = E1.bossANDE1.emp = E2.boss;

To go more than two levels deeper in the tree, simply expand the sample:

SELECTB1.emp, 'bosses', E3.empFROMPersonnelASB1, PersonnelASE1, PersonnelASE2, PersonnelASE3WHEREB1.emp = E1.bossANDE1.emp = E2.boss AND E2.emp = E3.boss;

Unfortunately, you have no idea how deep the tree is, so you should keep expanding this query until you end up with an empty set.

The leaves have no descendants. In this model, they are fairly easy to find: These are employees who are not the boss of anyone else in the company:

SELECT*FROMPersonnelASE1WHERE NOT EXISTS(SELECT*FROMPersonnelASE2WHEREE1.emp = E2.boss);

At the root of the boss tree is NULL:

SELECT*FROMPersonnelWHEREbossIS NULL;

Real problems arise when you try to calculate values up and down the tree. As an exercise, write a query summarizing the salary of each employee and his/her subordinates; result:

Total Salariesemp boss salary ========================== 'Jerry' NULL 4900.00 'Bert' 'Jerry' 900.00 'Chuck' 'Jerry' 3000.00 'Donna' 'Chuck' 800.00 'Eddie' 'Chuck' 700.00 'Fred' 'Chuck' 600.00

## Multiple tree model.

Another way to represent trees is to show them as nested sets. This is a more suitable model, because .SQL is a language focused on sets. The root of a tree is a set containing all other sets, and the ancestor-child relationship is described by the belonging of the set of descendants to the set of ancestors.

There are several ways to convert an organization chart to nested sets. One way is to imagine that you are moving subordinate "ovals" inside their parents, using the edge lines as ropes. The root is the largest oval and contains all the other nodes. The leaves are the innermost ovals, containing nothing inside, and the nesting corresponds to hierarchical relationships. This is a natural representation of the "list of materials" model, because the final block is made physically from nested components, and is disassembled into separate parts.

Another approach is to imagine a small worm crawling along the "nodes and arcs" of a tree. The worm starts from above, with the root, and makes a complete trip around the tree.

But now let's imagine a stronger worm with a counter that starts with one. When the worm arrives at the node, it places a number in the cell on the side it visited and increments the counter. Each node will receive two numbers, one for the right side and one for the left side.

This produces predictable results that you can use to generate queries. The Personnel table looks like this, with left and right numbers in the form:

CREATE TABLEPersonnel( empCHAR(10) PRIMARY KEY, salaryDECIMAL(6,2) NOT NULL, leftINTEGER NOT NULL, rightINTEGER NOT NULL);Personnelemp salary left right ============================== 'Jerry' 1000.00 1 12 'Bert' 900.00 2 3 'Chuck' 900.00 4 11 'Donna' 800.00 5 6 'Eddie' 700.00 7 8 'Fred' 600.00 9 10

The root always has 1 in the left column and twice the number of nodes (2*n) in the right column. It's easy to understand: the worm must visit each node twice, once on the left side and once on the right side, so the final number must be twice the number of nodes in the entire tree.

In the nested sets model, the difference between left and right leaf values is always 1. Imagine a worm turning around a leaf as it crawls across the tree. Therefore, you can find all the leaves by the following simple query:

SELECT*FROMPersonnelWHERE(right - left) = 1;

You can use this trick to speed up queries: build a unique index on the left column, then rewrite the query to take advantage of the index:

SELECT*FROMPersonnelWHEREleft = (right - 1);

The reason for the performance gain is that SQL can use the index on the left column when it is not corrupted in the expression. Do not use (left - right) = 1, because it allows you to take advantage of the index.

In the nested-multiplier model, paths are shown as nested sets, which are represented by nested set numbers and BETWEEN predicates. For example, to identify all the bosses of a particular employee, you need to write:

SELECT:myworker, B1.emp, (right - left)ASheightFROMPersonnel AS B1, PersonnelASE1WHEREE1.left BETWEEN B1.leftANDB1.rightANDE1.right BETWEEN B1.leftANDB1.rightANDE1.emp = :myworker;

The higher the height, the further down the hierarchy the boss is from the employee. The nested sets model uses the fact that each containing other sets is larger in size (where size = (right - left)) than the sets it contains. Obviously, the root will always have the largest size.

The level, the number of arcs between two given nodes, is fairly simple to calculate. For example, to find the levels between a given worker and a manager, you could use:

SELECTE1.emp, B1.emp,COUNT(*) - 1ASlevelsFROMPersonnelASB1, PersonnelASE1WHEREE1.leftBETWEENB1.leftANDB1.rightANDE1.rightBETWEENB1.leftANDB1.rightANDE1.node = :myworkerANDB1.node = :mymanager;

(COUNT(*) - 1) is used to remove a node's double index directly as being at another level, because a node is zero levels removed from itself.

You can build other queries from this template. For example, to find the common bosses of two employees, combine paths and find nodes that have (COUNT(*) > 1). To find the closest common ancestors of the two nodes, combine paths, find nodes that have (COUNT(*) > 1), and select with the lowest depth.

Figure 1. | |

The top of the tree is called the root. Tree nodes that do not have subtrees are called leaves. The descendants of the parent node are the nodes in the subdents that have the root parent node. | |

Figure 2. | |

Another way to represent trees is to show them as nested sets. This is a more suitable model, because .SQL is a language focused on sets. The root of a tree is a set containing all other sets, and the ancestor-child relationship is described by the belonging of the set of descendants to the set of ancestors. |

I assume you have a March 1996 article in front of you, so I won't repeat myself. The multiple tree model has some properties that I didn't mention last month. But first, let's create a table (see Listing 1) to store personnel information. I will refer to this table everywhere in the rest of this article.

The tree in Figure 1 is represented as A) a graph and B) nested sets. The direction is shown by attachment, that is, you know that someone is a subordinate of someone else in the hierarchy of the company, if that left and right numbers of the set of this person - between those of their bosses.

Another property that I didn't mention last time is that descendants are ordered, i.e. you can use the numbers of the elements of the set to arrange the descendants. This property is absent in the model of the adjacency matrix, which has no ordering among descendants. You can use this fact when inserting, updating, or deleting nodes in the tree.

One of the properties of a tree is that it is a graph without loops. That is, no path closes to catch you in an endless loop when you follow them in a tree. Another property is that there is always a path from the root to any other node in the tree. In the nested sets model, paths are shown as nested sets, which are represented by set numbers and BETWEEN predicates. For example, to find out all the managers to whom the worker should report, you can write:

SELECT'Mary', P1.emp, (P1.rgt - P1.lft)ASsizeFROMPersonnelASP1, PersonnelASP2WHEREP2.emp = 'Mary'ANDP2.lftBETWEENP1.lftANDP1.rgt; Mary emp size ==== === ==== Mary Albert 27 Mary Charles 13 Mary Fred 9 Mary Jim 5 Mary Mary 1

Note that when size = 1, you are dealing with Mary as her own boss. You can rule out this case.

The nested set model uses the fact that each set has a larger size (size = right - left) than the sets it contains. Obviously, the root will always have the largest size. JOIN and ORDER BY are not needed in nested sets models like adjacency graph models. Plus, the results do not depend on the order in which the rows are displayed.

Node level - The number of arcs between the node and the root. You can calculate the node level with the following query:

SELECTP2.emp,COUNT(*)ASlevelFROMPersonnelASP1, PersonnelASP2WHEREP2.lftBETWEENP1.lft AS P2GROUPBY P2.emp;

This query finds levels among managers, as follows:

emp level === ===== Albert 1 Bert 2 Charles 2 Diane 2 Edward 3 Fred 3 George 3 Heidi 3 Igor 4 Jim 4 Kathy 4 Larry 4 Mary 5 Ned 5

In some books on graph theory, the root has a zero level instead of the first. If you like this agreement, use the expression "(COUNT(*)-1)".

Self-association in combination with the predicate BETWEEN is the main template for other queries.

## Aggregate functions in trees.

Getting a simple amount of salary to the manager's subordinates works on the same principle. Note that this total amount will also include the boss's salary:

SELECTP1.emp,SUM(P2.salary)ASpayrollFROMPersonnelASP1, PersonnelASP2WHEREP2.lftBETWEENP1.lftANDP1.rgtGROUP BYP1.emp; emp payroll === ======= Albert 7800.00 Bert 1650.00 Charles 3250.00 Diane 1900.00 Edward 750.00 Fred 1600.00 George 750.00 Heidi 1000.00 Igor 500.00 Jim 300.00 Kathy 100.00 Larry 100.00 Mary 100.00 Ned 100.00

## Delete subtrees

The next request will take a fired employee as a parameter and delete the subtree below him/her. The catch in this query is that you are using a key, but you have to make the left and right values work. The answer is a set of scalar subqueries:

DELETE FROMPersonnelWHERElftBETWEEN(SELECTlftFROMPersonnelWHEREemp = :downsized)AND(SELECTrgtFROMPersonnelWHEREemp = :downsized);

The problem is that after this query, there are gaps in the sequence of numbers of sets. This does not prevent most tree queries from being made because the attachment property is preserved. This means that you can use the BETWEEN predicate in your queries, but other operations that depend on the density of numbers will not work in the tree at intervals. For example, you will not be able to find leaves using the predicate (right-left=1), and you will not be able to find the number of nodes in a subtree using the left and right values of its root.

Unfortunately, you have lost information that will be very useful in closing those gaps - namely the correct and left numbers of the root of the subtree. Therefore, forget the request, and write instead the procedure:

CREATE PROCEDUREDropTree (downsizedIN CHAR(10)NOT NULL)BEGIN ATOMICDECLAREdropempCHAR(10)NOT NULL;DECLAREdroplftINTEGER NOT NULL;DECLAREdroprgtINTEGER NOT NULL;SELECTemp, lft, rgtINTOdropemp, droplft, droprgtFROMPersonnelWHEREemp = downsized;DELETE FROMPersonnelWHERElftBETWEENdroplftanddroprgt;UPDATEPersonnelSETlft =CASEWHENlft > droplfTHENlft - (droprgt - droplft + 1)ELSElftEND, rgt =CASEWHENrgt > droplftTHENrgt - (droprgt - droplft + 1)ELSErgtEND;END;

The actual procedure should have error handling, but I leave it as an exercise for the reader.

## Delete p a single node

Removing a single node in the middle of a tree is harder than removing full subtrees. When you remove a node in the middle of a tree, you have to decide how to fill the hole. There are two ways. The first method k elevates one of the children to the position of the original node (suppose the father dies and the eldest son takes over the business, as shown in Figure 2). The oldest child is always shown as the leftmost child node under the parent. However, there is a problem with this operation. If the older child has children of his own, you have to decide how to handle them, and so on down the tree until you get to the leaf.

The second method for removing a single node in the middle of the tree is to connect the descendants to the ancestor of the original node (you can say that the mom dies and the children are accepted by the grandmother), as shown in Figure 3. This is obtained automatically in the nested sets model, you just delete the node and its children are already contained in the nodes of its ancestor. However, you have to be careful when closing the gap left by erasing. There is a difference in changing the numbering of the descendants of the remote node and changing the numbering of the nodes on the right. Below is a procedure that does this:

CREATE PROCEDUREDropNode (downsizedIN CHAR(10)NOT NULL)BEGIN ATOMICDECLAREdropempCHAR(10)NOT NULL;DECLAREdroplftINTEGER NOT NULL;DECLAREdroprgtINTEGER NOT NULL;SELECTemp, lft, rgtINTOdropemp, droplft, droprgtFROMPersonnelWHEREemp = downsized;DELETE FROMPersonnelWHEREemp = downsized;UPDATEPersonnelSETlft =CASEWHENlftBETWEENdroplftANDdroprgtTHENlft - 1WHENlft > droprgtTHENlft - 2ELSElftENDrgt =CASEWHENrgtBETWEENdroplftANDdroprgtTHENrgt - 1WHENrgt > droprgtTHENrgt -2ELSErgtEND;WHERElft > droplft;END;

Листинг 1CREATE TABLEPersonnel (empCHAR(10)PRIMARY KEY, salaryDECIMAL(8,2)NOT NULL CHECK(salary > = 0.00), lftINTEGER NOT NULL, rgtINTEGER NOT NULL,CHECK(lft < rgt));INSERT INTOPersonnelVALUES('Albert', 1000.00, 1, 28);INSERT INTOPersonnelVALUES('Bert', 900.00, 2, 5);INSERT INTOPersonnelVALUES('Charles', 900.00, 6, 19);INSERT INTOPersonnelVALUES('Diane', 900.00, 20, 27);INSERT INTOPersonnelVALUES('Edward', 750.00, 3, 4);INSERT INTOPersonnelVALUES('Fred', 800.00, 7, 16);INSERT INTOPersonnelVALUES('George', 750.00, 17, 18);INSERT INTOPersonnelVALUES('Heidi', 800.00, 21, 26);INSERT INTOPersonnelVALUES('Igor', 500.00, 8, 9);INSERT INTOPersonnelVALUES('Jim', 100.00, 10, 15);INSERT INTOPersonnelVALUES('Kathy', 100.00, 22, 23);INSERT INTOPersonnelVALUES('Larry', 100.00, 24, 25);INSERT INTOPersonnelVALUES('Mary', 100.00, 11, 12);INSERT INTOPersonnelVALUES('Ned', 100.00, 13, 14);

Removing a single node in the middle of a tree is harder than removing full subtrees. When you remove a node in the middle of a tree, you have to decide how to fill the hole. There are two ways. The first method k raises one of the children to the position of the original node (suppose the father dies and the oldest son takes over the business. The oldest descendant is always shown as the far-left child node under the parent.

The second method for removing a single node in the middle of the tree is to connect the offspring to the ancestor of the original node (you can say that mommy dies and the children are accepted by the grandmother).

People ask me why I don't show more procedural code in the examples. Right now, ANSI and ISO are trying to negotiate a standard procedural language for triggers and stored procedures called SQL/PSM. However, this standard has not yet been approved, which means that I have to use either my own pseudo-code or choose one manufacturer. I've decided to use English comments for now, but I'll start using SQL/PSM when it's complete.

The trickiest part of tree processing in SQL is finding a way to convert the adjacency matrix model into a nested set model within a pure SQL structure. It would be fairly simple to load the adjacency matrix table into the program, and then use the recursive tree transformation program from the college textbook to build a model of nested sets.

Honestly, such a tree transform could be faster than what I'm going to show you. But I want to do it in pure SQL to prove the following: You can do in a declarative language what you can do in a procedural language. Since this is an instructional exercise, I will explain in painful detail.

The classic approach to solving a problem is to take the simplest case of the problem, and see if you can apply it to more complex cases. If the tree has no nodes, then the transformation is simple - do nothing. If the tree has one node, then the conversion is simple - set the left value to 1 and the right value to 2. The nature of the adjacency matrix is such that you can only move one level at a time, so let's look at a tree with two levels – the root and immediate descendants. The adjacency model table would resemble the following:

CREATE TABLEPersonnel (empCHAR(10)NOT NULL PRIMARY KEY, bossCHAR(10));Personnelemp boss ================= 'Albert' NULL 'Bert' 'Albert' 'Charles' 'Albert' 'Diane' 'Albert'

Let's put the nested set model in its own worksheet:

CREATE TABLEWorkingTree( empCHAR(10), bossCHAR(10), lftINTEGER NOT NULL DEFAULT0, rgtINTEGER NOT NULL DEFAULT0);

From the previous paragraphs of this article, you know that the root of the tree has a left value of 1, and that the right value is twice the number of nodes in the tree. In a worksheet, let the boss column always contain the root key value of the original tree. In reality, this will be the name of the nested set:

INSERT INTOWorkingTree--convert the root nodeSELECTP0.boss, P0.boss, 1, 2 * (SELECTCOUNT(*) + 1FROMPersonnelASP1WHEREP0.boss = P1.boss)FROMPersonnelASP0;

Now, you have to add descendants to the nested sets table. The original boss will remain the same. The order of descendants is the natural order of the key; in this case, emp char(10):

INSERT INTOWorkingTree--convert the childrenSELECT DISTINCTP0.emp, P0.boss, 2 * (SELECT COUNT(DISTINCTemp)FROMPersonnel AS P1WHEREP1.emp < P0.empANDP0.bossIN(P1.emp, P1.boss)), 2 * (SELECT COUNT(DISTINCTemp)FROMPersonnelASP1WHEREP1.emp < P0.empANDP0.bossIN(P1.boss, P1.emp)) + 1FROMPersonnelASP0;

In fact, you can use this procedure to convert an adjacency matrix model into a forest of trees, each of which is a nested set model identified by its root value. Thus, the Albert family tree is a set of rows that have Albert as an ancestor, a Bert family tree is a set of rows that have Bert as an ancestor, and so on. (This concept is illustrated in Figures 1 and 2.

Because the original adjacency matrix table replicates leaf nodes, non-root nodes, in the emp and boss columns, the WorkingTree table duplicates nodes as the root in one tree and as a child in another. The query will also behave strangely with a null value in the boss column of the original table of the adjacency matrix, so you will have to clear the WorkingTree table with the following query:

DELETE FROMWorkingTreeWHEREbossIS NULL ORempIS NULL;

To make these trees merge into one final tree, you need a way to attach the subordinate tree to its ancestor. In procedural language, you could accomplish this with a program that would do the following steps:

- Find the size of the slave tree.
- Find the place where the subordinate tree is inserted into the ancestor tree.
- Push the ancestor tree apart at the insertion point.
- Insert the slave tree at the insertion point.

In non-procedural language, you would perform these steps together, using the logic of all the points listed. You begin this process by asking questions and noting the facts:**Q)**How do you choose an ancestor tree and its subordinate tree in the forest?**A)** Looking for a single key value that is a descendant in the ancestor tree and the root of the subordinate tree;**Q)**How to determine how much to expand the ancestor tree?**A)**This is the size of the slave tree equal to (2 * (select count(*) from Subordinate)).**Q)**How do I determine the insertion point?**A)**This is a row in the ancestor table where the emp value is equal to the boss value in the slave table. You want to place the slave tree to the left left of that shared node. Small algebraic calculations give you a number appended to all left and right values to the right of the insertion point.

The easiest way to explain this is with the help of the table of relations shown in Table 1.

**TRANSLATOR'S NOTE:** I don't know what he meant about the simplest way of explanation, but I didn't understand a damn thing in this table :)))) If you understand everything, then explain to me, pls, by letter :)))

You're ready to write a procedure that combines two trees:

CREATE PROCEDURETreeMerge(superiorNOT NULL, subordinateNOT NULL)BEGINDECLAREsizeINTEGER NOT NULL;DECLAREinsert_pointINTEGER NOT NULL;SETsize = 2 * (SELECT COUNT(*)FROMWorkingTreeWHEREemp = subordinate);SETinsert_point = (SELECT MIN(lft)FROMWorkingTreeWHEREemp = subordinateANDboss = superior) - 1;UPDATEWorkingTreeSETboss =CASE WHENboss = subordinateTHEN CASE WHENemp = subordinateTHEN NULLELSEsuperiorENDELSEbossEND, lft =CASE WHEN(boss = superiorANDlft > size)THENlft + sizeELSE CASE WHENboss = subordinateANDemp <> subordinateTHENlft + insert_pointELSElftENDEND, rgt =CASE WHEN(boss = superiorANDrgt > size)THENrgt + sizeELSE CASE WHENboss = subordinateANDemp <> subordinateTHENrgt + insert_pointELSErgtENDENDWHEREbossIN(superior, subordinate);

--Remove redundant copies of the child tree root

DELETE FROMWorkingTreeWHEREbossIS NULL ORempIS NULL;END;

It is very easy to detect pairs of external and subordinate trees in a WorkingTree table. The following query becomes empty when all bosses are set to the same value:

CREATE VIEWAllPairs (superior, subordinate)ASSELECTW1.boss, W1.empFROMWorkingTree AS W1WHERE EXISTS(SELECT*FROMWorkingTreeASW2WHEREW2.boss = W1.emp)ANDW1.boss <> W1.emp;

But you would like to get only one pair that you will transfer to the newly developed procedure. To move one pair, take the leftmost pair from the previous query:

CREATE VIEWLeftmostPairs(superior, subordinate) ASSELECT DISTINCTsuperior, (SELECT MIN(subordinate)FROMAllPairsASA2WHEREA1.superior = A2.superior)FROMAllPairsASA1WHEREsuperior = (SELECT MIN(superior)FROMAllPairs);

Now all you have to do is put this request into a previously developed procedure – and you will have a procedure that will merge together a forest of trees from left to right. Using a WHILE loop that monitors the presence of values in LeftmostPairs, make procedure calls. This is the only procedural structure in the entire stored procedure.

**Table 1 Resource requirements by component**

C1 | row in superior | y | y | y | y | n | y | n |

C2 | row in subord | n | n | n | n | y | y | n |

C3 | lft > cut | n | n | y | y | - | - | - |

C4 | rgt > cut | n | y | n | y | - | - | - |

A1 | Error | 1 | 1 | |||||

A2 | lft := lft + size | 1 | ||||||

A3 | rgt := rgt + size | 1 | 2 | |||||

A4 | lft := lft | 1 | 2 | 1 | ||||

A5 | rgt := rgt | 2 | 2 | |||||

A6 | lft := lft + cut | 1 | ||||||

A7 | rgt := rgt + cut | 2 |