Monday, May 26, 2008

Hierarchical data in MySQL (and other RDBMS)

Introduction

There are lot of cases we want to store hierarchical data into relational database instead of hierarchical ones like XML databases. Several approaches exist and are already well explained, the most well known are the adjacency list and the nested set models. After briefly introducing those models I will present an extension to the first one making it much more usable. This extension has already been presented by Joe Celko under the name Path enumeration

The adjacency list model

The most common way to store such data is using the adjacency list model as introduced by former IBM fellow Edgar F. Codd (the father of relational database theory):

idnameboss
1AnneNULL
2Bernard1
3Charlie1
4Delphine3
5Elodie3
6Fanny3
7Georges5

Such model respects fully the relational idea based on primary and foreign keys, but this system shows his limit when we have to retrieve the full path to a node: "Anne > Charlie > Elodie > Georges" or when we need all people working below Charlie. Such question typically requires recursion with many queries which may become very slow:

-- Path to Georges?
SELECT * FROM people WHERE id = 7; -- boss = 5
SELECT * FROM people WHERE id = 5; -- boss = 3
SELECT * FROM people WHERE id = 3; -- boss = 1
SELECT * FROM people WHERE id = 1; -- boss = NULL (STOP)

-- People under Charlie?
SELECT * FROM people WHERE boss IN (3); -- id = 4,5,6
SELECT * FROM people WHERE boss IN (4,5,6); -- id = 7
SELECT * FROM people WHERE boss IN (7); -- no results (STOP)

The nested set model

The second popular approach is to make use of the depth first traversal algorithm to assign a left and right number to any node of the tree:

idnamelftrgt
1Anne114
2Bernard23
3Charlie413
4Delphine56
5Elodie710
6Fanny1112
7Georges89

This model is algorithmically beautiful! It solves in an elegant way the problem of running multiple queries to retrieve hierarchical information:

-- Path to Georges?
SELECT * FROM people WHERE 8 BETWEEN lft AND rgt ORDER BY lft;

-- People under Charlie?
SELECT * FROM people WHERE lft BETWEEN 4 AND 13 ORDER BY lft;

While this model is very good at retrieving hierarchical information, it is somewhat more complex and less competitive at updating the hierarchy (inserting, moving or removing nodes) since it requires to update several records even if it is feasible in one query. Not to mention that this model is not relational at all and does not prevent through integrity constraints the accidental removal of a boss! However, this is easily circumvented by adding the boss column and then mixing both models.

Path enumeration model

The path enumeration model is based on the adjacency list one, the idea is quite simple: add a column materializing the full path (which is unique) to your node.

idnamebosspath
1AnneNULL/1/
2Bernard1/1/2/
3Charlie1/1/3/
4Delphine3/1/3/4/
5Elodie3/1/3/5/
6Fanny3/1/3/6/
7Georges5/1/3/5/7/

The path is always computed by taking the path of the parent node concatenated with "<ID>/". Retrieving the path to Georges or people working for Charlie is a kid's game:

-- Path to Georges?
SELECT path FROM people WHERE id = 7; -- path = /1/3/5/7/
SELECT * FROM people WHERE id IN (1,3,5,7) ORDER BY path;

-- People under Charlie?
SELECT * FROM people WHERE path LIKE '/1/3/%' ORDER BY path;

To benefit from all the speed of this solution, you should have a UNIQUE KEY on your path column. Unfortunately it isn't (yet?) possible to know the assigned auto-increment ID inside an insert-trigger using MySQL, it is then mandatory to work in two phases while inserting a record. A first solution is to insert dummy data in the path while taking care of the UNIQUE KEY constraint and then updating the record using LAST_INSERT_ID(). My preference goes to using another table (people_tree) with a 1..1 relation. Here is an example:

-- Adding Helena below Bernard
INSERT INTO people (name, boss) VALUES ('Helena', 2);
INSERT INTO people_tree
SELECT p.id, CONCAT(pt.path, p.id, '/') FROM people p JOIN people_tree pt ON p.boss = pt.id WHERE p.id = LAST_INSERT_ID();

Of course, this complexity may be hidden in a stored procedure.

6 comments:

Anonymous said...

Am I missing something or does the path enumeration model lack a clear method of sorting the tree when you don't want it in the order that items were added?

Patrick Allaert said...

@anonymous: you are right, you have the choice to use an additional column for this or defining a fixed length (padding with 0's) for your ID in the path. This way you either sort on the additional column or on the path directly.

Anonymous said...

Hi,
I have a few question about the path enumeration model :

1/ Is there an error in the path enumeration model query :

-- People under Charlie?
SELECT * FROM people WHERE path LIKE '/1/3/%' ORDER BY lft;


?
The order column should be 'path' ?

2/ What happen when you have ids over 9 ? For example, when you have to order the 'path' column, how does mysql reacts when ordering values like 1/8/1, 1/9/1 and 1/10/1 ?
The result will be
1/10/1
1/8/1
1/9/1
so it is incorrect ?

3/ Please, can you give a query example for the question about ordering in comments ?

Thank you

Patrick Allaert said...

Hi anonymous,

1/ Yes, there was an error in the order by, it has been corrected. Thanks for the notice.

2/ The idea behind ordering on the path is to group somewhat the information, not to have a "correct" priority between sub-nodes.
For example:
/1/3/
/1/3/11/
/1/3/4/
/1/3/5/
/1/3/5/7/
/1/3/5/7/10/
/1/3/5/7/13/
/1/3/5/7/9/
/1/3/6/
/1/3/6/8/
/1/3/6/12/

As you see, nodes are not sorted by time of insert (ID), but rather grouped by node. It might me interesting while processing the information to know that children of a node always follows directly the parent. Take for example: /1/3/5/7/ which is directly followed by their children.

If you really need an ordering mechanism depending on the time record are inserted, you may want to have a fixed length for ID's padded with 0's in the path. Taking my previous example:

/001/003/
/001/003/004/
/001/003/005/
/001/003/005/007/
/001/003/005/007/009/
/001/003/005/007/010/
/001/003/005/007/013/
/001/003/006/
/001/003/006/008/
/001/003/006/012/
/001/003/011/

3/ Ordering on a path with ID's padded with 0's might not be the most flexible solution, especially if you consider changing the order of sub-nodes. You may consider adding an additional "priority" column that gives you the order of nodes below a common parent. You might then order by: "path, priority".
Example:

/1/3/ (1)
/1/3/4/ (1)
/1/3/5/ (2)
/1/3/5/7/ (1)
/1/3/5/7/13/ (1)
/1/3/5/7/9/ (2)
/1/3/5/7/10/ (3)
/1/3/11/ (3)
/1/3/6/ (4)
/1/3/6/12/ (1)
/1/3/6/8/ (2)

Anonymous said...

Hi, it's anonymous again ^^
Thank you for your answer.

I agree with you for the last example, but if you use 'path' column to order, the 'priority' column will have no effect because all paths are differents.

Maybe to make this work, the path column shouldn't have the last path level (where you repeat the line id).
Like this all elements that belongs to the same parent have the same path, and this would allow an 'order' query to work on a second column ?

Minesh said...

@Anonymous : I know I am posting this after 3 years, but it might be helpful to others.

Just storing the the path up-to parent will not solve the issue.

See if you remove the last level (line id) from path below will be the result:

/1/ (1)
/1/3/ (1)
/1/3/ (2)
/1/3/ (3)
/1/3/ (4)
/1/3/5/ (1)
/1/3/6/ (1)
/1/3/6/ (2)
/1/3/5/7/ (1)
/1/3/5/7/ (2)
/1/3/5/7/ (3)

Obviously this is not desired result, as it first lists all the level 1 nodes, then level 2 nodes and so on.

We need the tree to be printed in right order, from parent to leaf in that order.

To achieve this, we need to use the combination of padded zero paths and priority columns.

So we need to store path with padded zeros as per described by Patrick.
and then use Order by path,priority.

This would give desired results.


To add more, number of zero's to be padded must be decided based on what is the expected highest value of line-id through out your system.

Let say you assume that system will not have more than 9999 value through out its life time, you can add 3 leading zero's

e.g. /0001/0002/


if max value can be 99999 add 4 leading zero's

e.g. /00001/00002/
so for line id = 99999

e.g. /99999/97012/

but if you get higher value say 100100 ( which is > 99999)

then /100100/ would come earlier than /9999/ (similar to 9 and 11 issue)

so please take care, I hope this would help someone.

e.g. /00001