As I was studying Data Vault concepts I found myself in a dire desire of understanding more of the third normal form, 3NF. Here in this article I will present you everything I’ve gathered so far about the 3NF.
Why is the 3NF important to Data Vault? Well, you certainly don’t need to be a jedi master of the 3NF to perform Data Vault modeling, but understanding its characteristics and how the many different keys behave in a RDBMS can definitely help creating your model.
Before we can fully understand what 3NF is, we need to understand 1NF and 2NF.
Note: Please, see the reference at the bottom of this article for all material used to produce this piece.
What is 1NF?
In order to meet 1NF a table must meet the following criteria:
- The data are stored in a two-dimensional table.
- There are no repeating groups.
- It must have a primary key
What are repeating groups?
A repeating group is an attribute that has more than one value in each row of a table. A repeating group is directly analogous to a multivalued attribute in an ER diagram.
To achieve 1NF you need to handle your table, much likely creating new tables, so data can be normalized.
Take the table players, for this example. It as many multivalued attributes, which inflicts 1NF.
player_id | items |
jane | 1 amulets, 2 rings, 1 shields |
john | 12 arrows, 1 bow, 2 rings |
mary | 2 rings, 1 staff, 2 hood |
lisa | 1 rings, 2 dagger, 2 hood |
In order to achieve 1NF we have to decompose this table in two tables, like these:
player_id |
jane |
john |
mary |
lisa |
Here played being the primary key.
player_id | Item_type | Quantity |
jane | amulets | 1 |
jane | rings | 2 |
jane | shields | 1 |
john | arrows | 12 |
john | bow | 1 |
john | rings | 2 |
mary | rings | 2 |
mary | staff | 1 |
mary | hood | 2 |
lisa | rings | 1 |
lisa | dagger | 2 |
lisa | hood | 2 |
Here the combination of player + items being the primary key.
What is 2NF?
In theoretical terms, second formal form relations are defined as follows:
- The relation is in first normal form.
- All non-key attributes are functionally dependent on the entire primary key. Meaning: no part-key dependency allowed.
Let’s enhance our last table, to add player rating:
player_id | item_type | item_quantity | player_rating |
jane | amulets | 1 | Intermediate |
jane | rings | 2 | Intermediate |
jane | shields | 1 | Intermediate |
john | arrows | 12 | Beginner |
john | bow | 1 | Beginner |
john | rings | 2 | Beginner |
mary | rings | 2 | Advanced |
mary | staff | 1 | Advanced |
mary | hood | 2 | Advanced |
lisa | rings | 1 | Advanced |
lisa | dagger | 2 | Advanced |
lisa | hood | 2 | Advanced |
On the table above item_quantity and player_rating are non-key attributes, meaning, they do not compose or participate on the primary key for this table. We can therefore say that: player_id + item_type determines item_quantity, the terminology being:
- { player_id , item_type } -> item_quantity
Meaning that, for every unique combination of player_id + item_type (aka, for every primary key) there is only one item_quantity. The next question is, are there other functional dependencies on this table? Yes, there are. Here are they:
- player_id -> player_rating
But why is the first functional dependency correct and not the second one?
Because functional dependency is a one-way relationship between two attributes such that at any given time, for each unique value of attribute A, only one value of attribute B is associated with it.
And, as we can see here, on the functional dependency “player_id -> player_rating” there are no uniqueness on this relationship.
Now that we have identified all the functional dependencies (FD) we can use the determinant of each of this FDs, and create a separate table, a separate entity, using those determinants as primary key. Like these:
Note on determinant: To clarify what are our determinants here we first need to know which are the Functional Dependencies we got, so here are the FDs:
- { player_id , item_type } -> item_quantity
- player_id -> player_rating
So we have determinant { player_id , item_type } for our FD #1 and player_id as our determinante for FD #2.
What is the definition of Determinant?
An attribute on which other attributes are functionally dependent on.
player_id | player_rating |
jane | Intermediate |
john | Beginner |
mary | Advanced |
lisa | Advanced |
player_id | item_type | item_quantity |
jane | amulets | 1 |
jane | rings | 2 |
jane | shields | 1 |
john | arrows | 12 |
john | bow | 1 |
john | rings | 2 |
mary | rings | 2 |
mary | staff | 1 |
mary | hood | 2 |
lisa | rings | 1 |
lisa | dagger | 2 |
lisa | hood | 2 |
By doing that we assure it’s now on 2NF, and that there are no part-key dependencies, in other words, every attribute depends on the whole primary key.
What is the 3NF?
The theoretical definition of third normal form says:
- The relation is in second normal form.
- There are no transitive dependencies.
Another way to describe 3NF:
- Each non-prime attribute in a table should depend on every candidate key.
- It should never depend on part of a candidate key.
- It should never depend on other non-prime attributes.
Know this already, the 3NF* mantra:
Every attribute in a table should depend on the key, the whole, and nothing but the key.
So, let’s take this enhanced player table:
player_id | player_rating | player_skill_level |
jane | Intermediate | 6 |
john | Beginner | 3 |
mary | Advanced | 8 |
lisa | Advanced | 9 |
We added player_skill_level, a scale so we know how high is the player skill level.
skill_level | ||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
beginner | intermediate | advanced |
Let’s assume that JOHN skill level went up from 3 to 4, but, for some reason, the player_rating wasn’t properly updated, resulting on this table:
player_id | player_rating | player_skill_level |
jane | Intermediate | 6 |
john | Beginner | 4 |
mary | Advanced | 8 |
lisa | Advanced | 9 |
As we can see, player_rating should be Intermediate, but it is showing Beginner due to an update anomaly.
It is important to say that it is easy for us, as humans, to understand that there is an error on this table, but this table is perfectly fine regarding 2NF, it is in compliance with 2NF as all non-keys attributes are functionally dependent on the entire primary key.
So the real question is, how are those non-key attributes dependent on the key?
Player_rating is dependent on player_id. Like this:
- player_id -> player_rating
Player_skill_level is also dependent on player_id but indirectly, like this:
- player_id -> player_skill_level -> player_rating
- this is a transitive dependency, where player_rating depends on player_skill_level
A dependency like the one of player_rating is called a transitive dependency. A dependency of a non-key attribute on another non-key attribute.
To achieve 3NF we have to adjust our tables so they look like this:
player_id | player_skill_level |
jane | 6 |
john | 4 |
mary | 8 |
lisa | 9 |
player_skill_level | player_rating |
1 | Beginner |
2 | Beginner |
3 | Beginner |
4 | Intermediate |
5 | Intermediate |
6 | Intermediate |
7 | Advanced |
8 | Advanced |
9 | Advanced |
*3NF can be described as:
Every non-key attribute in a table should depend on the key, the whole, and nothing but the key.
*3NF enhanced version, Boyce-Codd Normal Form (BCNF), can be described as:
Every attribute in a table should depend on the key, the whole, and nothing but the key.
Note about Transitive Dependencies:
Transitive dependencies comes from a functional dependency pattern, such as: A -> B and B -> C; therefore A -> C. A transitive dependency therefore exists only when the determinant that is not the primary key is not a candidate key for the relation (remember: relation here is the relation between attributes).
When you have a transitive dependency in a 2NF relation, you should break the relation into two smaller relations, each of which has one of the determinants in the transitive dependency as its primary key. The attributes determined by the determinant become non-key attributes in each relation. This removes the transitive dependency and places the relation in third normal form.
What is Boyce-Codd Normal Form?
There is not much of a difference between 3NF and BCNF, in fact, in the real world most 3FN models are in fact BCNF models.
Note: If your relations are in third normal form and do not exhibit the special characteristics that BCNF, 4NF, and 5NF were designed to handle, then they are automatically in 5NF.
Before discussing about BCNF let’s review 3NF definition:
- Each non-prime attribute in a table should depend on every candidate key.
- It should never depend on part of a candidate key.
- It should never depend on other non-prime attributes.
However, if you pay close attention to this 3NF definition, you will notice that if every attribute on a table is a prime attribute, then transient dependencies are going to happen, and that, regardless of being accepted as 3NF, is a design failure. That’s why we have BCNF.
See this table:
release_year | popularity_ranking | movie_name | release_year_and_month |
2008 | 1 | The Dark Knight | 2008-07 |
2008 | 2 | Indiana Jones | 2008-05 |
2008 | 3 | Kung Fu Panda | 2008-06 |
2009 | 1 | Avatar | 2009-12 |
2009 | 2 | Harry Potter | 2009-07 |
2009 | 3 | Ice Age | 2009-07 |
Here are the candidate keys:
- Movie_name
- { release_year , popularity_ranking }
- { release_year_and_month , popularity_ranking }
All keys are prime attributes, there are no non-prime attributes.
The problem we have appears on the following functional dependency:
- release_year_and_month -> release_year
This problem occurs because if you happen to update the release_year column you can end with an update anomaly if your column release_year do not update accordingly. Like this:
release_year | popularity_ranking | movie_name | release_year_and_month |
2008 | 1 | The Dark Knight | 2008-07 |
2008 | 2 | Indiana Jones | 2008-05 |
2008 | 3 | Kung Fu Panda | 2018-06 |
2009 | 1 | Avatar | 2009-12 |
2009 | 2 | Harry Potter | 2009-07 |
2009 | 3 | Ice Age | 2009-07 |
However, even knowing that there is an anomaly in this table, according to 2NF and 3NF the table complies, and that’s why BCNF enhance the 3NF definition.
To be in Boyce-Codd normal form, a relation must meet the following rules:
- The relation must be in third normal form.
- All determinants must be candidate keys.
What is a Determinant?
An attribute on which other attributes are functionally dependent on.
What is a Candidate Key?
A candidate key is an attribute, or set of attributes, that uniquely identify a row in the table. More detailed way of putting it: a column or combination of columns (columns also means attributes) that can be used as the primary key of a relation (relation here is about attribute to attribute relation).
What is a non-prime attribute?
A prime attribute is an attribute that belongs to at least one candidate key, therefore, non-prime attributes are attributes that doesn’t belong to any candidate key.
As we know, release_year depends on release_year_and_month, that’s our problem, because release_year_and_month is a determinant, since it determines release_year, but elease_year_and_month is not a candidate key.
To fix this we can simple replace the release_year_and_month attribute for release_month attribute.
release_year | popularity_ranking | movie_name | release_month |
2008 | 1 | The Dark Knight | July |
2008 | 2 | Indiana Jones | May |
2008 | 3 | Kung Fu Panda | June |
2009 | 1 | Avatar | December |
2009 | 2 | Harry Potter | July |
2009 | 3 | Ice Age | July |
The table is now in BCNF, and here we can informally state that BCNF definition as:
Every attribute in a table should depend on the key, the whole, and nothing but the key**.
** Keep in mind that “KEY” here refers to “every candidate key”.
References
Learn Database Normalization – 1NF, 2NF, 3NF, 4NF, 5NF
Learn Boyce-Codd Normal Form (BCNF)
https://www.sciencedirect.com/topics/computer-science/transitive-dependency
https://www.sciencedirect.com/topics/computer-science/third-normal-form
Database Keys Made Easy – Primary, Foreign, Candidate, Surrogate, & Many More
Jan L. Harrington, in Relational Database Design (Third Edition), 2009