Database Normalization

Database normalization is the process of restructuring a database to follow a series of normal forms. These are typically designated first normal form (1NF), second normal form (2NF), third normal form (3NF), etc. The goal is to reduce data redundancy and increase data integrity. This concept was initially introduced by Edgar F. Codd in 1970.

One key idea behind normalization is that each normal form must meet the criteria for the form before it. For example, before normalizing a relation to 3NF, it must first be in 2NF.

Why Normalize?

According to Codd, there are many benefits of normalizing data beyond 1NF:

A normalized database does not have undesirable insertion, update and deletion anomalies.
When the database structure is extended, it’s much less likely that a normalized database will need to change its existing structure.
A normalized database is more informative to its users.
The design will remain adaptable and won’t become inefficient or slow due to changes in how the data is used or which queries become more common.

When a database is not normalized, the following types of anomalies can occur:

Insertion anomaly: It’s possible that a record can’t be inserted into a relation at all. This can occur when some of the required data is not yet available.
Update anomaly: If the same information is expressed in multiple rows, and one of the rows is updated, then a logical inconsistency between rows can occur.
Deletion anomaly: It’s possible that deleting certain records necessitates deleting other data from completely different facts.

The Objectives section of the article gives good examples of each of these anomalies.

Definitions

Before we can dive into normalization, there are a bunch of terms that are helpful to define.

Attribute: A database column.
Relation: A database table.
Superkey: Any set of attributes that uniquely identifies a row in a relation. The set of all attributes is known as the trivial superkey.
Candidate key (or just key): A minimal set of attributes that have a unique combination of values in each row. Removing any attribute from this set can produce duplicates in the database. A candidate key is a superkey that can’t be reduced by removing an attribute.
Primary key: A specific choice of a candidate key. This is purely a DBMS concept and not applicable to relational theory. However, from a practical standpoint, a candidate key can be thought of as the primary key of a relation (except when a relation has multiple candidate keys, which is very rare).
Alternate key (or secondary key): A candidate key that is not the primary key.
Prime attributes: The set of columns contained in any of the candidate keys.
Non-prime attributes: The columns that do not occur in any candidate keys.
Functional dependency: A constraint between two attributes in a database.

X → Y means that the values of Y are determined by the values of X. Two tuples sharing the same values of X will necessarily have the same values of Y.
Partial dependency: A non-prime attribute that is functionally dependent on any proper subset of any candidate key of the relation. This is discussed in more detail in the Second Normal Form section below.
Transitive dependency: A functional dependency where X → Z indirectly through another attribute Y (X → Y, Y → Z and Y ↛ X). Transitive dependencies are named from the transitive property in mathematics (if A = B and B = C, then A = C) or logical proofs (if A implies B and B implies C, then A implies C).

First Normal Form (1NF)

At the core of 1NF is the idea that values should be atomic. According to Codd, this means that the value cannot be broken down into smaller pieces.

Disagreement

Surprisingly, this definition is very hazy, and there’s some disagreement on what atomicity means. This Stack Overflow answer gives a wonderful overview of some of the history and disagreements with the definition.

In Codd’s original paper, he intended columns not to contain sub-relations (paraphrased). Since this, it’s been more loosely interpreted as not allowing complex sub-types. For example, a jsonb data type would be considered a violation of 1NF, as would a JSON string.

Another computer scientist, Chris Date, has proposed an alternate definition. However, it’s been criticized for disallowing null values.

This Stack Exchange answer gives an interesting definition that might be more practically useful.

"Atomic" means if a value has component parts, the DBMS either ignores the existence of those parts, or it provides functions to manipulate them.

This allows for data types such as a date, which can be further decomposed into a year, month and day. It also allows for more complex JSON types.

Summary

After sampling from several sources (including Chris Date’s definition), a good set of rules 1NF might be:

No ordering: The insertion order of rows or columns does not matter.
No duplication: No rows are duplicated. This is achieved by using a primary key.
Same type: Every value in a column should have the same data type.
Atomic: The values in columns cannot be decomposed to sub-values unless the DBMS provides manipulation functions. Generally, this means sticking with data types provided by the DBMS and now embedding pseudo-types within other types, such as JSON embedded in strings.

Second Normal Form (2NF)

In addition to being in 1NF, a relation in 2NF must not have any partial dependencies. A partial dependency is where a non-prime attribute is functionally dependent on a subset of the candidate key (as opposed to the entire candidate key).

This can be broken down as follows:

If a relation only has a single attribute in its candidate key, it will never have any partial dependencies, so it’s always in 2NF.
If a relation has a multi-attribute candidate key, then every non-prime attribute must depend on all candidate key attributes.

Example

Wikipedia gives this great example.

Manufacturer	Model	Manufacturer Country
Forte	X-Prime	Italy
Forte	Ultraclean	Italy
Dent-o-Fresh	EZbrush	USA
Brushmaster	SuperBrush	USA
Kobayashi	ST-60	Japan
Hoch	Toothmaster	Germany
Hoch	X-Prime	Germany

In this example, the candidate key is {Manufacturer, Model}. Manufacturer Country is a non-prime attribute that is functionally dependent on Manufacturer. Since this is only part of the candidate key, Manufacturer has a partial dependency.

Normalization

To normalize a relation to conform to 2NF, remove the partially dependent attributes and place them in a separate relation where all candidate key attributes determine the value of the non-prime attributes.

Third Normal Form (3NF)

A relation is in 3NF if:

The relation is in 2NF.
No non-prime attribute can be transitively dependent on a candidate key.

In other words, a relation would not be in third normal form if it contained a non-prime attribute that was transitively dependent on a candidate key through another attribute.

Example

This definition is a bit difficult to understand formally, but much easier to understand via an example. Wikipedia gives the following Tournament Winners table.

Tournament	Year	Winner	Date of Birth
Indiana Invitational	1998	Al Fredrickson	July 21, 1975
Cleveland Open	1999	Bob Albertson	September 28, 1968
Des Moines Masters	1999	Al Fredrickson	July 21, 1975
Indiana Invitational	1999	Chip Masterson	March 14, 1977

In this example, the Date of Birth is dependent on the Winner, and the Winner is dependent on the Candidate Key {Tournament, Year}, so Date of Birth is transitively dependent on {Tournament, Year}.

This could easily lead to an update anomaly where a Date of Birth is different in separate records because it’s possible to have multiple Date of Birth values.

Normalizing

Like 2NF, the approach to normalizing 3NF is to split the attributes with transitive dependencies into a separate relation.

Boyce-Codd Normal Form (BCNF)

Raymond Boyce and Edger Codd developed Boyce-Codd normal form in 1971 to address certain issues with 3NF. It can be thought of as a slightly stricter version of 3NF that removes the edge cases.

Boyce and Codd’s development of this normal form was actually predated by Ian Heath in 1971.

Since that definition predated Boyce and Codd's own definition by some three years, it seems to me that BCNF ought by rights to be called Heath normal form. But it isn't.

Rewriting Third Normal Form

Before defining BCNF, it would be helpful to rewrite 3NF using this equivilent definition created by Carlo Zaniolo in 1981.

A table is in 3NF if and only if for each of its functional dependencies X → Y, at least one of the following conditions holds:

X contains Y (that is, Y is a subset of X, meaning X → Y is a trivial functional dependency)

X is a superkey

Every element of Y − X, the set difference between Y and X, is a prime attribute (i.e., each attribute in Y − X is contained in some candidate key).

In other words, a relation is in 3NF if, for every functional dependency X → Y, one of the following is true:

Y is a subset of X
X is a superkey
Each element contained in Y - X is a prime attribute.

Definition

With that ironed out, we can define BCNF and simply lacking the last option for functional dependencies:

A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:

X → Y is a trivial functional dependency (Y ⊆ X)

X is a superkey for schema R

In other words, a relation is in 3NF if, for every functional dependency X → Y, one of the following is true:

Y is a subset of X
X is a superkey

Example

Wikipedia gives this tricky Tennis Court Bookings relation as an example.

Court	Start time	End time	Rate Type
1	09:30	10:30	SAVER
1	11:00	12:00	SAVER
1	14:00	15:30	STANDARD
2	10:00	11:30	PREMIUM-B
2	11:30	13:30	PREMIUM-B
2	15:00	16:30	PREMIUM-A

Each row represents a booking at a tennis club. The tennis club has one hard court (Court 1) and one grass court (Court 2).
Each booking has a rate type associated with it. The court and the booker’s membership status determine the Rate Type:
- Court 1 bookings made by members have SAVER.
- Court 1 bookings made by non-members have STANDARD.
- Court 2 bookings made by members have PREMIUM-A.
- Court 2 bookings made by non-members have Premium-B.

Combining these attributes results in four possible candidate keys:

S₁ = {Court, Start time}
S₂ = {Court, End time}
S₃ = {Rate type, Start time}
S₄ = {Rate type, End time}

Intuitively, this makes sense. Court bookings cannot overlap, and the Rate Type is derived partially from the Court.

The tricky thing about these examples is none of the attributes in the Court Bookings table are non-prime because all attributes are contained in at least one of the candidate keys. Therefore, this table can never violate 3NF.

However, it violates BCNF because the functional dependency Rate Type → Court because Court is not a subset of Rate Type and Rate Type is not a superkey.

Normalization

Like the other options, the key to normalizing to 3NF involves splitting the table. However, unlike 3NF, this is not always possible. 😮