My personal notes for Database Normalization

A video by Studytonight

Normalization is a technique of organizing data into multiple, related tables to minimize data redundancy.

Data redundancy is the representation of data in multiple places. This has many drawbacks:

  • Size: Redundant data takes up extra space.
  • Insertion anomaly: When inserting duplicate data, you create data redundancies.
  • Updation anomaly: If you need to update duplicated data, you must do so in all the places where it’s referenced. “If even a single row is left out, it will lead to inconsistent data.”
  • Deletion anomaly: If we delete all instances of the redundant data in our database, we can unintentionally delete the associated context as well. When data exists in columns that belong in a separate table, there’s no way for us to represent the case where there are 0 records.

Normalization solves these problems. The goal is to create logical, independent, but related data.

There are three basic normal forms and one bonus form. Each form requires the previous forms as prerequisites.

First Normal Form (1NF)

This form aims to design your table scalably so it can be easily extended. This form has four rules:

  1. Each column should contain atomic values. An atomic value is a value that cannot be divided. In other words, the column value represent a single value and not multiple values.
  2. Every value in a column should have the same type.
  3. Each column should have a unique name. (Is it even possible to break this rule?)
  4. The order in which you store your data doesn’t matter.

Second Normal Form (2NF)

“There should be no partial dependency in the data.” No attribute should depend on only part of the primary key. In other words, a non-prime attribute depends on a part of the primary key.

More formally, a table has a composite key ABC, and there’s a column D that’s only dependent on A, then the table has a partial dependency. Don’t do this!

By definition, violations of second normal form can’t occur when a table only uses a single column for its primary key. It often creeps up when a column only depends on one of the columns in the key.

Example

The following table is a violation of Second Normal Form.

Student ID Course ID Instructor Name
1 CS101 Smith
1 CS102 Johnson
2 CS101 Smith

In this example, the table has a composite primary key of Student ID and Course ID. However, Instructor Name depends only on Course ID and not Student ID.

Third Normal Form (3NF)

Columns can not have transitive dependencies. This occurs when an attribute in the table depends on a non-primary key. In other words, a non-prime attribute depends on a non-prime attribute.

Formally, if a table has a column B that’s dependent on A, and both are not prime attributes, then the table has a transitive dependency. If B depends on A, then A must be a primary key.

Example

A good example of a violation of Third Normal Form comes from the Third Normal Form Wikipedia article.

Tournament Year Winner Winner’s Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

The breach of 3NF occurs because the non-prime attribute (Winner's date of birth) is transitively dependent on the candidate key {Tournament, Year} through the non-prime attribute Winner. The fact that Winner's date of birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

BCNF (Boyce-Codd Normal Form)

The violation of this normal form occurs when a prime attribute depends on a non-prime attribute. For any columnar dependency B that depends on A, A should be a super key (a primary key).

This can be considered an upgrade to Third Normal Form.