My personal notes for Database Normalization

A video by Studytonight

Normalization is a technique of organizing data into multiple, related tables to minimize data redundancy.

Data redundancy is the representation of data in multiple places. This has many drawbacks:

  • Size: Redundant data takes up extra space.
  • Insertion anomaly: When inserting duplicate data, you create data redundancies.
  • Updation anomaly: If you need to update duplicated data, you must do so in all the places where it's referenced. "If even a single row is left out, it will lead to inconsistent data."
  • Deletion anomaly: If we delete all instances of the redundant data in our database, we can unintentionally delete the associated context as well. When data exists in columns that belong in a separate table, there's no way for us to represent the case where there are 0 records.

Normalization solves these problems. The goal is to create logical, independent, but related data.

There are three basic normal forms and one bonus form. Each form requires the previous forms as prerequisites.

Definitions

Before diving into normal forms, there are a few important definitions:

  • Superkey: Any combination of columns that uniquely identifies rows.
  • Candidate key: A minimal superkey (remove any column and it loses uniqueness)
  • Primary key: The candidate key you choose to use as the main identifier

First Normal Form (1NF)

This form aims to design your table scalably so it can be easily extended. This form has two rules:

  1. Each column should contain atomic values. An atomic value is a value that cannot be divided. In other words, the column value represents a single value and not multiple values.
  2. Each row must be uniquely identifiable (i.e., the table must have a primary key).

Second Normal Form (2NF)

"There should be no partial dependency in the data." No attribute should depend on only part of the primary key. In other words, a non-prime attribute depends on a part of the primary key.

More formally, a table has a composite key ABC, and there's a column D that's only dependent on A, then the table has a partial dependency. Don't do this!

By definition, violations of second normal form can't occur when a table only uses a single column for its primary key. It often creeps up when a column only depends on one of the columns in the key.

Example

The following table is a violation of Second Normal Form.

Student ID Course ID Instructor Name
1 CS101 Smith
1 CS102 Johnson
2 CS101 Smith

In this example, the table has a composite primary key of Student ID and Course ID. However, Instructor Name depends only on Course ID and not Student ID.

Third Normal Form (3NF)

Columns can not have transitive dependencies. This occurs when an attribute in the table depends on a non-primary key. In other words, a non-prime attribute depends on a non-prime attribute.

Formally, if a table has a column B that's dependent on A, and both are not prime attributes, then the table has a transitive dependency. If B depends on A, then A must be a primary key.

Example

A good example of a violation of Third Normal Form comes from the Third Normal Form Wikipedia article.

Tournament Year Winner Winner's Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

The breach of 3NF occurs because the non-prime attribute (Winner's date of birth) is transitively dependent on the candidate key {Tournament, Year} through the non-prime attribute Winner. The fact that Winner's date of birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

BCNF (Boyce-Codd Normal Form)

For every functional dependency A → B, A must be a superkey. No exceptions.

This is stricter than 3NF. While 3NF allows a non-superkey to determine a prime attribute, BCNF does not. The classic BCNF violation involves overlapping candidate keys where a non-key column determines part of a key.

Example

Student Subject Teacher
Alice Math Smith
Alice Physics Jones
Bob Math Smith

Constraints: each teacher teaches only one subject; students take each subject from one teacher.

Candidate keys: {Student, Subject} and {Student, Teacher}. But Teacher → Subject violates BCNF because Teacher alone isn't a superkey—even though all three columns are prime attributes.