Normalization

What is Normalization?

Data normalization, also known as database normalization, is the process of organizing a database to reduce data redundancy. Put simply, normalization ensures that each field and record is stored in a logical place, which not only avoids redundancy but also makes a relational database more efficient to use: it helps prevent data-entry errors and accidental deletions, and it simplifies updates. The idea is easy to grasp, but applying it is more involved than it first appears, because normalization follows specific rules that dictate how a database should be organized.

 

Benefits of Normalization

Let's explain database normalization with the example of a music streaming platform's relational database. The data structures can become intricate: songs, artists, albums, and user preferences all have to be logically separated yet interrelated. Here's how normalization handles the potential anomalies in such a system:
  • Update Anomalies: If a song's details change and that song is listed in multiple places in the database, every instance needs to be updated. Any oversight leads to inconsistent data, as the sketch after this list shows.
  • Insertion Anomalies: Adding a new artist without songs can be problematic if the design mandates that every artist must have associated songs.
  • Deletion Anomalies: Deleting a user's playlist shouldn't imply deleting the user's entire profile, but poor design might inadvertently link these actions.
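To make the first of these anomalies concrete, here is a minimal sketch in Python with SQLite. The flat playlist_tracks table, its columns, and the sample rows are hypothetical, made up purely for illustration; the point is only that a value stored in several rows has to be updated in all of them.

```python
import sqlite3

# Hypothetical flat table: a song's details are repeated for every playlist
# that contains it, so the same fact is stored more than once.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE playlist_tracks (
        playlist   TEXT,
        song_id    INTEGER,
        song_title TEXT,   -- duplicated in every playlist that has the song
        artist     TEXT
    )
""")
conn.executemany(
    "INSERT INTO playlist_tracks VALUES (?, ?, ?, ?)",
    [
        ("Workout", 1, "Sonng A", "Artist X"),  # misspelled title, stored twice
        ("Chill",   1, "Sonng A", "Artist X"),
        ("Chill",   2, "Song B",  "Artist Y"),
    ],
)

# Correcting the title in only one of its occurrences leaves the other copy
# untouched, so the same song now carries two different titles.
conn.execute(
    "UPDATE playlist_tracks SET song_title = 'Song A' "
    "WHERE song_id = 1 AND playlist = 'Workout'"
)
print(conn.execute(
    "SELECT DISTINCT song_title FROM playlist_tracks WHERE song_id = 1"
).fetchall())
# [('Song A',), ('Sonng A',)]  -> inconsistent data: an update anomaly
```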

 

Stages of Normalization

To minimize redundancy and maintain data integrity, databases are brought through a series of normal forms:
First Normal Form (1NF)
  • Each field contains only a single, atomic (indivisible) value.
  • No rows are exact duplicates of one another.
  • The table has a primary key, and all other fields depend on it.

Before Normalization (Not in 1NF)

 

Student ID Subjects
1 Math, English
2 Science, English
3 Math, Science

In the table above, the "Subjects" column contains multiple values (not atomic), so it is not in 1NF.

After Normalization (In 1NF)

 

Student ID Subject
1 Math
1 English
2 Science
2 English
3 Math
3 Science

In this normalized table, each column contains only atomic values, satisfying the conditions of 1NF. We have split the "Subjects" column into individual rows, so each row represents a single, indivisible data point.
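As a rough illustration of the same split done in code, the sketch below (plain Python, with made-up variable names) expands the non-1NF rows, where "Subjects" is a comma-separated string, into one (student, subject) pair per value:

```python
# Non-1NF rows: the second field packs several subjects into one value.
raw_rows = [
    (1, "Math, English"),
    (2, "Science, English"),
    (3, "Math, Science"),
]

# Expand each comma-separated list into one row per subject (1NF).
normalized_rows = [
    (student_id, subject.strip())
    for student_id, subjects in raw_rows
    for subject in subjects.split(",")
]

print(normalized_rows)
# [(1, 'Math'), (1, 'English'), (2, 'Science'), (2, 'English'),
#  (3, 'Math'), (3, 'Science')]
```

Each resulting pair corresponds to one row of the 1NF table above.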

Second Normal Form (2NF)

  • Meets all the requirements of 1NF.
  • Additionally, every non-key attribute (column) is functionally dependent on the entire primary key, not just part of it.

Before Normalization (Not in 2NF)

Let's assume we have a database of a store's sales records:

InvoiceID ProductID ProductName Quantity UnitPrice TotalPrice CustomerName
101 1 ProductA 2 10 20 Alice
101 2 ProductB 1 20 20 Alice
102 1 ProductA 3 10 30 Bob

In this table, the combination of InvoiceID and ProductID forms the primary key. However, ProductName and UnitPrice depend only on ProductID, and CustomerName depends only on InvoiceID; because these attributes do not depend on the entire primary key, the table is not in 2NF.

After Normalization (In 2NF)

To normalize this table into 2NF, we need to remove partial dependencies by splitting the table into multiple tables, so that all attributes in each table are functionally dependent on the entire primary key.

Invoice Table

InvoiceID CustomerName
101 Alice
102 Bob

Product Table

ProductID ProductName UnitPrice
1 ProductA 10
2 ProductB 20

Sales Table

InvoiceID ProductID Quantity TotalPrice
101 1 2 20
101 2 1 20
102 1 3 30

Now, the Product Table contains information solely about products, functionally dependent on ProductID, and the Invoice Table contains information about invoices, dependent on InvoiceID. The Sales Table links products with invoices, with all attributes functionally dependent on the composite primary key (InvoiceID and ProductID). This database structure is in 2NF.
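For a sketch of how this 2NF structure might be declared, here is the schema written as SQLite DDL from Python. The table and column names mirror the tables above; the foreign-key references are an assumption about how the tables would be linked in a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        invoice_id    INTEGER PRIMARY KEY,
        customer_name TEXT   -- depends only on the invoice
    );

    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,   -- name and price depend only on the product
        unit_price   REAL
    );

    CREATE TABLE sales (
        invoice_id  INTEGER REFERENCES invoice(invoice_id),
        product_id  INTEGER REFERENCES product(product_id),
        quantity    INTEGER,
        total_price REAL,
        PRIMARY KEY (invoice_id, product_id)  -- composite key; quantity and
                                              -- total_price depend on all of it
    );
""")
```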

Third Normal Form (3NF)

  • Meets all the requirements of 2NF.
  • Non-prime attributes must be dependent on the primary key alone, not on other non-prime attributes. This stage further removes transitive dependencies.

Before Normalization (Not in 3NF)

Consider a database that stores information about employees and their departments:

EmployeeID EmployeeName DepartmentID DepartmentName DepartmentLocation
1 Alice 100 HR Building A
2 Bob 100 HR Building A
3 Charlie 200 IT Building B

In this table, the primary key is EmployeeID. However, DepartmentName and DepartmentLocation are transitively dependent on EmployeeID through DepartmentID. Therefore, the table is not in 3NF.

After Normalization (In 3NF)

To normalize this table to 3NF, we need to remove the transitive dependencies by splitting the table into more tables, so that all attributes in each table are functionally dependent only on the primary key.

Employee Table

EmployeeID EmployeeName DepartmentID
1 Alice 100
2 Bob 100
3 Charlie 200

Department Table

DepartmentID DepartmentName DepartmentLocation
100 HR Building A
200 IT Building B

Now, in the Employee Table, every attribute is functionally dependent only on the primary key, EmployeeID, and in the Department Table, every attribute is functionally dependent only on the primary key, DepartmentID. This database structure is in 3NF, as it has no transitive dependencies, and every non-key attribute is functionally dependent only on the primary key.
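A comparable sketch of the 3NF result as SQLite DDL might look like the following; the foreign key from employee to department is again an assumption about how the two tables would be connected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (
        department_id       INTEGER PRIMARY KEY,
        department_name     TEXT,   -- these two columns now depend only
        department_location TEXT    -- on department_id, not on employee_id
    );

    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT,
        department_id INTEGER REFERENCES department(department_id)
    );
""")
```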

BCNF (Boyce-Codd Normal Form) Rules

BCNF takes the principles of 3NF a step further, imposing more stringent conditions to ensure the structural integrity of your database. Here's a simple guide to understanding when your database meets the BCNF standard:

  • Single Column Primary Key: If your primary key consists of just one column, you're in luck! Once your database satisfies the 3NF rules, it automatically complies with BCNF as well.
  • Composite Primary Key: Things get a bit more complex if your primary key is composed of several columns. In this case, to meet BCNF criteria, you must:
    • Ensure your database meets all 3NF requirements.
    • Ensure that columns within the primary key do not have dependencies on columns that are not a part of the primary key.

Now let's look at an example that illustrates these rules:

Before Normalization (Not in BCNF)

Consider a database table storing information about courses and the professors teaching them in different semesters:

CourseID ProfessorID Semester ProfessorName
101 1 Spring Dr. Smith
102 2 Fall Dr. Johnson
101 1 Fall Dr. Smith

In this table, the combination of CourseID, ProfessorID, and Semester forms the composite primary key. However, ProfessorName is determined by ProfessorID alone, which is only part of the primary key rather than a superkey, so the table is not in BCNF.

After Normalization (In BCNF)

We can normalize this table into BCNF by removing the dependency of ProfessorName on only part of the composite primary key. We split the table into two:

Professor Table

ProfessorID ProfessorName
1 Dr. Smith
2 Dr. Johnson

Course Table

CourseID ProfessorID Semester
101 1 Spring
102 2 Fall
101 1 Fall

Now, in the Professor Table, the ProfessorName is only dependent on ProfessorID, and in the Course Table, all attributes are dependent on the composite primary key (CourseID, ProfessorID, and Semester), thus meeting the criteria for BCNF.
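As before, a sketch of this BCNF split as SQLite DDL (with assumed foreign-key wiring) could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE professor (
        professor_id   INTEGER PRIMARY KEY,
        professor_name TEXT   -- determined by professor_id alone
    );

    CREATE TABLE course (
        course_id    INTEGER,
        professor_id INTEGER REFERENCES professor(professor_id),
        semester     TEXT,
        PRIMARY KEY (course_id, professor_id, semester)
    );
""")
```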

 

Normalization vs. Denormalization

While normalization structures databases to minimize data redundancy, it can scatter related data across many separate tables. Retrieving all courses taught by a specific professor, along with the respective classroom details, requires a JOIN across the "Professors", "Courses", and "Classrooms" tables. As the number of tables involved grows, such operations can noticeably slow down query responses.

Denormalization addresses this issue by deliberately reintroducing redundancy to reduce the need for JOIN operations at query time. This speeds up certain read operations and improves performance for specific tasks. In practice it means merging tables or adding redundant columns to existing ones so that data can be retrieved more directly. However, denormalization comes with hidden costs, including higher storage requirements and extra work to keep the duplicated data consistent.
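As a rough, self-contained sketch of that trade-off, the snippet below (Python with SQLite; the tables and the course_denorm name are hypothetical) runs the same report once against normalized tables, which needs a JOIN, and once against a denormalized table, which does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the professor's name is stored once, in its own table.
    CREATE TABLE professor (professor_id INTEGER PRIMARY KEY, professor_name TEXT);
    CREATE TABLE course (course_id INTEGER, professor_id INTEGER, semester TEXT);

    -- Denormalized: professor_name is copied into every course row.
    CREATE TABLE course_denorm (course_id INTEGER, professor_name TEXT, semester TEXT);

    INSERT INTO professor VALUES (1, 'Dr. Smith');
    INSERT INTO course VALUES (101, 1, 'Spring');
    INSERT INTO course_denorm VALUES (101, 'Dr. Smith', 'Spring');
""")

# Normalized read: a JOIN is needed to attach the professor's name.
print(conn.execute("""
    SELECT c.course_id, c.semester, p.professor_name
    FROM course AS c
    JOIN professor AS p ON p.professor_id = c.professor_id
""").fetchall())

# Denormalized read: a single-table scan, at the cost of keeping every
# copy of professor_name consistent whenever it changes.
print(conn.execute(
    "SELECT course_id, semester, professor_name FROM course_denorm"
).fetchall())
```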

The choice between normalization and denormalization isn't straightforward; it requires a clear understanding of the database's workload, query patterns, and performance requirements. With that understanding, you can craft a balanced design that weighs efficiency against data integrity.