Normalization
 
 

What is Normalization?

Data normalization, or database normalization, is a foundational process in relational database design. It’s about structuring data logically to reduce redundancy, minimize anomalies, and enforce data integrity. While the core idea is simple—break down large, repetitive tables into smaller, more focused ones—the application of normalization is governed by formal rules known as normal forms. The design choices made here have lasting consequences for performance, scalability, and maintenance.

At its best, normalization gives us clean, consistent, and update-safe schemas. At its worst, it can complicate query logic, especially in analytical workloads. But before debating when to normalize or denormalize, we need to understand how normalization works—and why it matters.

Why Normalize?

Let’s anchor this with a practical example: imagine a music streaming service managing data for songs, albums, artists, playlists, and user preferences.

In an unnormalized schema, song details may be copied across multiple rows, user playlists may store artist and album data redundantly, and updating an artist’s name might require scanning thousands of rows. This design can quickly lead to:

  • Update anomalies: Changing an artist’s name in one place but missing others leads to inconsistency.

  • Insert anomalies: You can’t add a new artist unless they already have a song.

  • Delete anomalies: Deleting a playlist might accidentally remove all reference to an artist or album if dependencies aren’t separated.

Normalization handles this by segmenting data into well-structured, interrelated tables. Each table represents a single subject (e.g., Artists, Albums, Users) and links to others via foreign keys—ensuring changes in one area don’t ripple destructively through the database.
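As a rough illustration, a normalized version of the streaming schema might look like the following. This is a minimal sketch with hypothetical table and column names; exact types and constraint syntax vary by database.

```sql
-- Minimal sketch of a normalized music-streaming schema (hypothetical names)
CREATE TABLE artists (
    artist_id   BIGINT PRIMARY KEY,
    artist_name VARCHAR(255) NOT NULL
);

CREATE TABLE albums (
    album_id   BIGINT PRIMARY KEY,
    artist_id  BIGINT NOT NULL REFERENCES artists (artist_id),
    album_name VARCHAR(255) NOT NULL
);

CREATE TABLE songs (
    song_id    BIGINT PRIMARY KEY,
    album_id   BIGINT NOT NULL REFERENCES albums (album_id),
    song_title VARCHAR(255) NOT NULL
);

-- Playlists reference songs by key instead of copying artist/album details
CREATE TABLE playlist_items (
    playlist_id BIGINT NOT NULL,
    song_id     BIGINT NOT NULL REFERENCES songs (song_id),
    PRIMARY KEY (playlist_id, song_id)
);
```

Renaming an artist now touches a single row in artists; every album, song, and playlist picks up the change through foreign keys.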

 

The Stages of Normalization

 

First Normal Form (1NF): Eliminate Multi-Valued Fields

A table is in 1NF if:

  • Each field contains only atomic (indivisible) values

  • There are no repeating groups

  • There is a clear primary key

Before 1NF

| Student ID | Subjects         |
|------------|------------------|
| 1          | Math, English    |
| 2          | Science, English |
| 3          | Math, Science    |

Here, the Subjects column violates atomicity—it stores multiple values in a single field.

After 1NF

| Student ID | Subject |
|------------|---------|
| 1          | Math    |
| 1          | English |
| 2          | Science |
| 2          | English |
| 3          | Math    |
| 3          | Science |

This redesign ensures each column holds a single value, and each record represents one fact.
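In SQL terms, the 1NF version stores one row per student–subject pair. The following is a sketch in generic SQL with hypothetical table and column names:

```sql
-- One atomic value per column, one fact per row
CREATE TABLE StudentSubjects (
    StudentID INT NOT NULL,
    Subject   VARCHAR(100) NOT NULL,
    PRIMARY KEY (StudentID, Subject)  -- each student/subject pairing stored once
);

INSERT INTO StudentSubjects (StudentID, Subject) VALUES
    (1, 'Math'), (1, 'English'),
    (2, 'Science'), (2, 'English'),
    (3, 'Math'), (3, 'Science');
```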

Second Normal Form (2NF): Remove Partial Dependencies

A table is in 2NF if:

  • It is in 1NF

  • Every non-key attribute is fully functionally dependent on the entire primary key (especially relevant for composite keys)

Before 2NF

| InvoiceID | ProductID | ProductName | Quantity | UnitPrice | CustomerName |
|-----------|-----------|-------------|----------|-----------|--------------|
| 101       | 1         | Apple       | 2        | 1.00      | Alice        |
| 101       | 2         | Banana      | 1        | 0.50      | Alice        |
| 102       | 1         | Apple       | 3        | 1.00      | Bob          |

  • ProductName and UnitPrice depend only on ProductID

  • CustomerName depends only on InvoiceID

This violates 2NF because some fields depend on part, not all, of the composite key.

After 2NF

Invoices

| InvoiceID | CustomerName |
|-----------|--------------|
| 101       | Alice        |
| 102       | Bob          |

Products

| ProductID | ProductName | UnitPrice |
|-----------|-------------|-----------|
| 1         | Apple       | 1.00      |
| 2         | Banana      | 0.50      |

Sales

| InvoiceID | ProductID | Quantity |
|-----------|-----------|----------|
| 101       | 1         | 2        |
| 101       | 2         | 1        |
| 102       | 1         | 3        |

Now each table has a single responsibility, and dependencies are cleanly isolated.
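A possible DDL for the decomposed tables follows, a sketch in generic SQL where the key and type choices are illustrative:

```sql
-- Each non-key attribute now depends on the whole key of its table
CREATE TABLE Invoices (
    InvoiceID    INT PRIMARY KEY,
    CustomerName VARCHAR(255) NOT NULL
);

CREATE TABLE Products (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(255) NOT NULL,
    UnitPrice   DECIMAL(10, 2) NOT NULL
);

CREATE TABLE Sales (
    InvoiceID INT NOT NULL REFERENCES Invoices (InvoiceID),
    ProductID INT NOT NULL REFERENCES Products (ProductID),
    Quantity  INT NOT NULL,
    PRIMARY KEY (InvoiceID, ProductID)  -- Quantity depends on the full composite key
);
```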

Third Normal Form (3NF): Remove Transitive Dependencies

A table is in 3NF if:

  • It is in 2NF

  • No non-prime attribute depends on another non-prime attribute (i.e., no transitive dependencies)

Before 3NF

| EmployeeID | EmployeeName | DepartmentID | DepartmentName | DepartmentLocation |
|------------|--------------|--------------|----------------|--------------------|
| 1          | Alice        | 101          | HR             | Building A         |
| 2          | Bob          | 101          | HR             | Building A         |
| 3          | Charlie      | 102          | IT             | Building B         |

Here, DepartmentName and DepartmentLocation depend on DepartmentID, which in turn depends on EmployeeID. That’s a transitive dependency.

After 3NF

Employees

| EmployeeID | EmployeeName | DepartmentID |
|------------|--------------|--------------|
| 1          | Alice        | 101          |
| 2          | Bob          | 101          |
| 3          | Charlie      | 102          |

Departments

| DepartmentID | DepartmentName | DepartmentLocation |
|--------------|----------------|--------------------|
| 101          | HR             | Building A         |
| 102          | IT             | Building B         |

Now the schema is free from transitive dependencies. Each attribute is directly related to its table’s key.
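In DDL form, the department facts move to their own table and employees reference them by key. This is a sketch in generic SQL with illustrative types:

```sql
-- Department facts live with the department key, not with each employee row
CREATE TABLE Departments (
    DepartmentID       INT PRIMARY KEY,
    DepartmentName     VARCHAR(100) NOT NULL,
    DepartmentLocation VARCHAR(100) NOT NULL
);

CREATE TABLE Employees (
    EmployeeID    INT PRIMARY KEY,
    EmployeeName  VARCHAR(255) NOT NULL,
    DepartmentID  INT NOT NULL REFERENCES Departments (DepartmentID)
);
```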

Boyce-Codd Normal Form (BCNF): Tighten the Rules

A table is in BCNF if:

  • It is in 3NF

  • Every determinant is a candidate key

This catches edge cases where 3NF is satisfied but anomalies still exist due to overlapping dependencies.

Example (Before BCNF)

| CourseID | ProfessorID | Semester | ProfessorName |
|----------|-------------|----------|---------------|
| 101      | 1           | Spring   | Dr. Smith     |
| 102      | 2           | Fall     | Dr. Johnson   |
| 101      | 1           | Fall     | Dr. Smith     |

Here, ProfessorID determines ProfessorName, but ProfessorID is not a candidate key of this table; it is only part of the composite key. That violates BCNF.

After BCNF

Professors

| ProfessorID | ProfessorName |
|-------------|---------------|
| 1           | Dr. Smith     |
| 2           | Dr. Johnson   |

Courses

| CourseID | ProfessorID | Semester |
|----------|-------------|----------|
| 101      | 1           | Spring   |
| 102      | 2           | Fall     |
| 101      | 1           | Fall     |

Now every determinant (e.g., ProfessorID) is a candidate key in its own table.
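If the wide table already exists, the decomposition can be populated directly from it. The sketch below assumes the original table is named CourseAssignments (a hypothetical name) and that each course offering has one professor per semester:

```sql
-- Professor facts move to their own table; the wide table backfills both
CREATE TABLE Professors (
    ProfessorID   INT PRIMARY KEY,
    ProfessorName VARCHAR(255) NOT NULL
);

CREATE TABLE Courses (
    CourseID    INT NOT NULL,
    ProfessorID INT NOT NULL REFERENCES Professors (ProfessorID),
    Semester    VARCHAR(20) NOT NULL,
    PRIMARY KEY (CourseID, Semester)  -- assumes one professor per course offering
);

INSERT INTO Professors (ProfessorID, ProfessorName)
SELECT DISTINCT ProfessorID, ProfessorName FROM CourseAssignments;

INSERT INTO Courses (CourseID, ProfessorID, Semester)
SELECT CourseID, ProfessorID, Semester FROM CourseAssignments;
```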

 

Normalization vs. Denormalization

Normalization and denormalization represent two opposing strategies in relational database design—each with distinct benefits, trade-offs, and performance implications. Choosing between them isn’t just a technical decision—it’s an architectural one, driven by query patterns, latency expectations, data freshness requirements, and the capabilities of your query engine.

 

Let’s break down the trade-offs:

| Aspect | Normalization | Denormalization |
|--------|---------------|-----------------|
| Redundancy | Minimized—each fact is stored once | Introduced deliberately—duplicate data for performance |
| Storage Efficiency | More efficient—no repetition of facts | Less efficient—same data may be stored across multiple rows or tables |
| Query Performance | JOINs required across tables; may be slower on legacy engines | JOINs avoided—single-table access is often faster for read-heavy use cases |
| Update Consistency | Strong—updates happen in one place | Weaker—same value may need updating in multiple rows or tables |
| Schema Maintenance | Easier to evolve—clear separation of concerns | Harder to maintain—changes can require sweeping updates or reprocessing |
| Use Cases | OLTP systems, normalized BI, flexible schema evolution, multi-tenant data lakes | OLAP workloads, dashboards, fixed-schema reporting, embedded analytics, query performance at scale |

When to Normalize: Best Practices

Use normalization when:

  • Your system handles frequent inserts/updates/deletes (e.g., user-facing applications, financial systems, logistics).

  • You need referential integrity (e.g., foreign keys that must point to valid entries).

  • You want to support schema flexibility—new entities, attributes, or relationships without changing dozens of tables.

  • Your team works with complex business logic where separating dimensions (e.g., Users, Products, Departments) is critical to avoiding anomalies.

Best Practice: Normalize by default unless there is a proven bottleneck in your query engine. This makes your schema easier to evolve, debug, and audit.

Best Practice: Normalize dimension tables (e.g., Customers, Products, Stores) even in reporting pipelines. These rarely change but support reuse and maintainability.

Example: In an ecommerce system, normalize orders, customers, products, and warehouses. This lets you independently update product pricing, customer addresses, or fulfillment status without risking conflicting data.
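The payoff is that a change becomes a single-row operation. A sketch, using hypothetical table, column, and value names consistent with the 2NF example above:

```sql
-- Reprice once; order lines reference the product by key, so nothing else changes
UPDATE products
SET unit_price = 12.99
WHERE product_id = 42;
```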

When to Denormalize: Best Practices

Use denormalization when:

  • You are building read-optimized systems where query latency must be sub-second.

  • Your analytics platform struggles with JOIN performance or high concurrency.

  • You are designing dashboards, embedded analytics, or pre-aggregated cubes for consumption at scale.

  • You want to reduce dependencies across services or teams—“just give me a flat table.”

Best Practice: Denormalize selectively. Identify performance-critical queries and denormalize only what's necessary.

Best Practice: Use materialized views or precomputed summaries instead of hardcoded denormalization—especially if your engine supports automatic view maintenance.

Best Practice: Keep source-of-truth tables normalized and create denormalized reporting layers downstream.

Example: For a marketing dashboard that breaks down ad impressions, clicks, and conversions by campaign, you may join AdEvents, Campaigns, and Users into a wide table or materialized view for real-time drilldowns. These tables can be rebuilt or refreshed without touching the source data model.
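For illustration, using StarRocks-style asynchronous materialized view syntax (table and column names are hypothetical, and clause details vary by version), the wide dashboard table can be declared once and refreshed automatically while the source tables stay normalized:

```sql
-- Precomputed, JOIN-heavy reporting layer on top of normalized sources
CREATE MATERIALIZED VIEW campaign_performance_mv
DISTRIBUTED BY HASH(campaign_id)
REFRESH ASYNC
AS
SELECT
    c.campaign_id,
    c.campaign_name,
    DATE_TRUNC('day', e.event_time)                              AS event_day,
    COUNT(*)                                                     AS impressions,
    SUM(CASE WHEN e.event_type = 'click' THEN 1 ELSE 0 END)      AS clicks,
    SUM(CASE WHEN e.event_type = 'conversion' THEN 1 ELSE 0 END) AS conversions
FROM ad_events AS e
JOIN campaigns AS c ON e.campaign_id = c.campaign_id
GROUP BY c.campaign_id, c.campaign_name, DATE_TRUNC('day', e.event_time);
```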

How StarRocks Changes the Equation

Historically, denormalization was a defensive strategy. Traditional query engines (like Hive, MySQL, or Redshift) struggled with JOIN performance, especially across large fact tables or highly concurrent workloads. Engineers would flatten data structures not because it was elegant, but because it was necessary.

Modern OLAP engines like StarRocks make that trade-off obsolete.

StarRocks capabilities that reduce the need for denormalization:

  • Vectorized execution engine: Processes rows in large batches and leverages SIMD instructions, accelerating joins, filters, and aggregations.

  • Cost-Based Optimizer (CBO): Automatically chooses efficient join orders and execution plans.

  • Colocate and bucket shuffle join support: Makes distributed joins scalable and low-latency (see the sketch after this list).

  • Materialized views: Allow JOIN-heavy queries to be precomputed with automatic refresh, preserving normalization.

  • Native support for normalized schemas: No need to pre-flatten data—StarRocks handles real-time JOINs even at high concurrency.
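As a sketch of the colocate join support mentioned above, a fact table and a dimension table bucketed on the same key can be placed in one colocation group so the join runs locally on each node. Table names are hypothetical; bucket counts, sort keys, and replication settings depend on your deployment:

```sql
-- Both tables share a bucketing key and a colocation group,
-- so joins on campaign_id avoid a network shuffle
CREATE TABLE campaigns (
    campaign_id   BIGINT,
    campaign_name VARCHAR(255)
)
DISTRIBUTED BY HASH(campaign_id) BUCKETS 8
PROPERTIES ("colocate_with" = "campaign_group");

CREATE TABLE ad_events (
    event_id    BIGINT,
    campaign_id BIGINT,
    event_type  VARCHAR(32),
    event_time  DATETIME
)
DISTRIBUTED BY HASH(campaign_id) BUCKETS 8
PROPERTIES ("colocate_with" = "campaign_group");
```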

Case Study: Demandbase

Demandbase migrated from ClickHouse to StarRocks and moved away from an architecture that required 49 denormalized clusters. With StarRocks, they preserved their normalized schema and gained the ability to JOIN fact and dimension tables in real time, reducing query complexity and boosting dashboard responsiveness.

Case Study: Vivo

Vivo’s BI workloads suffered under Presto and ClickHouse due to slow joins and manual denormalization. By adopting StarRocks, they retained normalized ORC tables and used materialized views for acceleration—achieving 3–5× faster query speeds with 65% lower latency for ad-hoc analytics.

Normalization should be your default strategy—it enforces clean data modeling, simplifies updates, and supports long-term evolution. But performance realities matter, especially at scale.

The modern approach is not to pick one or the other, but to normalize your source-of-truth and optimize your query path. Use engines like StarRocks to reclaim performance without sacrificing design principles. And when you do need to denormalize, do it surgically—with materialized views, not ad hoc pipelines.

In today’s landscape, thanks to advanced OLAP engines, you don’t have to choose between “fast” and “clean.” You can—and should—have both.


Final Thoughts

Normalization is more than a theoretical construct—it’s the backbone of maintainable, scalable, and logically sound data systems. By enforcing rules that eliminate redundancy and clarify data dependencies, normalization reduces the likelihood of anomalies, makes updates safer, and helps developers manage growing schema complexity over time.

Yet the real-world practice of normalization always intersects with the realities of performance. The more normalized your schema becomes, the more your queries depend on JOIN performance, query planning, and indexing strategy. Historically, that’s where systems hit a wall—forcing teams to denormalize, flatten, or precompute data simply to meet SLA demands for speed.

This is where modern engines like StarRocks mark a turning point. With native support for normalized schemas, vectorized execution, intelligent JOIN planning, and automatically maintained materialized views, StarRocks lets teams preserve the correctness and flexibility of normalized data while still delivering sub-second performance—even at scale.

The takeaway is simple:
Normalize for correctness, denormalize for speed—but only when your engine demands it.
And with a high-performance OLAP engine like StarRocks, you might find that normalization and speed are no longer at odds.

 

Frequently Asked Questions (FAQ)

 

Q1. What is the main goal of normalization?

Normalization aims to reduce data redundancy and improve data integrity by organizing data into multiple related tables. Each table represents one subject or concept and is linked to others through keys. This makes data easier to update, maintain, and scale without inconsistency or duplication.

Q2. What problems does normalization solve?

Normalization helps prevent:

  • Update anomalies: Inconsistent updates across redundant data.

  • Insert anomalies: Inability to add data without unnecessary dependencies.

  • Delete anomalies: Unintended data loss when deleting records.

It ensures each data point is stored only once and linked via foreign keys—improving logical coherence and reducing maintenance risk.

Q3. When should I avoid normalization?

Normalization introduces JOINs—which can be expensive if your engine struggles with them. You may want to avoid deep normalization if:

  • You’re building a read-heavy analytics workload with strict latency goals

  • You lack a performant engine or materialized view system

  • You need to serve wide, flat tables to non-technical consumers (e.g., embedded dashboards)

In such cases, selective denormalization or pre-aggregation might make sense—especially for performance-critical queries.

Q4. Does StarRocks support normalized schemas?

Yes. StarRocks is designed to query normalized schemas efficiently. It supports:

  • High-performance JOINs (including broadcast, colocate, and shuffle joins)

  • Cost-Based Optimization (CBO) for choosing optimal join orders

  • Materialized views that preserve logical structure while improving performance

  • Real-time updates and federated access to Iceberg tables without flattening

This makes StarRocks well-suited for real-time analytics on normalized data.

Q5. What are best practices for schema normalization?

  • Always normalize dimension tables (e.g., Products, Users, Locations) to support reuse and update safety.

  • Normalize fact tables to the extent that performance allows; avoid unnecessary joins across massive transactional records unless your engine handles them well.

  • Use surrogate keys for clarity when natural keys are composite or subject to change (see the sketch after this list).

  • Normalize first, then denormalize selectively based on query patterns and engine behavior.
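As an example of the surrogate-key advice above, a generated key can stand in for a composite natural key while the natural key stays enforced as unique. A MySQL-flavored sketch with hypothetical names:

```sql
-- Surrogate key for joins; the natural key stays unique for integrity
CREATE TABLE stores (
    store_id     BIGINT AUTO_INCREMENT PRIMARY KEY,         -- surrogate key
    country_code CHAR(2) NOT NULL,
    store_number INT     NOT NULL,
    UNIQUE KEY uk_store_natural (country_code, store_number) -- natural key
);
```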

Q6. What’s the difference between 3NF and BCNF?

  • 3NF removes transitive dependencies: non-key attributes cannot depend on other non-key attributes.

  • BCNF tightens the rule: every determinant must be a candidate key.

While all BCNF tables are in 3NF, not all 3NF tables satisfy BCNF. The distinction matters most when your schema involves overlapping or alternate candidate keys.

Q7. How does normalization impact performance?

Normalization increases the number of tables and JOINs in queries. On traditional engines, this can slow things down. On modern MPP and vectorized systems like StarRocks, JOIN performance is optimized—making normalization viable without sacrificing speed.

To improve performance:

  • Use materialized views for frequently joined data

  • Ensure appropriate indexes and distribution strategies (e.g., colocated joins)

  • Monitor query plans using EXPLAIN to detect inefficient paths
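For example, reusing the 3NF tables from earlier (plan output format differs by engine; StarRocks prints its own plan structure):

```sql
-- Inspect the join strategy and scan paths before optimizing
EXPLAIN
SELECT d.DepartmentName, COUNT(*) AS headcount
FROM Employees AS e
JOIN Departments AS d ON e.DepartmentID = d.DepartmentID
GROUP BY d.DepartmentName;
```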

Q8. Can I normalize data stored in a data lake?

Yes. Modern table formats like Apache Iceberg and Delta Lake support normalized schemas across partitioned tables. Engines like StarRocks or Trino can query these schemas efficiently with proper metadata caching and vectorized execution.

For example, TRM Labs normalized blockchain event data across Iceberg tables (e.g., chains, addresses, events) and queried them in real time using StarRocks—achieving high performance without denormalization.

Q9. Is denormalization a bad practice?

Not inherently. Denormalization is a tool, not a mistake—but it comes with trade-offs:

  • Redundant data = more storage + risk of inconsistency

  • Harder to update or backfill logic

  • Hard-coded business logic baked into table structures

It’s best used strategically in reporting layers, not in core data models.

Q10. What’s the best way to balance normalization and performance?

  • Keep your source-of-truth normalized

  • Use materialized views or aggregation tables to optimize high-frequency queries

  • Choose an engine (like StarRocks) that supports real-time JOINs and vectorized processing

  • Monitor query patterns and apply denormalization only where performance demands it

This layered approach gives you correctness, flexibility, and speed—without locking yourself into rigid schema designs or unnecessary duplication.