Normalization
What is Normalization
Data normalization, also known as database normalization, is the process of organizing a database to reduce data redundancy. Put simply, normalization ensures that every field and record is stored in a logical place, which not only avoids redundancy but also makes a relational database more efficient to use: it helps prevent data entry errors and accidental deletions, and it simplifies updates. The idea is easy to grasp, but applying it is more involved than it first appears, because normalization follows specific rules that dictate how a database should be organized.
Benefits of Normalization
Normalization protects against several common anomalies. Consider a music streaming service's database (a short code sketch follows this list):
- Update Anomalies: If a song's details change and that song is listed in multiple places in the database, every instance needs to be updated; any oversight leads to inconsistent data.
- Insertion Anomalies: Adding a new artist without songs can be problematic if the design mandates that every artist must have associated songs.
- Deletion Anomalies: Deleting a user's playlist shouldn't imply deleting the user's entire profile, but poor design might inadvertently link these actions.
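To make the update anomaly concrete, here is a minimal sketch using Python's built-in sqlite3 module; the `tracks` table and its columns are hypothetical names chosen only for illustration:

```python
import sqlite3

# Denormalized table: the artist's email is repeated on every track row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (
    track_id     INTEGER PRIMARY KEY,
    title        TEXT,
    artist_name  TEXT,
    artist_email TEXT   -- duplicated for every track by the same artist
);
INSERT INTO tracks VALUES
    (1, 'Song A', 'Alice', 'alice@old.example'),
    (2, 'Song B', 'Alice', 'alice@old.example');
""")

# Update anomaly: only one of the two copies receives the new email.
conn.execute("UPDATE tracks SET artist_email = 'alice@new.example' WHERE track_id = 1")

# The same artist now has two different emails -- the data is inconsistent.
for name, email in conn.execute("SELECT artist_name, artist_email FROM tracks"):
    print(name, email)
```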
Stages of Normalization
First Normal Form (1NF)
- Each field holds only one atomic value, meaning a single, indivisible value.
- No two rows are exact duplicates of each other.
- The data table has a primary key, and all other fields depend on it.
Before Normalization (Not in 1NF)
| Student ID | Subjects |
|---|---|
| 1 | Math, English |
| 2 | Science, English |
| 3 | Math, Science |
In the table above, the "Subjects" column contains multiple values (not atomic), so it is not in 1NF.
After Normalization (In 1NF)
| Student ID | Subject |
|---|---|
| 1 | Math |
| 1 | English |
| 2 | Science |
| 2 | English |
| 3 | Math |
| 3 | Science |
In this normalized table, each column contains only atomic values, satisfying the conditions of 1NF. We have split the "Subjects" column into individual rows, so each row represents a single, indivisible data point.
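As a sketch of how the normalized table might be declared, a composite primary key keeps each value atomic and rules out exact duplicate rows (the table and column names below are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per student/subject pair: every column holds a single atomic value.
CREATE TABLE student_subjects (
    student_id INTEGER,
    subject    TEXT,
    PRIMARY KEY (student_id, subject)   -- also rejects exact duplicate rows
);
INSERT INTO student_subjects VALUES
    (1, 'Math'), (1, 'English'),
    (2, 'Science'), (2, 'English'),
    (3, 'Math'), (3, 'Science');
""")
```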
Second Normal Form (2NF)
- Meets all the requirements of 1NF.
- Additionally, all non-key attributes (columns) are functionally dependent on the entire primary key, not on just part of it.
Before Normalization (Not in 2NF)
Let's assume we have a database of a store's sales records:
| InvoiceID | ProductID | ProductName | Quantity | UnitPrice | TotalPrice | CustomerName |
|---|---|---|---|---|---|---|
| 101 | 1 | ProductA | 2 | 10 | 20 | Alice |
| 101 | 2 | ProductB | 1 | 20 | 20 | Alice |
| 102 | 1 | ProductA | 3 | 10 | 30 | Bob |
In this table, the combination of InvoiceID and ProductID forms the primary key. However, we can see that ProductName and UnitPrice are dependent on ProductID, and CustomerName is dependent on InvoiceID, not the entire primary key, so the table is not in 2NF.
After Normalization (In 2NF)
To normalize this table into 2NF, we need to remove partial dependencies by splitting the table into multiple tables, so that all attributes in each table are functionally dependent on the entire primary key.
Invoice Table
| InvoiceID | CustomerName |
|---|---|
| 101 | Alice |
| 102 | Bob |
Product Table
| ProductID | ProductName | UnitPrice |
|---|---|---|
| 1 | ProductA | 10 |
| 2 | ProductB | 20 |
Sales Table
| InvoiceID | ProductID | Quantity | TotalPrice |
|---|---|---|---|
| 101 | 1 | 2 | 20 |
| 101 | 2 | 1 | 20 |
| 102 | 1 | 3 | 30 |
Now, the Product Table contains information solely about products, functionally dependent on ProductID, and the Invoice Table contains information about invoices, dependent on InvoiceID. The Sales Table links products with invoices, with all attributes functionally dependent on the composite primary key (InvoiceID and ProductID). This database structure is in 2NF.
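A minimal sketch of this 2NF schema, expressed through Python's sqlite3 module; the foreign-key references are an assumption added here to make the relationships explicit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (
    invoice_id    INTEGER PRIMARY KEY,
    customer_name TEXT                    -- depends only on the invoice
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    unit_price   REAL                     -- depends only on the product
);
CREATE TABLE sales (
    invoice_id  INTEGER REFERENCES invoices(invoice_id),
    product_id  INTEGER REFERENCES products(product_id),
    quantity    INTEGER,
    total_price REAL,
    PRIMARY KEY (invoice_id, product_id)  -- every column depends on the full key
);
""")
```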
Third Normal Form (3NF)
- Meets all the requirements of 2NF.
- Non-prime attributes must be dependent on the primary key alone, not on other non-prime attributes. This stage further removes transitive dependencies.
Before Normalization (Not in 3NF)
Consider we have a database that stores information about employees and their departments:
| EmployeeID | EmployeeName | DepartmentID | DepartmentName | DepartmentLocation |
|---|---|---|---|---|
| 1 | Alice | 100 | HR | Building A |
| 2 | Bob | 100 | HR | Building A |
| 3 | Charlie | 200 | IT | Building B |
In this table, the primary key is EmployeeID. However, DepartmentName and DepartmentLocation are transitively dependent on EmployeeID through DepartmentID. Therefore, the table is not in 3NF.
After Normalization (In 3NF)
To normalize this table to 3NF, we need to remove the transitive dependencies by splitting the table into more tables, so that all attributes in each table are functionally dependent only on the primary key.
Employee Table
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 1 | Alice | 100 |
| 2 | Bob | 100 |
| 3 | Charlie | 200 |
Department Table
| DepartmentID | DepartmentName | DepartmentLocation |
|---|---|---|
| 100 | HR | Building A |
| 200 | IT | Building B |
Now, in the Employee Table, every attribute is functionally dependent only on the primary key, EmployeeID, and in the Department Table, every attribute is functionally dependent only on the primary key, DepartmentID. This database structure is in 3NF, as it has no transitive dependencies, and every non-key attribute is functionally dependent only on the primary key.
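A sketch of the 3NF schema under the same assumptions (sqlite3, with an assumed foreign key linking the two tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    department_id       INTEGER PRIMARY KEY,
    department_name     TEXT,
    department_location TEXT   -- depends only on the department
);
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    employee_name TEXT,
    department_id INTEGER REFERENCES departments(department_id)
    -- department name and location are no longer stored here, so there is
    -- no transitive dependency through department_id
);
""")
```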
Boyce-Codd Normal Form (BCNF)
BCNF takes the principles of 3NF a step further, imposing stricter conditions to ensure the structural integrity of your database. Here's a simple guide to determining when your database meets the BCNF standard:
- Single Column Primary Key: If your primary key consists of just one column, you're in luck! Once your database satisfies the 3NF rules, it automatically complies with BCNF as well.
- Composite Primary Key: Things get a bit more complex if your primary key is composed of several columns. In this case, to meet BCNF criteria, you must:
  - Ensure your database meets all 3NF requirements.
  - Ensure that columns within the primary key do not depend on columns that are not part of the primary key.
Now let's move on to creating an example that illustrates these rules:
Before Normalization (Not in BCNF)
Consider a database table storing information about courses and the professors teaching them in different semesters:
| CourseID | ProfessorID | Semester | ProfessorName |
|---|---|---|---|
| 101 | 1 | Spring | Dr. Smith |
| 102 | 2 | Fall | Dr. Johnson |
| 101 | 1 | Fall | Dr. Smith |
In this table, the combination of CourseID, ProfessorID, and Semester forms the composite primary key. However, ProfessorName is dependent on ProfessorID, which is only a part of the composite primary key, not the entire key, so the table is not in BCNF.
After Normalization (In BCNF)
We can normalize this table into BCNF by removing the dependency of ProfessorName on only part of the composite primary key. We split the table into two:
Professor Table
| ProfessorID | ProfessorName |
|---|---|
| 1 | Dr. Smith |
| 2 | Dr. Johnson |
Course Table
| CourseID | ProfessorID | Semester |
|---|---|---|
| 101 | 1 | Spring |
| 102 | 2 | Fall |
| 101 | 1 | Fall |
Now, in the Professor Table, ProfessorName depends only on ProfessorID, and in the Course Table, all attributes are dependent on the composite primary key (CourseID, ProfessorID, and Semester), thus meeting the criteria for BCNF.
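And a sketch of the BCNF version, again via sqlite3 with an assumed foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE professors (
    professor_id   INTEGER PRIMARY KEY,
    professor_name TEXT                  -- determined by professor_id alone
);
CREATE TABLE courses (
    course_id    INTEGER,
    professor_id INTEGER REFERENCES professors(professor_id),
    semester     TEXT,
    PRIMARY KEY (course_id, professor_id, semester)
);
""")
```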
Normalization vs. Denormalization
While normalization structures databases to minimize data redundancy, it can scatter related data across many separate tables. For example, retrieving all courses taught by a specific professor along with the corresponding classroom details requires a JOIN across the "Professors", "Courses", and "Classrooms" tables. As the number of tables involved grows, such operations can considerably slow query responses.
Denormalization addresses this issue by strategically reintroducing redundancy to reduce the need for JOIN operations at query time. This speeds up certain read operations and improves performance for specific workloads. In practice it means merging tables, or adding redundant columns to existing ones, so that data retrieval is simpler. However, denormalization comes with hidden costs, including increased storage requirements and the added complexity of keeping the duplicated data consistent.
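To illustrate the trade-off, compare the shape of the two queries below; the Professors, Courses, and Classrooms schema, as well as the denormalized CoursesDenormalized table, are hypothetical:

```python
# Normalized design: answering the question requires joining three tables.
normalized_query = """
SELECT c.course_name, r.room_number
FROM Professors p
JOIN Courses    c ON c.professor_id = p.professor_id
JOIN Classrooms r ON r.classroom_id = c.classroom_id
WHERE p.professor_name = 'Dr. Smith';
"""

# Denormalized design: professor and classroom details are copied onto each
# course row, so the same question becomes a single-table lookup -- faster to
# read, but every copy must be updated when a professor or room changes.
denormalized_query = """
SELECT course_name, room_number
FROM CoursesDenormalized
WHERE professor_name = 'Dr. Smith';
"""
```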
The choice between normalization and denormalization isn't straightforward; it demands a nuanced comprehension of the unique workload, query dynamics, and performance indicators pertinent to the database. This knowledge facilitates the crafting of a balanced database design that harmonizes efficiency and data integrity.