Data normalization, also known as database normalization, is a process of organizing and building databases to reduce data redundancy. Simply put, the process of database normalization ensures that each field and record is logical, which not only helps to avoid redundancy but also makes the use of any relational database more efficient: preventing any data input errors, accidental deletions, and facilitating the data update process. Understanding data normalization is very simple, but this process is more complex than it appears. Data normalization follows specific rules that dictate how a database should be organized.
Student ID | Subjects |
---|---|
1 | Math, English |
2 | Science, English |
3 | Math, Science |
In the table above, the "Subjects" column contains multiple values (not atomic), so it is not in 1NF.
Student ID | Subject |
---|---|
1 | Math |
1 | English |
2 | Science |
2 | English |
3 | Math |
3 | Science |
In this normalized table, each column contains only atomic values, satisfying the conditions of 1NF. We have split the "Subjects" column into individual rows, so each row represents a single, indivisible data point.
Second Normal Form (2NF)
Let's assume we have a database of a store's sales records:
InvoiceID | ProductID | ProductName | Quantity | UnitPrice | TotalPrice | CustomerName |
---|---|---|---|---|---|---|
101 | 1 | ProductA | 2 | 10 | 20 | Alice |
101 | 2 | ProductB | 1 | 20 | 20 | Alice |
102 | 1 | ProductA | 3 | 10 | 30 | Bob |
In this table, the combination of InvoiceID
and ProductID
forms the primary key. However, we can see that the ProductName
and UnitPrice
are dependent on ProductID
, and CustomerName
is dependent on InvoiceID
, not the entire primary key, so the table is not in 2NF.
To normalize this table into 2NF, we need to remove partial dependencies by splitting the table into multiple tables, so that all attributes in each table are functionally dependent on the entire primary key.
Invoice Table
InvoiceID | CustomerName |
---|---|
101 | Alice |
102 | Bob |
Product Table
ProductID | ProductName | UnitPrice |
---|---|---|
1 | ProductA | 10 |
2 | ProductB | 20 |
Sales Table
InvoiceID | ProductID | Quantity | TotalPrice |
---|---|---|---|
101 | 1 | 2 | 20 |
101 | 2 | 1 | 20 |
102 | 1 | 3 | 30 |
Now, the Product Table
contains information solely about products, functionally dependent on ProductID
, and the Invoice Table
contains information about invoices, dependent on InvoiceID
. The Sales Table
links products with invoices, with all attributes functionally dependent on the composite primary key (InvoiceID
and ProductID
). This database structure is in 2NF.
Third Normal Form (3NF)
Before Normalization (Not in 3NF)
Consider we have a database that stores information about employees and their departments:
EmployeeID |
EmployeeName |
DepartmentID |
DepartmentName |
DepartmentLocation |
---|---|---|---|---|
1 |
Alice |
100 |
HR |
Building A |
2 |
Bob |
100 |
HR |
Building A |
3 |
Charlie |
200 |
IT |
Building B |
In this table, the primary key is EmployeeID
. However, DepartmentName
and DepartmentLocation
are transitively dependent on EmployeeID
through DepartmentID
. Therefore, the table is not in 3NF.
After Normalization (In 3NF)
To normalize this table to 3NF, we need to remove the transitive dependencies by splitting the table into more tables, so that all attributes in each table are functionally dependent only on the primary key.
Employee Table
EmployeeID |
EmployeeName |
DepartmentID |
---|---|---|
1 |
Alice |
100 |
2 |
Bob |
100 |
3 |
Charlie |
200 |
Department Table
DepartmentID |
DepartmentName |
DepartmentLocation |
---|---|---|
100 |
HR |
Building A |
200 |
IT |
Building B |
Now, in the Employee Table
, every attribute is functionally dependent only on the primary key, EmployeeID
, and in the Department Table
, every attribute is functionally dependent only on the primary key, DepartmentID
. This database structure is in 3NF, as it has no transitive dependencies, and every non-key attribute is functionally dependent only on the primary key.
Now let's move on to creating an example that illustrates these rules:
Before Normalization (Not in BCNF)
Consider a database table storing information about courses and the professors teaching them in different semesters:
CourseID | ProfessorID | Semester | ProfessorName |
---|---|---|---|
101 | 1 | Spring | Dr. Smith |
102 | 2 | Fall | Dr. Johnson |
101 | 1 | Fall | Dr. Smith |
In this table, the combination of CourseID
, ProfessorID
, and Semester
form the composite primary key. The ProfessorName
is dependent on ProfessorID
, which is a part of the composite primary key, but not on the entire primary key, hence it's not in BCNF.
After Normalization (In BCNF)
We can normalize this table into BCNF by removing the partial dependency of ProfessorName
on the part of the composite primary key. We split the table into two:
Professor Table
ProfessorID | ProfessorName |
---|---|
1 | Dr. Smith |
2 | Dr. Johnson |
Course Table
CourseID | ProfessorID | Semester |
---|---|---|
101 | 1 | Spring |
102 | 2 | Fall |
101 | 1 | Fall |
Now, in the Professor Table
, the ProfessorName
is only dependent on ProfessorID
, and in the Course Table
, all attributes are dependent on the composite primary key (CourseID
, ProfessorID
, and Semester
), thus meeting the criteria for BCNF.
While normalization structures databases to minimize data redundancy, it can scatter related data across numerous separate tables. Retrieving all courses taught by a specific professor along with the respective classroom details necessitates a JOIN operation encompassing the "Professors", "Courses", and "Classrooms" tables. As the number of involved tables increases, such operations can considerably decelerate query responses.
Denormalization addresses this issue, strategically incorporating redundancy to diminish the necessity for JOIN operations during query executions. This accelerates certain read operations, enhancing database performance for specific tasks. This methodology involves amalgamating tables or augmenting existing ones with redundant data, aiming to streamline data retrieval processes. However, it's important to note that denormalization comes with hidden costs, including increased storage requirements and potential complexities in data maintenance and consistency.
The choice between normalization and denormalization isn't straightforward; it demands a nuanced comprehension of the unique workload, query dynamics, and performance indicators pertinent to the database. This knowledge facilitates the crafting of a balanced database design that harmonizes efficiency and data integrity.