Do you know about Data Normalization ??
Database Normalization with SQL
Hey SQL enthusiast
This
is a guide on Database Normalization and how to implement normalization using
SQL, let's get started.
What is Database
Normalization
Normalization is a technique used for evaluating and reorganizing table
structures to minimize redundancies, improve data integrity, improve storage
efficiency, and reduce the need to re-design the database if new data is
introduced. Database normalization removes data inconsistencies which may cause
the analysis of our data to be more complicated. These inconsistencies could
come from updating records, inserting, or deleting records, which is known as
update anomaly, insertion, and deletion anomaly, normalization also includes
the removal of duplicate records which in turn saves storage.
The figure below shows the stages of Database normalization.
Let's take a look at the first normal form, what does it take for a
database table to be in the first normal form. Here are the rules for a table
to be in first normal form (1NF)
- Each column should contain
an atomic value
- A column should contain
data of the same data type eg (if a column contains an integer all the
other elements in the column should be an integer)
- Each column should have a
unique column name
- The order in which data are
stored in the columns does not matter.
let's take a look at a table to better understand the concept of 1NF
Records Table Not Normalize
As you can see from the above table there are columns with duplicate
elements in the database table, from the last row, the columns are (student
major Std_major, courses title CTitle, and Instructor location Ilocation).
to solve this problem we are going to create a new table and insert only one
atomic value into every column in the table.
Here is the code used to create the table and insert the element into the table from the original table.
reating a new table for the schools Record 1NF
CREATE TABLE Schools_Record_1NF (Std_ID VARCHAR(50) NOT NULL,
Std_Name VARCHAR(50),
Std_address VARCHAR(50),
Std_Major VARCHAR(50),
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
Ilocation VARCHAR(100),
Grade CHAR(5))
--Inserting into the Schools Record 1NF table
INSERT INTO Schools_Record_1NF(Std_ID, Std_Name,
Std_address, Std_Major,
CID, CTitle, Iname,
Ilocation, Grade)
--Using Case statement to deal with columns with duplicate entries
SELECT Std_ID, Std_Name, Std_address,
-- The CASE Statement here checks the first entry in the column Using (SUBSTR AND INSTR)
CASE WHEN
SUBSTR(Std_Major, 1, INSTR(Std_Major, ',')-1)='' THEN Std_Major
ELSE SUBSTR(Std_Major,1,INSTR(Std_Major, ',')-1)
END AS Std_Major,
CASE WHEN
SUBSTR(CID,1,INSTR(CID, ' ')-1)='' THEN CID
ELSE SUBSTR(CID,1,INSTR(CID, ' ')-1)
END AS CID,
CASE WHEN
SUBSTR(CTitle,1,INSTR(CTitle,',')-1)='' THEN CTitle
ELSE SUBSTR(CTitle,1,INSTR(CTitle,',')-1)
END AS CTitle,
CASE WHEN
SUBSTR(Iname,1,INSTR(Iname,',')-1)='' THEN Iname
ELSE SUBSTR(Iname,1,INSTR(Iname,',')-1)
END AS Iname,
CASE WHEN
SUBSTR(Ilocation,1,INSTR(Ilocation, ',')-1)='' THEN Ilocation
ELSE SUBSTR(Ilocation,1,INSTR(Ilocation, ',')-1)
END AS Ilocation,
CASE WHEN
SUBSTR(Grade,1,INSTR(Grade, ',')-1)='' THEN Grade
ELSE SUBSTR(Grade,1,INSTR(Grade, ',')-1)
END AS Grade
FROM School_Record
-- Using the union operator to join the first and the second entry of the table
UNION
SELECT Std_ID, Std_Name, Std_address,
-- Using SUBSTR and INSTR to access the second entry inside the columns
SUBSTR(Std_Major,INSTR(Std_Major,',')+1) As Std_Major,
SUBSTR(CID, INSTR(CID,' ')+1) AS CID,
SUBSTR(CTitle, INSTR(CTitle, ',')+1) AS CTitle,
SUBSTR(Iname, INSTR(Iname, ',')+1) AS Iname,
SUBSTR(Ilocation, INSTR(Ilocation, ',')+1) AS Ilocation,
SUBSTR(Grade, INSTR(Grade, ',')+1) AS Grade
FROM School_Record
-- Checking the columns with multiple entries
WHERE Std_Major LIKE "%,%" or CID LIKE "% %"
Here is the new table after applying the above code
Records Table in 1NF
one new row is added to the table
rather than having columns with non-atomic values.
Next up let's move into the second normal form (2NF)
The Second Normal Form (2NF)
Table not in 2NF
The above table is not in 2NF because not all the columns in the table
are depending on the primary keys of the table. e.g. columns like the student
name Std_name only depends on the student Id Std_ID column
which is one part of the primary key since both Std_ID and CID are
the primary keys in this case and also courses title only depends on courses
Id CID without depending on student Id Std_ID which means
we are having a partial functional dependency.
To Convert a table to Second Normal Form
- The table must be in 1NF
- A table in 2NF should not
have a partial functional dependency.
What is a partial functional dependency?
let's understand Functional dependency first
Dependency in a table is when the columns in the table are depending on a
particular column like the primary key of a table.
Partial functional dependency is when a particular column is not full depending
on all the primary key columns of the table (in this case our primary key is
more than one column).
To implement the second normal form 2NF we will start by
creating new tables one for the student information, instructor information,
and another table for the student grade information.
--Creating the Tables for the Second Normal Form 2NF
CREATE TABLE Student_Info_2NF (
Std_ID VARCHAR(50) NOT NULL,
Std_Name VARCHAR(50),
Std_address VARCHAR(50),
Std_Major VARCHAR(50),
PRIMARY KEY(Std_ID));
CREATE TABLE Instructor_Info_2NF(
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
Ilocation VARCHAR(100),
PRIMARY KEY(CID));
CREATE TABLE Grade_Info_2NF(
Std_ID VARCHAR(50) NOT NULL,
CID VARCHAR NOT NULL,
Grade CHAR(5),
PRIMARY KEY(Std_ID, CID),
FOREIGN KEY(Std_ID) REFERENCES Student_Info_2NF(Std_ID) ,
FOREIGN KEY(CID) REFERENCES Instructor_Info_2NF(CID));
students Information table
Instructor information table
students grade information table
Next, we will be moving on to the 3NF
Third Normal Form 3NF
For Relation to be in 3NF the Relation First must be in 1NF and 2NF
- They should be no transitive
dependencies
Well, transitive dependency simply means that all other
columns in the table should only depend on the PRIMARY KEY Column only and
primary key alone.
Based on the tables we currently have,
there is only one table that is showing an issue which is the instructor
information table because the Primary key here is the course id (CID) and the
instructor location can be determined by the CID and it can also be
determined by the instructor's name Iname column, so to solve this
problem we are going to create a new table for the Instructor information and
another table with the courses information.
here is the code to create the table and insert the data into the table
from the Instructor information table above.
CREATE TABLE Instructor_Info_3NF (
Iname VARCHAR(50) NOT NULL,
Ilocation VARCHAR(100),
PRIMARY KEY (Iname));
CREATE TABLE Courses_Info_3NF(
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
PRIMARY KEY(CID)
FOREIGN KEY(Iname) REFERENCES Instructor_Info_3NF(Iname));
--Inserting into the Instructor_Info_3NF
INSERT INTO Instructor_Info_3NF(Iname, Ilocation)
SELECT Iname, Ilocation
FROM Instructor_Info_2NF;
--Inserting into the Courses_Info_3NF
INSERT INTO Courses_Info_3NF(CID, CTitle, Iname )
SELECT CID, CTitle, Iname
FROM Instructor_Info_2NF;
Here are the new tables
Instructor information table 3NF
Courses information table 3NF
We are going to stop on the 3NF
Conclusion
We have now completed the normalization process, In general, most Databases are
in 3NF, in this post you've understood what normalization is and also learned
how to implement normalization using SQL.
Hope this blog helps you understand Database normalization. If you enjoy
my blog please don't forget to like or comment on the article. It will
encourage me to write more articles.
- Alsadiq Abubakar
- Mar, 27 2022