Normalization is a set of guidelines to helps us improve database design and thus maintain quality data.
When a database is designed, there are several objectives related to maintaining good data:
- Reduce data redundancy, or the accidental duplication of data
- Prevent invalid or corrupt data
- Reduce "anomolies" or problems with inserting, updating and deleting records
- Improving the efficiency of the database
An anomaly is defined as "an outcome that is unusual or unexpected". In reference to database design and management, we use "anomaly" to describe situations where maintaining data is overly difficult or may introduce errors in the database.
What is Normalization?
A good definition (emphasis below is mine):
"Database normalization is typically a refinement process after the initial exercise of identifying the data objects that should be in the relational database, identifying their relationships and defining the tables required and the columns within each table”"
Two things to notice:
- the word “process”, meaning something that is often cyclical and does not always have a clear beginning or end, and
- word “after”, meaning an entity analysis and transition to creating the database often happens before any review for normalization happens
Normalization as Rules or Levels of Good Design
The concept and theory of "normalization" goes back to the 1970's and a researcher named E.F Codd. Data scientists still wrestle and try to improve on the overall science of how databases should be designed and work.
The idea of normalization is conceptualized in terms of "normal forms". These can also be thought of as "levels" of proper and good design.
The first level of good design is called First Normal Form, or 1NF.
The objectives of 1NF is to make sure the database has basic tables that do not have glaring problems. We normally look for the following:
- Objects are represented by tables and columns
- No column contains multiple values
- No set of columns are used to represent the same value(s)
- Duplicate table rows are eliminated
Remember that normalization is a process. All databases may be slightly different and present different issues, so normalizing (improving) the database design is a step-by-step process, and may require different steps in each case.
Also, realize that in most IT endeavors - including database design - there is no single, correct answer. There may many good or better answers, and some bad or wrong answers, but there may be multiple excellent and correct answers.
Let’s consider a simple example. Let's says Sasha has started a private school. She has started keeping track of students and classes in a spreadsheet. Here is what her data looks like so far:
|Anna||Math 100, Eng 100, Geo 100||Mr. Jones, Mrs. Sayed, Mr. Wilson||T 1:00, W 2:00, R 9:00||1st Year|
|Ben||Eng 200, Geo 100, Spanish 100||Mrs. Sayed, Mr. Wilson, Ms. Lopez||R 2:20, R 9:00, W 10:00||2nd Year|
|Casper||Geo 100, Math 200 Spanish 100||Mr. Wilson, Mrs. Sayed, Ms. Lopez||R 9:00 T 3:00 W 10:00||2nd Year|
|Drew||Eng 100, Math 100 Spanish 100||Mrs. Sayed, Mr. Jones, Ms. Lopez||W 2:00 T 1:00 W 10:00||1st Year|
|Evelyn||Eng 100, Math 200 Spanish 100||Mrs. Sayed, Mr. Jones, Ms. Lopez||W 2:00 T 3:00 W 10:00||1st Year|
It is a bit difficult to read, but it accurately reflects the schedules of the students (and teachers).
What if we need to change the class meeting time of Eng 100 to Monday at 11:00 AM? How many records do we have to change? How would we do it?
We can see that having to change a class meeting time is a difficult thing to do. If we had hundreds of students, it might be a very long process.
When it is very difficult to accurately update records, we call it an Update Anomaly. That means it very difficult to update records, or that updating data may cause errors and data corruption. What if we updated some records for Anna, Ben and Drew and not Evelyn? If someone then wanted to know “What time does Eng 100 meet?”, what would be the correct answer? How would we know?
What if we want to add a teacher that has not yet been assigned a class?
What if we want to add a course that does not yet have any students?
In both of these cases, there is a problem (an anomaly). It is not possible to add either a course or teacher, because the above list is centered on students. Teachers and courses are additional, associated bits of information, but we do not have a way to track teachers or classes outside of tracking students.
Let’s take a stab at rearranging Sasha’s data. The new table below is an obvious way to reorganize the same data:
|Student||Course1||Course 2||Course 3||Class||Grade|
|Anna||Math 100, Mr. Jones. T 1:00||Eng 100, Mrs. Sayed, W 2:00||Geo 100, Mr. Wilson, R 9:00||1st Year|
|Ben||Eng 200, Mrs. Sayed, R 2:20||Geo 100, Mr. Wilson, R 9:00||Spanish 100, Ms. Lopez, W 10:00||2nd Year|
|Casper||Geo 100, Mr. Wilson, R 9:00||Math 200 Mrs. Sayed, T 3:00||Spanish 100, Ms. Lopez, W 10:00||2nd Year|
|Drew||Eng 100, Mrs. Sayed, W 2:00||Math 100, Mr. Jones. T 1:00||Spanish 100, Ms. Lopez, W 10:00||1st Year|
|Evelyn||Eng 100, Mrs. Sayed, W 2:00||Math 200 Mrs. Sayed, T 3:00||Spanish 100, Ms. Lopez, W 10:00||1st Year|
This might be easier to read, and it accurately reflects the schedules of the students (and teachers). But does it remove any of the anomalies?
Not really. To update a single course, we would have to check three different columns for every student. It is still very difficult to update data.
The Insert Anomalies are the same.
We also have another problem. What if a student wants to take a fourth course? We could add a column, but what if someone wanted to take a fifth course? This design could get very messy.
How Do We Normalize (Improve) This Design?
We already know there are multiple entities to track for this database. We would like to add and track new teachers and new courses. We assume we would use entities to track teachers and courses.
Remember that there is not a one-to-one correspondence between entities and tables. Entities do not always become tables. Sometimes, an entity becomes more than a single table.
Anytime we review existing tables and see any of three conditions, it suggests we need more tables:
- Columns contain multiple values, such as “course1, course2 and course3”
- There are multiple columns that are really tracking the same thing, such as columns for Course1, Course2 and Course3
- The same string values are appearing in many records, such as “Eng 100” or “Mrs. Sayed”
Since we know we have additional entities, what would those look like? Stepping away from Sasha’s data, what would entities for Courses and Teachers look like?
At a minimum, we would probably need at least the following:
If we assume these will both become tables and add those to the database design, what data would these tables contain?
|101||Eng 100||W 2:00|
|102||Math 100||T 1:00|
|103||Geo 100||R 9:00|
|104||Span 100||W 10:00|
|105||Eng 200||R 2:00|
|106||Math 200||W 3:00|
How do these tables help? Do they eliminate the Insert Anomalies we identified originally? Yes. We can add a teacher and a class, independent of whether students have classes associated with either object.
Do these tables eliminate the Update Anomalies? Yes, at least for Teachers and Courses. We can update any single record easily and in one place. We will not accidently update other records or miss updating all the relevant records. If we change the time for a course, there is only a single place where that course-time exists, so we do not have the possibility of corrupt data (if our entry is correct), and we can always answer the questions definitively as to what time that course meets.
What about the original table? What should it look like now? Well, we haven’t solved anything there yet. If we create foreign key columns for teachers and courses, we still end up with problems.
|Student||Course1||Teacher1||Course 2||Teacher2||Course 3||Teacher3||Class||Grade|
This looks better, but we still have issues. We have multiple columns tracking the same data( Course1, Course2,Course3). The main one is that we are limited to three courses, and adding more becomes messy.
But by now we should have realized something else, too. We have another entity that we have not defined yet! There should be a conceptual entity for Schedules.
This type of entity is sometimes harder to recognize. But by looking at the table above, we can see that there are multiple sets of the same Course-Teacher combination. That suggest we need another table. If we have already recognized that we need another entity, anyway, we should define that.
What is more interesting is that the new Schedules entity does not really contain any data other than pointers or foreign keys to other entities.
If we consider what data might move to a new Schedules table and what would be left in a revised Students table, we might be surprised.
The first thing to admit is that this is much harder to read than anything else we had before. But, of course, we do not design a database for readability. We will use a front-end application to query the database and present the information in ways that are useful.
But we have eliminated the anomalies. Or have we? If we were not careful, we did not realize that our design is not yet entirely finished. We still have an update anomaly. What if we wanted to change the teacher for Geo 100 (id=103). There are three students assigned to that course. It is possible that we could change two instances of the courseid and not the last one.
We missed something. Teachers should not be assigned to
Schedules. The teacher for each course is always the same. Teachers should be
assigned to Courses. That eliminates the anomaly, and we realize that is the
The final versions of Courses and Schedules is below.
|101||Eng 100||W 2:00||2|
|102||Math 100||T 1:00||1|
|103||Geo 100||R 9:00||3|
|104||Span 100||W 10:00||4|
|105||Eng 200||R 2:00||2|
|106||Math 200||W 3:00||1|
Normalization often occurs after a database has been created. It is a process that may involve multiple passes and changes. The normalization “forms” are guidelines to help guide the design process.