72 CHAPTER 2: NORMALIZATION
section_id CHAR(1) NOT NULL,
grade CHAR(1) NOT NULL);
CREATE TABLE Students
(student_name CHAR (25) NOT NULL PRIMARY KEY,
major CHAR(10) NOT NULL);
A common misunderstanding about relational theory is that 3NF
tables have no transitive dependencies. As indicated above, if X
→
Y, X
does not have to be a key if Y is part of a candidate key. We still have a
transitive dependency in the example—(room_nbr, time_period)
→
(course_name, section_id)—but since the right side of the dependency
is a key, it is technically in 3NF. This table structure still has
unreasonable behavior, though: several courses can be assigned to the
same room at the same time.
Another form of transitive dependency is a computed column. For
example:
CREATE TABLE Stuff
(width INTEGER NOT NULL,
length INTEGER NOT NULL,
height INTEGER NOT NULL,
volume INTEGER NOT NULL
CHECK (width * length * height = volume),
PRIMARY KEY (width, length, height));
The volume column is determined by the other three columns, so any
change to one of the three columns will require a change to the volume
column. You can use a
VIEW
to create computed columns.
2.5 Elementary Key Normal Form (EKNF)
Elementary Key Normal Form (EKNF) is a subtle enhancement on 3NF.
By definition, EKNF tables are also in 3NF. This happens when there is
more than one unique composite key and they overlap. Such cases can
cause redundant information in the overlapping column(s). For
example, in the following table, let’s assume that a subject title is also a
unique identifier for a given subject in the following table:
CREATE TABLE Enrollment
(student_id INTEGER NOT NULL,
2.6 Boyce-Codd Normal Form (BCNF) 73
course_code CHAR(6) NOT NULL,
course_name VARCHAR(15) NOT NULL,
PRIMARY KEY (student_id, course_name)
, UNIQUE (student_id, course_name) alternative key
);
Enrollment
student_id course_name course_name
=======================================
1 'CS-100' 'ER Diagrams'
1 'CS-114' 'Database Design'
2 'CS-114' 'Database Design'
This table, although it is in 3NF, violates EKNF. The primary key of
the above table is the combination of (student_id, course_name).
However, we can also see an alternative key (student_id, course_name)
as well. The above schema could result in update and deletion
anomalies, because values of both course_name and course_name tend
to be repeated for a given subject.
The following schema is a decomposition of the above table to satisfy
EKNF:
CREATE TABLE Subjects
(course_name CHAR(6) NOT NULL PRIMARY KEY,
course_name VARCHAR(15) NOT NULL);
CREATE TABLE Enrollment
(student_id INTEGER NOT NULL,
course_name CHAR(6) NOT NULL,
PRIMARY KEY (student_id, course_name));
For reasons that will become obvious in the following section, most
designers do not ensure a table is in EKNF, as they will move directly on
to Boyce-Codd Normal Form after ensuring that a schema is in 3NF.
Thus, EKNF is included here only for reasons of historical accuracy and
completeness.
2.6 Boyce-Codd Normal Form (BCNF)
A table is in BCNF when for all nontrivial FDs (X
→
A), X is a superkey
for the whole schema. A superkey is a unique set of columns that
74 CHAPTER 2: NORMALIZATION
identify each row in a table, but you can remove some columns from it
and it will still be a key. Informally, a superkey is carrying extra weight.
BCNF is the normal form that actually removes all transitive
dependencies. A table is in BCNF if for all (X
→
Y), X is a key—period.
We can go to this normal form just by adding another key with
UNIQUE
(room, time_period) constraint clause to the table Classes.
There are some other interesting and useful “higher” normal forms,
but they are outside of the scope of this discussion. In our example, we
have removed all of the important anomalies with BCNF.
3NF was concerned with the relationship between key and nonkey
columns. However, a column can often play both roles. Consider a table
for computing salesmen’s bonus gifts, which has for each salesman his
base salary, the number of sales points he has won in a contest, and the
bonus gift awarded for that combination of salary range and points. For
example, we might give a fountain pen to a beginning salesman with a
base pay rate somewhere between $15,000 and $20,000 and 100 sales
points, but give a car to a master salesman whose salary is between
$30,000 and $60,000 and who has 200 points. The functional
dependencies are, therefore:
(paystep, points) -> gift
gift -> points
Let’s start with a table that has all the data in it and normalize it.
Gifts
Salary_amt points gift
==========================
15000 100 'Pencil'
17000 100 'Pen'
30000 200 'Car'
31000 200 'Car'
32000 200 'Car'
This schema is in 3NF, but it has problems. You cannot insert a new
gift into our offerings and points unless you have a salary to go with it. If
you remove any sales points, you lose information about the gifts and
salaries (e.g., only people in the $30,000 range can win a car). And,
finally, a change in the gifts for a particular point score would have to
affect all the rows within the same pay step. This table needs to be
broken apart into two tables:
2.7 Fourth Normal Form (4NF) 75
PayGifts
Salary_amt gift
=================
15000 'Pencil'
17000 'Pen'
30000 'Car'
31000 'Car'
32000 'Car'
GiftsPoints
gift points
===============
'Pencil' 100
'Pen' 100
'Car' 200
2.7 Fourth Normal Form (4NF)
Fourth Normal Form (4NF) makes use of multivalued dependencies.
The problem it solves is that the table has too many of them. For
example, consider a table of departments, their projects, and the parts
they stock. The MVDs in the table would be:
dept_name ->> jobs
dept_name ->> parts
Assume that dept_name ‘d1’ works on jobs ‘j1’ and ‘j2’ with parts ‘p1’
and ‘p2’; that dept_name ‘d2’ works on jobs ‘j3’, ‘j4’, and ‘j5’ with parts
‘p2’ and ‘p4’; and that dept_name ‘d3’ works only on job ‘j2’ with parts
‘p5’ and ‘p6’. The table would look like this:
dept job part
=================
'd1' 'j1' 'p1'
'd1' 'j1' 'p2'
'd1' 'j2' 'p1'
'd1' 'j2' 'p2'
'd2' 'j3' 'p2'
'd2' 'j3' 'p4'
'd2' 'j4' 'p2'
76 CHAPTER 2: NORMALIZATION
'd2' 'j4' 'p4'
'd2' 'j5' 'p2'
'd2' 'j5' 'p4'
'd3' 'j2' 'p5'
'd3' 'j2' 'p6'
If you want to add a part to a dept_name, you must create more than
one new row. Likewise, removing a part or a job from a row can destroy
information. Updating a part or job name will also require multiple rows
to be changed.
The solution is to split this table into two tables, one with
(dept_name, jobs) in it and one with (dept_name, parts) in it. The
definition of 4NF is that we have no more than one MVD in a table. If a
table is in 4NF, it is also in BCNF.
2.8 Fifth Normal Form (5NF)
Fifth Normal Form (5NF), also called the Join-Projection Normal Form
or the Projection-Join Normal Form, is based on the idea of a lossless
JOIN or the lack of a join-projection anomaly. This problem occurs
when you have an n-way relationship, where n > 2. A quick check for
5NF is to see if the table is in 3NF and all the candidate keys are single
columns.
As an example of the problems solved by 5NF, consider a table of
house notes that records the buyer, the seller, and the lender:
HouseNotes
buyer seller lender
==================================
'Smith' 'Jones' 'National Bank'
'Smith' 'Wilson' 'Home Bank'
'Nelson' 'Jones' 'Home Bank'
This table is a three-way relationship, but because many CASE tools
allow only binary relationships, it might have to be expressed in an E-R
(entity-relationship) diagram as three binary relationships, which would
generate
CREATE TABLE statements leading to these tables:
2.8 Fifth Normal Form (5NF) 77
BuyerLender
buyer lender
=============================
'Smith' 'National Bank'
'Smith' 'Home Bank'
'Nelson' 'Home Bank'
SellerLender
seller lender
=======================
'Jones' 'National Bank'
'Wilson' 'Home Bank'
'Jones' 'Home Bank'
BuyerSeller
buyer seller
================
'Smith' 'Jones'
'Smith' 'Wilson'
'Nelson' 'Jones'
The trouble is that when you try to assemble the original information
by joining pairs of these three tables together as follows:
SELECT BS.buyer, SL.seller, BL.lender
FROM BuyerLender AS BL,
SellerLender AS SL,
BuyerSeller AS BS
WHERE BL.buyer = BS.buyer
AND BL.lender = SL.lender
AND SL.seller = BS.seller;
you will recreate all the valid rows in the original table, such as (‘Smith’,
‘Jones’, ‘National Bank’), but there will also be false rows, such as
(‘Smith’, ‘Jones’, ‘Home Bank’), which were not part of the original table.
This is called a join-projection anomaly.
There are also strong JPNFs and overstrong JPNFs, which make use
of
JOIN dependencies (JDs for short). Unfortunately, there is no
systematic way to find a JPNF or 4NF schema, because the problem is
known to be NP-complete.
78 CHAPTER 2: NORMALIZATION
As an aside, 3NF is very popular with CASE tools, and most of them
can generate a schema where all of the tables are in 3NF. They obtain the
FDs from an E-R diagram or from a statistical analysis of the existing
data, then put them together into tables and check for normal forms.
The bad news is that it is often possible to derive more than one 3NF
schema from a set of FDs. Most CASE tools that produce an E-R diagram
will find only one of them, and go no further. However, if you use an
ORM (Object Role Model) tool properly, the schema will be in 5NF. I
suggest strongly that you get any of the books by Terry Halpin on this
technique: Information Modeling and Relational Databases from Conceptual
Analysis to Logical Design, ISBN 1-55860-672-6; Database Modeling with
Microsoft® Visio for Enterprise Architects, ISBN 1-55860-919-9.
2.9 Domain-Key Normal Form (DKNF)
Ronald Fagin defined Domain/Key Normal Form (DKNF) in 1981 as a
schema that has all of the domain constraints and functional
dependencies enforced. There is not yet a general algorithm that will
always generate the DKNF solution, given a set of constraints. We can,
however, determine DKNF in many special cases, and it is a good guide
to writing DDL in the real world.
Let’s back up a bit and look at the mathematical model under
normalization. An FD has a defined system of axioms that can be used in
normalization problems. These six axioms, known as Armstrong’s
axioms, are given below:
Reflexive: X → X
Augmentation: If X → Y then XZ → Y
Union: If (X → Y and X → Z) then X → YZ
Decomposition: If X → Y and Z a subset of Y, then X → Z
Transitivity: If (X → Y and Y → Z) then X → Z
Pseudotransitivity: If (X → Y and YZ → W) then XZ → W
They make good sense if you just look at them, which is something
we like in a set of axioms. In the real world, the FDs are the business
rules we are trying to model.
2.9 Domain-Key Normal Form (DKNF) 79
In the normalization algorithm for 3NF (developed by P. A. Berstein,
1976) we use the axioms to get rid of redundant FDs. For example, if we
are given:
A -> B
A -> C
B -> C
DB -> E
DAF -> E
A → C is redundant, because it can be derived from A → B and B →
C with transitivity. Also, DAF → E is redundant, because it can be
derived from DB → E and A → B with transitivity (which gives us DA →
E) and augmentation (which then allows DAF → E). What we would like
to find is the smallest set of FDs from which we can generate all of the
given rules. This is called a nonredundant cover. For the FDs above, one
cover would be:
A -> B
B -> C
DB -> E
Once we do this, Berstein shows that we can just create a table for
each of the FDs where A, B, and DB are the respective keys. We have
taken it easy so far, but now it’s time for a challenge.
As an example of a schema with multiple 3NF tables, here is a
problem that was used in a demonstration by DBStar Corporation (now
Evoke Software). The company used it as an example in a demonstration
that comes with their CASE tool.
We are given an imaginary and simplified airline that has a database
for scheduling flights and pilots. Most of the relationships are obvious
things. Flights have only one departure time and one destination. They
can get a different pilot and can be assigned to a different gate each day
of the week. The FDs for the database are given below:
1) flight -> destination
2) flight -> hour
3) (day, flight) -> gate
4) (day, flight) -> pilot
5) (day, hour, pilot) -> gate
6) (day, hour, pilot) -> flight
80 CHAPTER 2: NORMALIZATION
7) (day, hour, pilot) -> destination
8) (day, hour, gate) -> pilot
9) (day, hour, gate) -> flight
10) (day, hour, gate) -> destination
A purist, looking at this collection of FDs, will be bothered by the
redundancies in this list. But in the real world, when you interview
people, they do not speak to you in a minimal set; they state things that
they know to be true in their situation. In fact, they very often leave out
relationships that they consider too obvious to mention.
Your problem is to find 3NF or stronger database schemas in these
FDs. You have to be careful! You have to have all of the columns,
obviously, but your answer could be in 3NF and still ignore some of the
FDs. For example, this will not work:
CREATE TABLE PlannedSchedule
(flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE ActualSchedule
(day, flight, gate, pilot, PRIMARY KEY (day, flight));
If we apply the Union axiom to some of the FDs, we get:
(day, hour, gate) -> (destination, flight, pilot)
(day, hour, pilot) -> (destination, flight, gate)
This example says that the user has required that if we are given a
day, an hour, and a gate we should be able to determine a unique flight
for that day, hour, and gate. We should also be able to determine a
unique flight given a day, hour, and pilot.
Given the PlannedSchedule and ActualSchedule tables, you cannot
produce views where either of the two constraints we just mentioned is
enforced. If the query “What flight does pilot X have on day Y and hour
Z?” gives you more than one answer, it violates the FDs and common
sense. Here is an example of a schema that is allowable in this proposed
schema, but is undesirable given our constraints:
PlannedSchedule
flight hour destination
=========================
118 17:00 Dallas
2.9 Domain-Key Normal Form (DKNF) 81
123 13:00 Omaha
155 17:00 Los Angeles
171 13:00 New York
666 13:00 Dis
ActualSchedule
day flight pilot gate
========================
Wed 118 Tom 12A
Wed 155 Tom 13B
Wed 171 Tom 12A
Thu 123 John 12A
Thu 155 John 12A
Thu 171 John 13B
The constraints mean that we should be able to find a unique answer
to each of the following questions and not lose any information when
inserting and deleting data:
1. Which flight is leaving from gate 12A on Thursdays at 13:00?
This question looks fine until you realize that you don’t know
any further information about flight 666, which was not
required to have anything about its day or pilot in the Actu-
alSchedule table. Likewise, I can add a flight to the Actu-
alSchedule table that has no information in the
PlannedSchedule table.
2. Which pilot is assigned to the flight that leaves gate 12A on
Thursdays at 13:00? This has the same problem as before.
3. What is the destination of the flight in query 1 and 2? This has
the same problem as before.
4. What gate is John leaving from on Thursdays at 13:00?
5. Where is Tom flying to on Wednesdays at 17:00?
6. What flight is assigned to Tom on Wednesdays at 17:00?
It might help if we gave an example showing how one of the FDs in
the problem can be derived using the axioms of FD calculus, just like
you would do a geometry proof: