Database Normalization

DATABASE NORMALIZATION – DEFINITIONS

INTRODUCTION

According to (Elmasri & Navathe, 1994), the normalization process, as first proposed by Codd (1972), takes a relation schema through a series of tests to "certify" whether or not it belongs to a certain normal form. Initially, Codd proposed three normal forms, which he called first, second, and third normal form. A stronger definition of 3NF was proposed later by Boyce and Codd and is known as Boyce-Codd normal form (BCNF). All these normal forms are based on the functional dependencies among the attributes of a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were proposed, based on the concepts or multivalued dependencies and join dependencies, respectively.

Normalization of data can be looked on as a process during which unsatisfactory relation schemas are decomposed by breaking up their attributes into smaller relation schemas that possess desirable properties. One objective of the original normalization process is to ensure that the update anomalies do not occur.

Normal forms provide database designers with:

· A formal framework for analyzing relation schemas based on their keys and on the functional dependencies among their attributes.

· A series of tests that can be carried out on individual relation schema so that the relational database can be normalized to any degree. When a test fails, the relation violating that test must be decomposed into relations that individually meet the normalization tests.

Normal forms, when considered in isolation from other factors, do not guarantee a good database design. It is generally not sufficient to check separately that each relation schema in the database is, say, in BCNF or 3NF. Rather, the process of normalization through decomposition must also confirm the existence of additional properties that the relational schemas, taken together, should possess. Two of these properties are: · The lossless join or nonadditive join property, which guarantees that the spurious tuple problem does not occur.

· The dependency preservation property, which ensures that all functional dependencies are represented in some of the individual resulting relations.

DEFINITIONS

A relation is defined as a set of tuples. By definition, all elements of a set are distinct; hence, all tuples in a relation must also be distinct. This means that no two tuples can have the same combination of values for all their attributes.

Any set of attributes of a relation schema is called a superkey. Every relation has at least one superkey—the set of all its attributes. A key is a minimal superkey, i.e., a superkey from which we cannot remove any attribute and still have the uniqueness constraint hold.

In general, a relation schema may have more than one key. In this case, each of the keys is called a candidate key. It is common to designate one of the candidate keys as the primary key of the relation. A foreign key is a key in a relation R but it's not a key (just an attribute) in other relation R' of the same schema.

Integrity Constraints: the entity integrity constraint states that no primary key value can be null. This is because the primary key value is used to identify individual tuples in a relation; having null values for the primary key implies that we cannot identify some tuples. The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples of the two relations. Informally, the referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation.

An attribute of a relation schema R is called a prime attribute of the relation R if it is a member of any key of the relation R. An attribute is called nonprime if it is not a prime attribute—that is, if it is not a member of any candidate key.

A functional dependency, denoted by X->Y, between two sets of attributes X and Y that are subsets of R specifies a constraints on the possible tuples that can form a relation instance of R.

NORMAL FORMS

First Normal Form (1NF)

First normal form is now considered to be part of the formal definition of a relation; historically, it was defined to disallow multivalued attributes, composite attributes, and their combinations. It states that the domains of attributes must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute.

Practical Rule¹: "Eliminate Repeating Groups," i.e., make a separate table for each set of related attributes, and give each table a primary key.

Formal Definition²: A relation is in first normal form (1NF) if and only if all underlying simple domains contain atomic values only.

Second Normal Form (2NF)

Second normal form is based on the concept of fully functional dependency. A functional X->Y is a fully functional dependency is removal of any attribute A from X means that the dependency does not hold any more. A relation schema is in 2NF if every nonprime attribute in relation is fully functionally dependent on the primary key of the relation. It also can be restated as: a relation schema is in 2NF if every nonprime attribute in relation is not partially dependent on any key of the relation.

Practical Rule¹: "Eliminate Redundant Data," i.e., if an attribute depends on only part of a multivalued key, remove it to a separate table.

Formal Definition²: A relation is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is fully dependent on the primary key.

Third Normal Form (3NF)

Third normal form is based on the concept of transitive dependency. A functional dependency X->Y in a relation is a transitive dependency if there is a set of attributes Z that is not a subset of any key of the relation, and both X->Z and Z->Y hold. In other words, a relation is in 3NF if, whenever a functional dependency X->A holds in the relation, either (a) X is a superkey of the relation, or (b) A is a prime attribute of the relation.

Practical Rule¹: "Eliminate Columns not Dependent on Key," i.e., if attributes do not contribute to a description of a key, remove them to a separate table.

Formal Definition²: A relation is in third normal form (3NF) if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.

Boyce-Codd Normal Form (BCNF)

Boyce-Codd normal form is stricter than 3NF, meaning that every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF. A relation schema is an BCNF if whenever a functional dependency X->A holds in the relation, then X is a superkey of the relation. The only difference between BCNF and 3NF is that condition (b) of 3NF, which allows A to be prime if X is not a superkey, is absent from BCNF.

Formal Definition²: A relation is in Boyce/Codd normal form (BCNF) if and only if every determinant is a candidate key. [A determinant is any attribute on which some other attribute is (fully) functionally dependent.]

Fourth Normal Form (4NF)

Multivalued dependencies are a consequence of first normal form, which disallowed an attribute in a tuple to have a set of values. If we have two or more multivalued independent attributes in the same relation schema, we get into a problem of having to repeat every value of one of the attributes with every value of the other attribute to keep the relation instances consistent.

Fourth normal form is based on multivalued dependencies, which is violated when a relation has undesirable multivalued dependencies, and hence can be used to identify and decompose such relations. A relation scheme R is in 4NF with respect to a set of dependencies F is, for every nontrivial multivalued dependency X->F, X is a superkey for R.

Practical Rule¹: "Isolate Independent Multiple Relationships," i.e., no table may contain two or more 1:n or n:m relationships that are not directly related.

Formal Definition²: A relation R is in fourth normal form (4NF) if and only if, whenever there exists a multivalued dependency in the R, say A->>B, then all attributes of R are also functionally dependent on A.

Fifth Normal Form (5NF)

In some cases there may be no losses join decomposition into two relation schemas but there may be a losses join decomposition into more than two relation schemas. These cases are handled by the join dependency and fifth normal form, and it’s important to note that these cases occur very rarely and are difficult to detect in practice.

Practical Rule¹: "Isolate Semantically Related Multiple Relationships," i.e., there may be practical constraints on information that justify separating logically related many-to-many relationships.

Formal Definition²: A relation R is in fifth normal form (5NF)—also called projection-join normal form (PJNF)—if and only if every join dependency in R is a consequence of the candidate keys of R.

A join dependency (JD) specified on a relations schema R, specifies a constraint on instances of R. The constraint states that every legal instance of R should have a losses join decomposition into sub-relations of R, that when reunited make the entire relation R. A relation schema R is in fifth normal form (5NF) (or project-join normal form (PJNF)) with respect to a set F of functional, multivalued, and join dependencies if, for every nontrivial join dependency JD(R1, R2, …, Rn) in F (implied by F), every Ri is a superkey of R.

Domain Key Normal Form (DKNF)

We can also always define stricter forms that take into account additional types of dependencies and constraints. The idea behind domain-key normal form is to specify, (theoretically, at least) the "ultimate normal form" that takes into account all possible dependencies and constraints. A relation is said to be in DKNF if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation.

For a relation in DKNF, it becomes very straightforward to enforce the constraints by simply checking that each attribute value in a tuple is of the appropriate domain and that every key constraint on the relation is enforced. However, it seems unlikely that complex constraints can be included in a DKNF relation; hence, its practical utility is limited.

Notes:

1. Most of the modern DBMS systems offer "some kind" of DKNF normalization, by giving the designer the opportunity to assign domain and specific properties (such as key specification) to each attribute of a relation part of a schema. However, that's NOT a guarantee that the resulted relation is in DKNF.

2. The relationship between the 7 levels of normalization (1 through 5, plus BCNF and DKNF) can intuitively be represented as layered, concentric circles, with the largest circle as the 1NF, then a smaller, inside circle as the 2NF, and so on, the smallest circle being the DKNF. This is because a relation is defined as being in a given normal form (let's say 2NF) if and only if it is already in the immediately previous normal form (i.e., 1NF) and satisfies additional requirements.

REFERENCES

Elmasri, R., & Navathe, S. (1994). Fundamentals of Database Systems. 2nd ed. Redwood City, CA: The Benjamin/Cummings Publishing Co. pp. 143 – 144, 401, 407 – 409, 435, 438, 440, 442 - 443.

¹Rettig, Marc. (1995). Database Programming & Design. Miller Freeman, Inc.

²Date, C. J. (1990). An Introduction to Database Systems. 5th ed. Volume I. Reading, MA: Addison-Wesley Publishing Company.