I recently completed the Databases for Data Scientists specialization on Coursera from the University of Colorado Boulder. Most of the three courses in the specialization were a review for me, but the Relational Database Design course covered normalization theory in greater detail than I had previously seen.
Although I was familiar with entity relationship diagrams and the benefits of normalization as far as minimizing data redundancy, I had not studied the more formal normalization process and the concept of functional dependencies.
Functional Dependencies
In a relation (a table), an attribute (a column) such as “date_of_birth” would always determine a column “zodiac_sign”. In other words, “zodiac_sign” is functionally dependent on “date_of_birth”. Knowing these kinds of relationships between attributes forms the basis for which attributes belong in a relation and which should be linked together with foreign keys.
Normal Forms
Previously, I’ve focused primarily on minimizing data redundancy when determining what attributes belong in the same relation. Having a more formal system of first, second, and third normal forms adds structure to this process.
In particular, the course introduced:
- partial functional dependencies—and how to remove them to restructure a table from first to second normal form.
- transitive functional dependencies—and how to remove them in order to restructure a table from second to third normal form.
Speeches Example
In a previous project, I created a normalized PostgreSQL database from a CSV file of India’s Independence Day speeches. Spurred on by what I learned in the PostgreSQL for Everybody specialization, my goal was to store every string only once and link tables together with integer keys.
Now armed with the terminology of relational database theory, I would not change the database’s structure, but I can provide new justifications for why the design is “suitable”.
For every relation, the primary key determines all of the other attributes. For example, the corpus
relation, can be represented as:
corpus (year (pk), pm_id(fk), party_id(fk), title_id(fk), footnote_id(fk), source_id(fk), url_id(fk), text_id(fk))
It has only one functional dependency, which could be written as:
year -> pm_id, party_id, title_id, footnote_id, source_id, url_id, text_id
Of course, this is an overly simple example, but the principles can be useful for working out more complicated situations.