Data Integrity

Introduction

When working with databases, there are several concepts related to data accuracy.
Collectively these are referred to as Data Integrity.

Data Integrity means that the data is accurate and available. Potential problems related to Data Integrity are:

  • inaccurate data
  • incorrect data
  • missing data
  • duplicate data

data integrity

Inaccurate and incorrect data are basically the same thing, although we may look at the cases a bit differently.

Here is an example of each one:

  • Inaccurate Data
    The customer record for Dan Smith says he has been a customer for 5 years. Actually, Dan has only been a customer for 2 years. But he now receives a 10% discount on all purchases, because he is incorecctly given a 5-year loyalty discount – which means you lose money every time he buys something in your store.
  • Incorrect Data
    The customer records for Sally Jones says her email is “s jones@someco.com” – a typo inserted a space in the name, which makes it wrong. So, Sally does not receive any emails and did not receive the invoice you sent her for $500.00.
  • Missing Data
    Customer Wendy Williams just purchased a deluxe trip package online for $2,000. But somehow, after she received her vouchers to print, the payment process encountered an error: no invoice was created and no payment was received (so she has a voucher and you don't have the money). Because the glitch did not generate any error message or log record, there is not even a record that Wendy made a purchase (except she has her valid vouchers).
  • Duplicate Data
    This seems like it might be more of an inconvenience, rather than a problem. Except, when your automated system recently ordered new printer ink, it created ten purchase order records instead of one. So, the supplier is shipping and billing your company for 10 times more than you wanted.

These examples illustrate that Data Integrity is very important, and that data integrity problems are serious.

Referential Integrity

Referential Integrity is part of Data Integrity. It is concerned with the relationships between tables. If those relationships are not correct, or the links between records are not correct, then bad data is the result.

There are two main aspects of referential integrity:

  • It should not be possible to set a foreign key(FK) to point to a primary key (PK) that does not exist. In other words, any FK should only reference a valid PK.
  • An associated rule is that a PK cannot be removed (deleted) if one or more FK’s are still pointing to it (that would result in a violation of rule 1 – an FK would be referencing an invalid PK).

Notice that these rules apply to the case when an FK is set or updated. Some FK’s may not be required, which means they are not required to contain a value at all. In other cases, an FK may be required.

Examples

Here are two examples of when an FK might be required and not required:

  • Required FK
    There are two related tables: customers and invoices. Every invoice must be assigned to a customer, so the customerid FK in the invoices table is required. (If not, it would be possible to have an invoice without a customer.) Referential integrity demands that the customerid FK must reference a valid customer.
  • Non-required FK
    There are two related tables: pets and petimages. The pets table contains an FK for petimageid. This is not a required FK (it may be null) because not every pet has an image on file. In this case, the petimageid is not required. Referential integrity demands that IF the petimageid value is set, it must contain a valid petimageid and point at a valid record.

The following graphic illustrates several data integrity problems.

referential integrity

Database Design to Improve Data Integrity

It is not possible to completely eliminate data integrity problems. The possibility of missing and inaccurate data is always present (due to data entry errors, for example). But there are ways to improve data integrity.

Much of the responsibility for good data falls on the application that feeds the database. Data validation and other error-checking methods on the front-end and business layer of an application go a long way towards improving the quality of data. Data integrity is a fail-safe to catch problems that are not caught before it gets to the database.

On the database side, data integrity can be improved in three important ways:

  • Using a Foreign Key constraint on an FK column forces the database to enforce referential integrity on that relationship
  • Using other types of column contraints can prevent unwanted data values from being inserted or updated into individual columns
  • Using table-level contraints can prevent duplicate records from being created

Creating these constraints will be explored in another article.

Conclusion

Data Integrity is an important aspect of creating and maintaining a database. Bad data can be a serious problem and difficult to control without the right types of constraints and relationships in place.