150 Chapter 11 / Identity
IDEF1X uses the AKn.m annotation to indicate unique combinations of fields (alternate
keys). The “n” is the number of the alternate key. The “m” is the sequence of fields within
the alternate key. Thus in Figure 11.1 the airlineName and airlineCode each individually de-
note an Airline. The combination of airlineID and accountNumber is also unique for a Fre-
quentFlyerAccount.
11.1.2 UML Qualifiers
UML qualifiers are also helpful for specifying intrinsic identity. A qualifier is an attribute
that distinguishes among the entities at a “many” relationship end. A qualifier selects among
the entities, reducing the effective multiplicity, often from “many” to “one”. For example, in
Figure 11.1 the UML model and IDEF1X models have the same meaning. They specify that
an Airline plus an accountNumber yield at most one FrequentFlyerAccount. In contrast, if
the accountNumber is omitted, an Airline yields many FrequentFlyerAccounts. A qualifier
specifies an important path for traversing a model and finding data.
11.1.3 Logical Horizon
Traversal is a third aspect of intrinsic identity. Applications can navigate a model and its cor-
responding database by traversing relationships and generalizations. Traversals of a model
become SQL joins in an implementation.
The logical horizon [Feldman-1986] of an entity type is the set of entity types reachable
by one or more paths terminating in a cumulative multiplicity of one. A path is a sequence
of traversals of relationships and generalization levels. The purpose of the logical horizon is
to compute the specific entities that can be inferred from a starting entity.
First I will explain Figure 11.2 and then give some examples for logical horizon. A Pe r-
son may have multiple FrequentFlyerAccounts, each of which is offered by an Airline. An
accountNumber is unique within the context of an Airline. Each FrequentFlyerAccount can
have Activities posted of various ActivityTypes. For example, a frequent flyer account can
have credits posted for actual flight miles, bonus miles, dining activity, and car rentals.
Figure 11.1 Intrinsic identity in a model. There are unique combinations of
fields that provide starting points for finding data in a database.
Airline
name {unique}
FrequentFlyerAccount
1
accountNumber
0 1
lastActivityDate
/currentBalance
IDEF1X model
UML model
airlineID
Airline
airlineName (AK1.1)
frequentFlyerAccountID
FrequentFlyerAccount
lastActivityDate
currentBalance
airlineID (FK) (AK1.1)
accountNumber (AK1.2)
code {unique}
airlineCode (AK2.1)
11.2 Names 151
The logical horizon of Person is null; the only relationship is to FrequentFlyerAccount
and the multiplicity is “many.” The logical horizon of FrequentFlyerAccount is Pers on and
Airline. The logical horizon of Activity is FrequentFlyerAccount, Person, Airline, and Activ-
ityType.
11.2 Names
Names are prominent in models and often helpful for finding entities. Webster’s dictionary
defines a name as “a word or phrase that constitutes the distinctive designation of a person
or thing.” There are four scenarios for how names can be used to find entities.
• Unique names. Some names are unique and resolve to a single entity. For example, the
name of a country corresponds to a specific country. In Figure 11.2 Airline name, Air-
line code, and ActivityType name are globally unique.
Figure 11.2 Names. Names are often helpful for finding entities.
Airline
name {unique}
FrequentFlyerAccount
1
Person
name
1
*
accountNumber
0 1
lastActivityDate
/currentBalance
Activity
amount
date
*
1
ActivityType
name {unique}
*
1
IDEF1X model
UML model
airlineID
Airline
airlineName (AK1.1)
personID
Person
frequentFlyerAccountID
FrequentFlyerAccount
lastActivityDate
currentBalance
airlineID (FK) (AK1.1)
accountNumber (AK1.2)
personID (FK)
personName
activityID
Activity
amount
date
frequentFlyerAccountID (FK)
activityTypeID (FK)
activityTypeID
ActivityType
activityTypeName (AK1.1)
code {unique}
airlineCode (AK2.1)
152 Chapter 11 / Identity
• Unique names within a context. Other names are not unique on their own but are
unique when combined with a parent entity. For example, the names of provinces are
unique within the context of a country. UML qualifiers (and their equivalent in IDEF1X
alternate keys) provide a notation for specifying fields that are unique within a context.
In Figure 11.2 accountNumber provides the unique name for a FrequentFlyerAccount
within the context of an Airline.
• Non-unique names. Still other names provide important description but alone cannot
find an entity. For example, person names are important, but insufficient for finding an
individual person. Sometimes non-unique names can be augmented with additional de-
tails to find a specific entity.
• Multiple unique names. Some entities have multiple names. Figure 11.3 promotes the
substanceName attribute to an entity type because each chemical substance may have
multiple aliases. For example, propylene is known as propylene and C
3
H
6
.
11.3 Surrogate Identity
Sometimes entities can be identified via other entities with which they are closely related.
For example, you cannot reliably identify a person with his or her name. However, in some
applications you can identify a person via a passport, driver’s license, or identity card.
11.4 Structured Fields
Some entities, especially mechanical parts and items for commerce, have structured fields
that provide identity. A structured field is a field that is composed from constituent pieces
with a specified grammar. Structured fields are synthetic but when parsed the pieces have
meaning. Many structured fields are backed by standard protocols.
As an example, consider the UPC, EAN, and GTIN codes for consumer packaged
goods. The UPC (Universal Product Code) originated in the United States in the early 1970s
SubstanceName
substanceName
alias
Substance
1 *
Figure 11.3 Multiple unique names. Chemical substances can have
multiple names that identify the substance.
1
IDEF1X model
UML model
substanceID
Substance
substanceNameID
SubstanceName
substanceName
substanceID (FK)
11.5 Master Applications 153
and has twelve digits. The first digit is the system digit and indicates the kind of item—gen-
eral merchandise, random-weight item, health item, in-house item, and coupon. The next
five digits denotes the vendor. Digits seven through eleven are called the item code—a
unique code for an item within the context of a vendor. The last digit is a check digit.
The UPC was followed by the EAN (European Article Number) which has thirteen dig-
its. The EAN has the following sequence: two system digits, five vendor digits, five item dig-
its, and one check digit.
Most recently the UPC and EAN have both been replaced by the GTIN (Global Trade
Item Number). The GTIN has fourteen digits with the following sequence: item digit, two
system digits, five vendor digits, five item digits, and one check digit. Note that there are a
total of six item digits, one at the start of the sequence and the other five later in the sequence.
The extra item digit is often used to distinguish individual items from multiple items that are
packaged into cases and pallets.
Many manufacturers have product codes that embody information about the major com-
ponents, location of manufacture, and date of manufacture.
11.5 Master Applications
Some organizations have dedicated applications (called master applications) that enforce
the identity of occurrences of crucial concepts and unify their data across an organization’s
applications. All requests for creating, modifying, and deleting such occurrences must be co-
ordinated via the corresponding master application. For example, customers, parts, loca-
tions, securities, and contracts are often good candidates for master applications. An
organization should have no more than a handful of master applications and the subject con-
cepts should pervade the organization.
Consider a Customer master application. If each application names customers, there
could be many variations, such as AT&T, A.T.&T., and American Telephone and Telegraph.
One application might have the current mailing address and another might have a stale ad-
dress from the past. With this kind of chaos, it is difficult to have a global perspective for
data. In contrast, a Customer master application would dictate the precise name for a com-
pany for use throughout an organization and serve as a centralized source for its data.
11.6 Merging Data
Duplicate copies of data can arise in various ways, such as through flaws in business pro-
cesses. For example, an airline may create two frequent flyer accounts for a person and not
realize that they are dealing with the same customer. Duplicate copies can also arise through
business acquisitions that cause databases to be merged. For example, the merger of Delta
Airlines and Northwest Airlines created a need to merge frequent flyer databases.
The tricky part of merging data is determining if one entity is the same thing as another.
The entities being compared must have matching real-world data that can identify them or
there must be an algorithm to deduce a match (such as by equating the various AT&T names
154 Chapter 11 / Identity
in Section 11.5). Sometimes the information required to make the determination is missing
from a database; then a person has to investigate to reach a decision. Once a match is estab-
lished the mechanics of combination are straightforward but can still be tedious to accom-
plish without disrupting a business.
There is no merge technique that is clearly superior. The appropriate solution depends
on the situation as Table 11.1 explains.
11.7 Chapter Summary
Identity is the property that distinguishes an entity from all others. In concept, an entity has
intrinsic identity apart from how the entity may happen to be implemented. Candidate keys
Merge
approach
Definition Advantages Drawbacks
Combine
entities
Copy data from one
entity to the other.
Then discard the
rejected entity.
• Fully consoli-
dates the data.
• There can be much data to
migrate, especially if there
are many foreign keys.
• May have to remap foreign
keys and revise foreign key
definitions.
• It is easy to overlook a for-
eign key, causing a stale ref-
erence.
Favor an
entity
Mark one entity as
deprecated and the
other as active. Both
entities remain in the
database. Gradually
migrate data to the
active entity.
• Eventually con-
solidates the data.
• There is less risk
of stale references
as obsolete data
can persist for
awhile.
• In the meantime there is
database clutter, possible
confusion, and indirection
in accessing data.
• There can be much data to
migrate, especially if there
are many foreign keys.
Logical
merge
Use a binding table to
logically combine
entities. (See Section
10.1.) One entity
could be favored for
new data.
• Can readily merge
data as well as
unmerge.
• There is no risk of
stale references.
• Causes indirection in
accessing data, complicat-
ing queries and slowing per-
formance.
• The database schema
remains fragmented by mul-
tiple aspects of an entity.
Table 11.1 Summary of Approaches to Merging Data
Note: The appropriate solution depends on the situation.