This book devotes two chapters to the database. This chapter covers the basic concepts, talking about the different ways in which the very term database is used, explaining how databases are built, and detailing what databases are about. Chapter 10 dives in a little deeper to deal with some architectural distinctions between classes of database products.
In theory, you can read this chapter, skip the next, and still make sense of the whole book. The second database chapter (Chapter 10) deals with words like relational, hierarchical, network, object-oriented, schema, entity, relationship, and design and can be pretty abstract and complex. At the same time, I hope by now that you understand that difficult-sounding terms are generally based on commonsense concepts. With a little patience and careful reading, virtually everyone should understand both chapters.
I split the database chapters not only to make it easy to skip a chapter, but also to make it easy to read one, rest, and then read the next. If you read both, I promise that future database discussions will make sense to you.
This chapter is a watershed in your journey through the continent of client/server.
In the United States, a series of major mountain ranges running north and south marks the continental divide: on one side of the ranges, water runs east to the Atlantic, and on the other side, rivers flow west to the Pacific. For many years, these ranges marked a critical transition point for travelers. Struggling through difficult, unknown territory, pioneers knew that if they made it over the divide, they were likely to finish their journey.
In the same way, in all the preceding technically oriented chapters, I laid the groundwork; now you can start building structure on top of that foundation.
Until now, I have talked about major components that revolve largely around hardware. Graphical user interfaces (GUIs) are based on color, screens, mice, keyboards, fast display memory, and display hardware. Networks depend on communications lines or fast local buses. Servers are computers per se, dedicated to a particular task. And mainframes are highly specialized pieces of hardware designed and built to provide massive throughput.
Databases are different, based almost entirely on software. Time after time, the most important advances in database technology are based on software innovation alone, although the new software may take advantage of newer and faster hardware.
Databases define a watershed between hardware-centered technology and a world of software-based mechanisms. You are traveling from the land of enablement to the land of applications that are enabled. Although a PC may have a great GUI, may be attached to a super local-area network (LAN), may share a high-performance server, and may communicate with other LANs and with mainframes over a wide-area network (WAN), it is only as useful as the applications running in the environment. All the components I have described so far simply enable applications to run. Databases, meanwhile, are a class of applications. True, they enable other applications, but they are also useful even in the absence of other applications. So this chapter marks a watershed: as I talk about databases, I am taking you to the land of applications enabled by all the technology now behind us.
Neither client, server, LAN, WAN, nor mainframe defines how a computer can help an organization. Databases, built on those pieces, go a good way toward providing that kind of definition. Here, for the first time, you can start to come to grips directly with the core benefits that computers deliver to companies.
Worldwide, the database industry generates over $5 billion in revenues every year. Mainframe packages costing hundreds of thousands of dollars per year, server-resident databases worth tens of thousands each, and PC products such as dBASE, Paradox, and FoxPro costing a few hundred dollars each are all just parts of the database market.
Database is not a single concept but a family of related concepts. When the time comes to build a complete database system, you need several key components to make the whole thing work. Database users tend to be familiar with the part that they use most; as a result, they think of that part as if it were the whole database.
Using the single word database to refer to many things makes it hard to get a handle on the concept. As the key parts take shape, however, you can see clearly a framework for thinking about the design of computer systems in general and client/server systems in particular.
What is a database? A related question: What is the best-selling database in the world? The answer, surprisingly, is 1-2-3 (or soon, perhaps, Excel). That's right, a spreadsheet. How can a spreadsheet be the best-selling database when it's not even a database? Well, recall that Lotus called the program 1-2-3 because it came with three major functions -- spreadsheet, graphics, and database -- unlike VisiCalc, the original spreadsheet. The programmers who wrote 1-2-3 certainly intended that it could be used as a database.
Spreadsheets enable users to organize information easily in large tables. Spreadsheets are so easy to use that beginning users don't need training or manuals to do their jobs. Even if they don't ever use formulas, users can quickly fill in successive rows, scroll back and forth, and print the tables at will. The fact that they can start filling in rows with literally no training, make changes without opening manuals, and get the job done with no programming is a huge part of the spreadsheet's appeal.
At one time, spreadsheets were limited in size; a table couldn't have more than a few hundred rows. Not today: A typical 386- or 486-class machine with 4MB of main memory can easily handle a spreadsheet with thousands of rows. Today users can reasonably build a table with up to 10,000 rows, representing 10,000 records. That's a pretty good-sized database for most people. Salespeople, for example, rarely deal with that many customers, churches usually have fewer members, and many product marketing managers are thrilled when they gather even half that many data points to massage.
When I was a marketing manager, I needed to analyze distribution patterns for several products and determine whether stores that sold and stocked popular products also carried less popular products. The theory: my company's most successful dealers were ignoring our newer lines. I wanted to get these successful retail locations to start promoting the newer products.
I needed to analyze about 4,000 rows of information, one row for each store, and about 25 columns per row, containing dealer name, geographical information (such as city, state, and ZIP code), and, of course, sales volumes for the last year for 20 products. Overall, my little project required manipulating a database with 4,000 rows and 25 columns for a grand total of 100,000 pieces of information -- far more than anybody could work with by hand.
Initially, I set up this database on a large minicomputer. The minicomputer database required me to use a powerful but complex command language. Simple operations often took up to a half hour. And much to my frustration, the minicomputer would periodically be unavailable or very slow because of other scheduled activities. I easily could have downloaded the information into Paradox or dBASE, the leading databases of the time, but I then would have had to deal with a different command language. So, instead, I downloaded into Excel running on a Mac with 4MB of memory.
Suddenly, sorts took no time at all. I could specify a sort in a few keystrokes (basically without commands) and see the result 20 seconds later. Patterns quickly became obvious.
Did certain retailers sell more of one product than another? Were there dealers who sold none of the first product? (I found these dealers at the bottom of the sort.) Had those stores sold much of the second product? Which dealers sold the most of the second product? Were those dealers concentrated in a particular city or state? Were there regional variations in general?
I drew a graph comparing sales of one product with another. I added a column showing the ratio of the sum of a basket with six older products to the sum of another basket with all new lines.
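That ratio column is plain arithmetic. Here is a sketch of the calculation in Python, with invented product columns standing in for the real ones:

```python
# One row per dealer; columns hold last year's unit sales per product.
dealers = [
    {"name": "Video Barn", "old_a": 120, "old_b": 80, "new_x": 5,  "new_y": 2},
    {"name": "Tape World", "old_a": 60,  "old_b": 40, "new_x": 30, "new_y": 25},
]

OLD = ["old_a", "old_b"]  # the basket of older products
NEW = ["new_x", "new_y"]  # the basket of newer lines

for dealer in dealers:
    old_total = sum(dealer[c] for c in OLD)
    new_total = sum(dealer[c] for c in NEW)
    # A high ratio flags a strong dealer who is ignoring the new lines.
    dealer["old_to_new"] = old_total / new_total if new_total else float("inf")
    print(dealer["name"], round(dealer["old_to_new"], 1))
```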
Any spreadsheet user could quickly answer these questions. The operations are simple; all can be completed in 10 to 20 seconds. In fact, 95 percent of ordinary humans, given several days with their favorite spreadsheet, already know how to do everything in my list (and more). Little wonder, then, that I made Excel my database. And little wonder that spreadsheets are by far the leading databases in the world today -- and in the foreseeable future.
As databases, spreadsheets have several huge advantages. They have a highly intuitive graphical interface, are useful for many other tasks (they are the Swiss army knives of the computer world), fit onto laptop and notebook computers, and are non-threatening. Yet it doesn't feel right to say that a spreadsheet is a database, period. Databases are more than just spreadsheets.
By understanding the popularity of spreadsheets as databases, though, you can get a handle on one set of things that databases do. Databases provide users (even non-technical users) with facilities for finding, organizing, sorting, viewing, and printing simple sets of data, primarily organized as tables.
The importance of tables as a way of organizing data can't be overestimated. Nothing beats a table for simplicity as a paradigm for collections of records. But this simplicity is the biggest strength and weakness of databases organized around the table concept. As I explain later (particularly in the next chapter), widespread recognition of the table's inherent simplicity and power has led to the rise of so-called relational databases; limitations in the table's expressive power will lead to the replacement of the relational model in the '90s.
Spreadsheets can represent many kinds of information, not all of which are databases or parts of databases. So what makes a set of information a database?
Records and fields are one element of a database. Suppose that you and I own a collection of videotapes we want to organize and manage. We decide to build a database. First, we decide what our database should contain. Then we design a small form on which we enter information about the videotapes. Each time we fill in a form, we create a record, describing a videotape. The record, in turn, contains fields, describing the videotape's title, length, format, subject, and author, as well as where we can find it in our collection.
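To make the idea concrete, here is a minimal sketch in Python; the field names are illustrative assumptions, not a prescription:

```python
from dataclasses import dataclass

# One record describes one videotape; each attribute is a field.
@dataclass
class VideotapeRecord:
    title: str             # e.g., "It's a Wonderful Life"
    duration_minutes: int  # length, always measured in minutes
    format: str            # e.g., "VHS" or "Beta"
    subject: str           # e.g., "Drama"
    author: str            # the director or producer
    location: str          # where the tape sits in our collection

# Filling in the form once creates one record.
collection = [
    VideotapeRecord("It's a Wonderful Life", 130, "VHS", "Drama",
                    "Frank Capra", "Shelf A3"),
]
```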
At the core of any database -- any size, any kind -- are records. Records, in turn, consist of collections of fields. But by themselves, records and fields are not necessarily a database.
For example, we may keep track of our videotapes on 3" x 5" index cards, but if we're not well organized, we may enter different information on every card. One card may list a title and duration in minutes. Another may have a title and lead actor's name. Another may have a title, duration in hours, and information about the tape's format. Superficially, some may consider a pile of such index cards a database. But the first time we try to work with this "database" as a whole, the limitations of this approach quickly become obvious. How do we sort by length, for example, when some cards list no duration, some have durations in minutes, and some are shown in hours? Most important, even if we can deal with the inconsistency, we don't have a standard way to find the information about duration on the card. Knowing how to find information in a record is a particularly tricky issue in a database environment; when we solve this problem, we convert a pile of random records into an organized database.
Beyond issues of neatness and consistency, this problem is more philosophical: How do we know what a field or record means? What is the meaning of our data?
Coming to our rescue, a database provides a mechanism for interpreting, or understanding, the meaning of data. In other words, the database is more than a container for data -- it provides a context for interpreting that data. Okay, that sounds pretty abstract. What does this philosophical stuff really mean to users?
Regularity, structure, and schema are part of this picture. To transform our pile of 3" x 5" cards into the beginnings of a database, we impose regularity on the data. Every index card in our example has content and structure. The content varies from card to card; the content is the actual data (Capra, John Wayne, 35 minutes, VHS, and so on) written on each card. On the other hand, the structure describes kinds of data common to all the cards; the structure may call for each card to contain (in a particular order) title, duration, subject, format, and so on. The structure description, called a schema, goes even further, defining how to measure or locate the kinds of data; for example, the schema defines the unit of measure for the duration field (minutes or hours -- but the same for all cards) and how to recognize the title field (perhaps it's labeled "title" or maybe it's always the first field -- but again, the same for all cards). To define a database table, we combine regularity and structure. Simple.
Although most people think that a database is just a huge collection of data, a file may contain a great deal of data and not be a database. In fact, ironically, a database actually need not contain any data to be a database. A database is defined by the organization of its data (structure and schema) and the imposition of a consistent order (regularity) to that data.
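As a rough sketch of the idea (the representation here is invented for illustration, not a standard), a schema can itself be written down as data that describes the fields every card must carry, and each card can be checked against it:

```python
# The schema describes structure, not content: which fields exist,
# in what order, and how each one is measured or recognized.
schema = [
    {"name": "title",    "type": str, "note": "always the first field"},
    {"name": "duration", "type": int, "note": "always in minutes"},
    {"name": "subject",  "type": str, "note": ""},
    {"name": "format",   "type": str, "note": "VHS or Beta"},
]

def conforms(card: dict) -> bool:
    """A card belongs in the table only if it matches the schema."""
    return all(f["name"] in card and isinstance(card[f["name"]], f["type"])
               for f in schema)

print(conforms({"title": "Stagecoach", "duration": 96,
                "subject": "Western", "format": "VHS"}))   # True
print(conforms({"title": "Stagecoach", "duration": "1.5 hours",
                "subject": "Western", "format": "VHS"}))   # False: wrong unit
```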
You fill in forms all the time. Start a new job and you fill in employment forms. Apply for a passport or driver's license, order a product, reserve a book at the library, or pay a speeding ticket, and you fill in forms. Does a form have any meaning by itself? Many questions of this type have surprisingly tricky answers. (Does a falling tree make a sound if nobody is present to hear it fall?) Our question, however, has a simpler answer.
Suppose that you receive by mistake a form from the company where one of your friends works. Out of curiosity, you start to fill in the form (or perhaps it's already filled in and you merely peruse it). What is the employee number? Department code? Product family? Profitability range? Part type? Employee classification? Division name? Typically, the form contains many fields that are meaningful only to somebody working for the company. Even fields with meaningful labels can easily have different interpretations at different companies.
Take something as simple as price: does it mean dollars or francs, purchase or lease, before or after taxes, asking or selling price? A form taken out of context is literally meaningless. It cannot be interpreted until it is placed in the framework of the environment for which it was designed. The same is true for database records.
From a user's perspective, a database is a collection of tables, or files. (Generally, the terms table and file are interchangeable, as are field and column. Refer to the next chapter for some of the subtle differences between these terms.) Each table contains records that, in turn, contain fields. By organizing a collection of items, such as records with fields, we are providing a regular structure that allows a user to interpret those records. The database provides a way of organizing and, therefore, understanding the data.
Databases are also self-describing -- they are both a tool and the information the tool works with. In a lifelong search for self-awareness, many of us struggle for ways to define and describe ourselves. Perhaps it is a reach to talk about databases as being self-aware, but unlike most other parts of a computer system, a database can describe itself. For example, a collection of records must have a regular structure described by a schema (literally, each table and record is described by a schema, written out in a special language that in turn can be typed in or printed out).
Normally, we don't see this schema because it is hidden from us by the graphical database software, which allows us to define and manipulate the structure of the database graphically. On request, however, the database can tell us all about each table and its structure. But why is this self-descriptiveness so important as the capstone of our database concept?
A big difference between an ad hoc pile of 3" x 5" index cards and a database is the ability to work with the data. A database helps users find particular records, sort the records, produce statistical summaries, and so on. The reason a database can do all that is that the database tools -- the sorter, the query manager, the reporting tool -- know how to interpret all the records in the table. Tell the database that you want to sort, and the database asks which column you want to sort by.
Opening a table brings with it three things:
- The contents of all the records
- A regular structure that allows the records to be interpreted
- A description of that structure that allows tools to understand the table even though the tools may never have seen that table before
Ultimately, a database is both a tool and the information with which the tool works.
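To see this self-description in action, here is a small sketch using Python's built-in sqlite3 module (the table is a stand-in for our videotape example): the database itself can be asked to describe any table's structure.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE videotapes (
    serial_no INTEGER PRIMARY KEY,
    title     TEXT,
    duration  INTEGER,   -- always in minutes
    format    TEXT
)""")

# The schema is stored alongside the data, so a tool that has never
# seen this table before can still learn its structure.
for column in con.execute("PRAGMA table_info(videotapes)"):
    print(column)   # (position, name, type, notnull, default, primary-key flag)
```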
I've described what a database means as it applies to single tables -- simple databases, but databases nonetheless. Let me explain how this concept works with our original database, the spreadsheet. Then I look at a few other examples of simple databases to see the concept in full play.
Not every spreadsheet is a database and, even in spreadsheets that contain database tables, often the whole spreadsheet is still not a database. Spreadsheets are very good at handling collections of irregular data, which are not databases or database tables. In fact, the whole point of a spreadsheet is that you can just type values anywhere. Moreover, another strength of a spreadsheet is that any cell can be a formula. If all spreadsheets were databases, every row would have to contain the same formulas, and we would lose a lot of the power of the spreadsheet itself. So, for all our talk about regularity, databases are not the whole world. Ah, but what about the parts of a spreadsheet that are databases?
The way that Excel, 1-2-3, and other spreadsheets accommodate databases is by providing the concept of a database range: a single spreadsheet can have any number of these ranges. Each database range is equivalent to a single database table. A database range in a spreadsheet is a set of rows, all with the same columns, and all described by names contained in the first row of the range. In other words, a database range in a spreadsheet has the key characteristics of a database as I defined it previously: uniformity, structure, and self-descriptiveness.
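In spirit, a database range is nothing more than a header row naming the columns plus data rows that all follow it. A minimal sketch, with invented names:

```python
# A "database range": the first row names the columns; every following
# row supplies one record with exactly those columns.
range_rows = [
    ["Title", "Duration", "Format"],          # header row: the description
    ["Stagecoach", 96, "VHS"],
    ["It's a Wonderful Life", 130, "VHS"],
]

header, *rows = range_rows
records = [dict(zip(header, row)) for row in rows]
print(records[0]["Title"])    # Stagecoach
```

Where else can we find database tables?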
Word processors can create and maintain simple database tables. Most word processors, for example, do an excellent job with mailing lists, including powerful mail-merge capabilities. A mail merge involves two components. First, you must create a document containing the mailing list. If you look at such a document, you will find a list of records, each with a regular set of fields, and all tied together by some form of self-descriptive structure. The mailing list may contain an initial record specifying the order in which fields can be found in the following records: Surname, First Name, Title, Mail Address, and so on. Alternatively, to allow more flexibility, each record may be self-descriptive. In the latter case, every record would have special tags of the form <Surname>, <City>, and so on. Either way, a mailing list in a word processing document is just another database table.
Of course, industrial-strength database tables also all have these characteristics. Full-strength databases offer a combination of power and complexity when it comes to building and maintaining tables. On one hand, creating tables in the first place requires you to learn a whole series of complex commands. As a result, building a database table in an industrial-strength database environment is a daunting challenge for the average person. On the other hand, after you have created the database tables, the database provides powerful commands for changing and improving the tables over time. Commands are provided directly for modifying the structure of the database independent of the content; the structure, when established, can be printed, queried, and manipulated. You can combine tables with other tables in powerful ways, and the system can accommodate complex queries and reports in powerful ways as well.
Is this it? Not quite. The problem is that although the databases described so far have an appealingly simple conceptual structure, they're also far too simple and limited to deal with large problems.
As we've looked at them so far, databases are single tables or files. We can have many such tables, but surely we won't be satisfied working with just one table at a time. So this section describes how database tables get pulled together into complete databases.
A database is a collection of tables, each organized as a set of records, all pulled together by a defined set of relationships between the tables. Databases are powerful because they can create complex sets of relationships between tables.
To illustrate, suppose that your videotape collection grows and you start lending tapes out. Later, in a spirit of entrepreneurial adventure, you begin renting and selling tapes, too. Now a single file with a record per videotape is not enough. You need files to do the following:
- Track customers
- Manage the vendors you buy tapes from
- Track special orders that can't be lost
- Keep employee lists (boy, the company's growing)
- Maintain the accounting and bookkeeping records
All of a sudden, your simple database is starting to sound like the kind of large, complex system we find on mainframes. What happened?
At one level, mainframe databases and spreadsheet-resident databases are not all that different. All the information in a database is simply collections of records with a regular, defined structure. However, by representing the relationships between those sets of records, you have more than just a table -- you have a database. What about these relationship things? They must be complicated, right? Wrong.
Suppose that a customer wants to rent a videotape for the weekend. How do you put this information in a database? First, the videotape must be represented in the database file describing all videotapes. Next, the customer has to be in another database file containing a record for each customer. If the customer is a new one, you have to create a new record for her before you can let her rent movies. As you'd expect, the customer record contains her name, address, credit card number, and so on. So far, you have two files (sometimes called tables) and no relationships.
In order to record the rental, you must create another record in a database file that contains an entry for each rental. A rental record would contain the following kinds of data:
- The movie
- The customer
- When the tape went out
- When it's due back
- When it was returned
I would like to look at this record in slightly more detail.
In a rental record, how will you identify the videotape being lent out? You could enter the actual title, but then how do you know which copy of the tape it was? How about title and copy number? Definitely a possibility, but you'd have to store a lot of information in two places. Consider the same question for customer: how do we identify the customer borrowing the tape? Use his name? What if two customers have the same name? How about address? What if he moves? Worse, what if he changes his name and moves? Or what if two customers with the same name happen to live at the same address? There must be a better way. There is.
When you create the record describing the videotape, you arrange to enter a unique identifier for the tape. For instance, every time you buy a new videotape, you assign it a unique serial number. Similarly, when a customer rents a tape for the first time, you assign him a unique customer number. Pretty standard stuff. However, you also need to identify the videotape and the customer in the rental record. So our rental record starts out with a customer ID, a videotape serial number, the date the movie went out, the due date, and so on.
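Here is how those unique identifiers might look in a concrete schema. This sketch uses Python's built-in sqlite3 module; the table and column names are my assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- the unique customer number
    name        TEXT,
    address     TEXT
);
CREATE TABLE videotapes (
    serial_no   INTEGER PRIMARY KEY,   -- unique per physical tape
    title       TEXT
);
CREATE TABLE rentals (
    rental_id   INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    serial_no   INTEGER REFERENCES videotapes(serial_no),
    date_out    TEXT,
    date_due    TEXT,
    date_back   TEXT
);
""")
```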
Hang on a second. This means our rental record depends on our customer and videotape records. Right: the three tables are linked; there is a defined relationship between the files. Rather than duplicate customer data and videotape data in the rental record, you simply link the customer table, videotape table, and rental table together. Additionally, you might link these tables to an employee table indicating who checked the movie out. The videotape record could also be linked to supplier records so that you know where the tape was bought. Finally, you could also link the videotape record to purchasing records indicating how many copies of the tape were bought, when they were bought, and how much they cost. Even a video store may easily require a database with up to a hundred files and dozens of relationships linking all those files to each other in simple and complex ways.
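A sketch of how the linked tables come back together: the rental record stores only keys, and a query follows the relationships to reassemble the whole picture (again with invented names and sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE videotapes (serial_no INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE rentals (customer_id INTEGER, serial_no INTEGER, date_due TEXT);
INSERT INTO customers VALUES (7, 'Pat Smith');
INSERT INTO videotapes VALUES (1041, 'Stagecoach');
INSERT INTO rentals VALUES (7, 1041, '1994-03-12');
""")

# The rental row holds only identifiers; the join follows the
# relationships so the three records function as one big record.
row = con.execute("""
    SELECT c.name, v.title, r.date_due
    FROM rentals r
    JOIN customers  c ON c.customer_id = r.customer_id
    JOIN videotapes v ON v.serial_no   = r.serial_no
""").fetchone()
print(row)   # ('Pat Smith', 'Stagecoach', '1994-03-12')
```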
The relationships that convert a set of unrelated files into a database parallel and track the operation of an organization. Designed properly, a database is a model of the organization's information.
A table describing a collection of videotapes can be used in lots of places: a home, a store, the library of a company. Similarly, a customer file can come from many places. Your store's database, however, consists of a customer file, a videotape file, another file tracking rentals, and another file tracking stores. That database actually does a pretty complete job describing the entire operation of your store. You could deduce the structure of this database from an understanding of the store's business, and conversely a trained observer could figure out quite a lot from the database about how that store runs. Because the database is such a good model of businesses, database design is centrally important to both the mainframe and client/server revolutions.
A quick message from the sponsor. Very quickly in the last page or so, I have shifted from records and 3" x 5" cards to a discussion about models of organizations, mental maps, and databases as modeling tools. Can this have any practical outcome? Absolutely. The database concept turns out to be a very powerful thinking tool for understanding a major aspect of how organizations operate. Now about 20 years old, this concept is one of the first foundation blocks of the client/server revolution.
In the '50s, permanent storage was essentially unknown. Computer programs were written on paper, transcribed onto paper tape or punched cards, and then read into the computer at the time the program was run. Any data the program worked on was also read while the program was running. Then the program would process and print the altered data on paper, punched cards, or tape for later reuse. A computer program and the data it worked with were all temporary creations.
Magnetic tape (mag tape, for short) was the first form of permanent data storage. Like audio cassettes and videotapes, computer mag tapes can be reused many times. On a mag tape, information can be read, worked on, changed, and rewritten.
However, one problem with both cassettes and videotapes is that finding selections takes a long time. Also, getting from one end of the tape to the other takes forever. Mag tapes have all the same problems. For over a decade, the computer in popular culture came to be associated with the magnetic tapes used to store customer files, billing information, and so on. Ultimately, computer mag tapes were too limited in performance to make databases imaginable or practical.
In the mid '60s, when the IBM 360 product family was taking off, disk storage first became practical. The early disks were huge, slow, expensive devices -- hard to take seriously today. The disks themselves were enormous iron platters, specially made heavy so that their momentum would ensure smooth spinning. Yet, in spite of weighing so much that a strong person could hardly lift them, these disks were lucky to store 10 million bytes of information. Nonetheless, even with all their limitations, disks provided a key breakthrough.
Unlike a tape, a disk is a random-access device. Information anywhere on the disk can be retrieved about as fast as information anywhere else on the disk. No more waiting forever while the tape rewinds. No more rewriting the whole tape just to make space to insert a new item near the beginning. In fact, a magnetic disk doesn't have a beginning or end: the whole disk -- or selected parts of it -- are always available to be read or written. This characteristic of being able to read and write data anywhere, anytime, is referred to as random access: the ability to access data randomly, as computer people say. The fact that magnetic disks enabled random access led to revolutionary new methods for storing data on computers.
Because there is no beginning or end to a magnetic disk, finding any one customer record is as fast or as slow as finding any other. Furthermore, because customer records don't all have to be stored in sequence, the records can grow, shrink, and change arbitrarily. But how does the program find the records on a disk? After all, they can be stored all over the disk. The answer is that the software keeps an index that lists where particular records are physically located on the disk.
So, given this random-access storage, programmers started writing applications that used indexes to keep track of records stored all over the disk. If a record changed, the application would simply write the changed record back to the disk. If the record got too big to fit into its current location on disk, the program would put it in a new location and change the index to point to the new location. Everything kept working transparently. The big, new feature? Keep the data around all the time! Yes, on-line storage.
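Here is a deliberately simplified sketch of that index idea (real systems use structures such as B-trees and fixed-size pages, but the principle is the same): map each record's key to a byte position, then seek straight to it.

```python
import io

# A stand-in for the disk: records live at arbitrary byte offsets.
disk = io.BytesIO()
index = {}   # record key -> byte offset on the "disk"

def write_record(key: str, data: bytes) -> None:
    disk.seek(0, io.SEEK_END)       # append wherever there is free space
    index[key] = disk.tell()        # remember where the record landed
    disk.write(len(data).to_bytes(4, "big") + data)

def read_record(key: str) -> bytes:
    disk.seek(index[key])           # jump straight there: random access
    length = int.from_bytes(disk.read(4), "big")
    return disk.read(length)

write_record("cust0007", b"Pat Smith, 12 Elm St")
write_record("cust0042", b"Lee Wong, 9 Oak Ave")
print(read_record("cust0042"))      # found without scanning the whole file

# If a record outgrows its spot, write it somewhere new and point the
# index at the new location -- exactly the trick described above.
write_record("cust0007", b"Pat Smith, 800 Long Boulevard, Apt 12")
print(read_record("cust0007"))
```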
On-line means "available on the computer at all times." Magnetic disks allowed many programs to share on-line storage. On-line storage triggered a major revolution in software design. For the first time, programmers could write applications that shared the same data.
Recall how the video store's database grew complex so quickly? Think about all the places the customer table could be used. Customers rent videotapes, receive mailings, place special orders, make complaints, apply for credit, change addresses, join clubs, enter contests, become the targets of special promotions, ad infinitum.
Now think about how many times you've entered your name and affiliation since buying your own personal computer. Doesn't every piece of software you install ask you to type your full name and organization? Perhaps it even refuses to start the installation process until you do so, which is annoying. Why can't all those applications use a standard place for recording and finding your name? If necessary, the application could start out by asking you to confirm this data, but at least you wouldn't have to enter it over and over and over.
Everybody has heard the story of the company president who calls an emergency meeting because the company has a problem. Vice presidents show up, each with his computer-generated reports ready to provide an analysis of the causes and solutions. As it turns out, everybody thinks the problem is something different. When the reports are compared, all generated from the same computer system, the numbers all violently disagree with each other. How does this happen? If everybody is operating with reports generated by the same computer, how can the resulting reports disagree so much? Although the data may come from the same computer, that data is not necessarily a consistent database. Unfortunately, databases often have multiple copies of data stored over and over. Each repetition creates a new opportunity for inconsistency and irate presidents. The Copernican computer revolution focused on eliminating inconsistencies in data.
In 1543, Nicolaus Copernicus proposed that, rather than the sun revolving around the earth, the earth revolved around the sun. He turned the world view inside out. How does the computer Copernican revolution apply to database and application design? Simply, the database revolutionary manifesto asks you to design your applications by focusing first not on the application itself but instead on the data that essentially defines the operation of your organization. Instead of having files revolving around individual programs, have programs revolving around files. The database becomes the sun, and the programs become the planets.
In the '60s, the world of computers revolved around applications. Systems were built around applications. The leading professionals were programmers. The professionals built the systems by developing the programs that the computers revolved around. In building each program, programmers spent much effort managing data. When a program ran, its tapes were mounted; when it was done, the tapes were taken down. Sometimes tapes might be shared by several programs. But fundamentally, programs came first, and everything else followed their needs. The problem with this approach is that it encourages the creation of many different files containing variations on the same information.
In the late '60s, as programmers learned about permanent random-access storage on disks, they developed a new class of software: the database management system. Programmers started to wonder why they were writing the same software to manage data over and over. Along with this thought, programmers realized that perhaps the data was actually more important than the programs that worked with it! At that point, companies began to view the database as a key corporate resource.
In fact, programmers began to realize that by focusing on data rather than on code, they could build applications that more effectively dealt with business change. Our video store will change in the way it operates over time; that's a certainty. As the business grows, laws change, new organizational concepts become popular, and change is one of the few constants. Each time the store operations change, software probably has to be rewritten. In fact, one of the driving forces behind the client/server revolution is the need to accommodate changes in business rules not only company-wide, but on a local basis as well.
Programs change; data doesn't. One of the early propositions of the database revolution, this rule has turned out to be true far more often than it's false. A customer remains a customer, no matter how much the business changes. She has a name, customer number, street address, age, color preference, and a host of other attributes, recorded in fields, and all of which tend to stay quite constant, even as the rules that drive the business change over time.
At first, databases were viewed as powerful technical tools. After a short while, though, system designers realized that the structure of the database could provide a powerful vehicle for understanding how a company was fundamentally organized. Although this idea has proven to have some fundamental limitations in practical use, a database can serve as a powerful model for the company as a whole. As a result, to design better databases, software professionals created the data model, a powerful tool in its own right.
To understand the power of the data model, recall the company president's emergency meeting in which every vice president has inconsistent reports from the very same computer system. Fixing this problem requires that group to make three observations:
1. Information is a key corporate resource.
2. An information system that makes accurate, consistent information available instantly is a competitive advantage.
3. Arriving at that point requires information to be managed differently than is possible with purely manual approaches.
This section takes a look at these observations in more detail.
Data as a competitive advantage. In a national retail chain, how long does it take to find out which new styles are selling particularly well? (Assume that each store only reports sales back monthly through divisions, which, in turn, consolidate their summaries quarterly.) What is the exact value to the company of being able to spot hot sellers in one or two days and reorder in a week? For customers of a direct sales operation, how attractive is it to know that any product they order before midnight will be delivered before 10 the next morning, no matter where they live in the U.S.? How much can we save in our chain of video stores by being able to balance inventory across stores so that titles moving quickly at one store can be backed up by tapes from another store where the title is a dud? All of these examples share a common element: having up-to-date, accurate information is literally the same as money. It is worth spending quite a bit to have that information because of the competitive advantage and profit it will generate.
How databases generate money. Recall version one of the client/server story: the server allows a team to share information effectively without requiring any people to act as information coordinators. I even talked about the server acting as a database. Now I'm going to reactivate that concept and talk about what is required, from a design perspective, to make it all work. I'm going to take that term database and put some more substance around it. And in the process, you're going to see how tables, relationships, and models play a central role in the design of client/server systems.
Part of the reason that databases are so hard to understand is that they are so many different things. What are the three meanings for the word database?
- A central repository
- A personal information retrieval and analysis tool
- A conceptual model
In the first place, a database is a central repository and coordinator of data for companies, departments, and teams. By placing critical shared information in the care of the database, as you saw in the server chapter, you can control inventory, seminar invitations, airplane seats, and credit dollars. Many people can access, inspect, and change this data, all at the same time, all seeing the most up-to-date version of the information, and all without running into each other. The database manages both the information and the coordinated access to it. In this context, the database is playing a very active role. (In fact, talking about databases in this way, computer professionals often talk about database engines. The database engine is software that manages the basic storage functions of the database.)
In its second persona, the database provides individuals with facilities for retrieving and analyzing information -- finding, analyzing, manipulating, displaying, and printing large amounts of data. In this context, Lotus 1-2-3 is a database. Even large database systems that do a superb job in the first role still depend entirely on other retrieval and analysis tools to make data available after it's been generated. Sometimes the tool is highly personal, used by an individual to find and work with information at a workstation. At other times, the tool is part of a large application that produces reports.
Third, database refers to the conceptual model of how a company works. An organization's success at developing a useful conceptual model of how its business operates, more than anything else, determines its success at building a useful database.
For example, how does an investor use information to pick stocks? There are as many answers as there are investors. A variety of ratios, patterns, and indicators can be used to evaluate prospective stocks: price-to-earnings ratios, alphas, betas, liquidity measures, comparisons to other stocks, and so on. All of these techniques attempt to reduce the mass of details that describe a company's performance to a small set of metrics that will predict future behavior. In other words, investors use models to evaluate companies. Every investor develops a model of prospective investments in his head and then uses that model to evaluate stocks one by one. Over time, the investor makes his model increasingly sophisticated and, hopefully, more accurate.
In the same way, a database is a model, based on data, of the real organization. Understanding how to build that model appropriately is the core issue in designing effective database applications.
Understanding database design is both important and interesting, so it's time to have a go at it, starting with the video store. By the time you're done, you'll see that a small number of core principles drive the whole process. Start by thinking about customers, employees, and people.
In most organizations, customers are ubiquitous; most applications deal with customers in one way or another. As a result, it's easy to design software so that customer data is spread all through the system. What happens when a customer moves: how does his address get changed everywhere? In the worst case, the customer's address is duplicated in dozens of applications, each of which has its own files, its own customer-handling programs, and its own customer screen forms. In this case, each department that deals with customers has to be told about the address change and then arrange to have that address change entered separately. This expensive and error-prone process may sound awkward, but it is by far the rule, not the exception. What's the alternative?
What would happen if all information about customers were stored in only a single file, no matter how many times that information was used in different applications? This approach has three huge advantages.
It's easier to change the data. Changes to the customer's name and all other related data can be made in a single place at a single time. As soon as the changes are made, all applications would immediately see the up-to-date information.
Consistency. By storing all the customer data in a single file, you would not be able to have two different versions of the customer record in the same system at the same time. Therefore, it would not be possible for two different reports to be in conflict about customers.
Conserves disk space. Databases get very big and disk space can be very expensive. By saving customer information in only a single file, you save a lot of disk space.
In addition, because so many programs would use the one database file, you can afford to work extra hard to organize that file so that retrieval is truly fast, efficient, and flexible.
All of these benefits are related to the first idealized goal of database design: the elimination of redundant data. When customer records are stored in more than one place, all the copies after the first can be thought of as redundant. After all, a customer has only one name, one address, and one identification number. So why store that information more than once?
Cutting to the core, the most important benefit is the elimination of inconsistencies. If the same data is stored in more than one location, then no matter how careful you are in building and maintaining your applications, that data will eventually create inconsistencies. These inconsistencies are the true, fundamental cause of the conflicting answers the president gets when he tries to find out how his company is doing.
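A small sketch of the payoff, again using sqlite3 with invented names: because the address lives in exactly one row, a single update is immediately visible to everything that joins to the customer table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE rentals   (customer_id INTEGER, serial_no INTEGER);
INSERT INTO customers VALUES (7, '12 Elm St');
INSERT INTO rentals   VALUES (7, 1041);
""")

# The customer moves: one change, made in one place.
con.execute("UPDATE customers SET address = '9 Oak Ave' WHERE customer_id = 7")

# Every application that follows the relationship sees the new address;
# there is no second copy to fall out of step.
print(con.execute("""
    SELECT c.address FROM rentals r
    JOIN customers c ON c.customer_id = r.customer_id
""").fetchone())   # ('9 Oak Ave',)
```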
Building a data model that completely eliminates these inconsistencies has many subtle challenges, which I'll consider in the pages ahead. But the basic principle is simple: designing a database that produces consistent and meaningful answers to questions means designing a database in which each key piece of information is stored in only a single place. (However, databases often provide special facilities for duplicating critical data to provide users with exemplary performance. For example, a database may literally replicate itself so that a user can keep her own personal copy on a notebook computer she carries around. As was discussed in the previous two chapters, even this type of replication is now becoming pretty standard.)
The principle of storing data in only one place can drive the design of large databases. For example, in our video store, the principle can be applied to customers, products, rental records, payments, employees, and every other aspect of the business. In fact, if a database designer refined a design until each data element was stored in a single file only, he would produce a database that probably would be consistent over time. The benefits of this approach reach far beyond single tables.
The database can be viewed as an interconnected whole. To track a videotape rental, the customer record is related to a rental record, which, in turn, is related to an inventory record keeping track of that particular tape. Also, the customer record is related to a credit rating record, which, in turn, is related to various payment records that can be used to compute late payments.
The common element here is the concept of relationships. A relationship creates a link between two records so that, for the purposes of a particular application, the two records function as one big record.
In each case, the relationship serves two purposes:

- To allow information to be stored one time only, even though it is used over and over in different combinations. The customer record, the videotape description, and the like fall into this category. Simplistically, you could store each of those records over and over each time you need the information. But for all the reasons I've already explained, storing the information once and then establishing relationships among records leads to a better design.
- To allow a record, once stored in a single place, to be used by many applications. For example, the customer record can be related to rental records, payment records, credit ratings, direct mail lists, club memberships, and so on.
The idea of relating tables to each other is incredibly powerful. For the first time, relationships among tables provide us with a tool for building a model representing all the information an organization works with. This model can provide a road map for finding this information and can also serve as a basis for ensuring that the information is stored in a consistent fashion. How do we build a model, gain an understanding of what all those relationships can look like, and do it right? It comes back to eliminating redundancy.
Data models allowed managers and computer professionals to start thinking about how the information and organization fitted together. Developing a design for the organization's overall database turned out to mean developing a model, a road map, for the organization's information needs. The designer who could describe a coherent scheme for organizing all the data inside computers could also describe a coherent scheme for answering most questions managers would ever ask about the health of the organization.
Suddenly, database design becomes far more than a technical proposition. What's one of the toughest challenges facing any senior manager? Knowing how her organization is really doing! Problems discovered long enough in advance can almost always be solved. The challenge in most organizations is usually not solving problems per se, but finding out about those problems long enough in advance to have time to solve them. And invariably, when a company runs out of time, it is because it was either asking the wrong questions or getting the wrong answers. Now databases enable companies to ask useful questions far enough in advance to solve emerging problems.
In the early '70s, large organizations reacted slowly to new situations because they had no way of processing information quickly. By definition, large companies have thousands of employees. Each of those employees deals with quite a lot of information every day. However, it was tough to collect that information in a form that could be used by others, who were often separated from one another by time, space, and organizational structure.
One obvious example involves slow-moving sales information; companies don't react quickly enough when a product sells exceptionally well or exceptionally poorly. As a less obvious example, why does the customer have to build a new credit rating each time he opens an account with a different part of the same company? The company spends money tracking payment histories multiple times and the customer gets irritated. In the meantime, the company loses an opportunity to encourage the customer to build on his reputation in one part of the company by buying other products and services in another part of the company. Yet, how often have you been told "This is the sales (or service) division, and we have different systems, so you'll have to start over"?
All of these problems represent two core assumptions, which databases and networks neatly reverse:
Here's what changes in the new world:
Before continuing, go back and reread the last six or so paragraphs with the following question in mind: how much of what is being described is fundamentally technical and how much is really a new way of thinking about organizations in general?
When I introduced the server, you saw that by having a tireless, effectively instantaneous, and constantly accessible information coordinator, you could change the fundamental operation of a workgroup or team. Suddenly, product shipments could be confirmed or seminar attendance booked, all instantly, and all without requiring a permanent human coordinator. What is now becoming obvious is that the same principle applied on a large scale can change the fundamental operation of entire organizations of any size.
How large a central staff would it take to track credit card transactions on a global scale, assuming that it could even be done at all with a totally manual system? In fact, imagine a world without telephones, telegraphs, or computers; would a modern credit card system even be possible? The entire basis of the system is that banks, which are essentially lending money each time a credit card is used, can track the exact, up-to-the-minute spending of each individual credit card holder and know immediately every time he makes a purchase and whether he is over his limit. The database, coupled to the network, is what makes this possible.
The challenge of building a database, though, goes even further than that. A credit card database, although large, is still relatively simple because it deals with essentially a single application. The dream, expressed clearly in the early '70s, went further: every organization, big or small, could organize all its data so that all data would always be up-to-date, and all data could be related to all other data when needed.
Imagine that you're starting a new division in an existing company. Perhaps our videotape store is branching out to computer games. Historically, even though you're part of a single company, because you're starting a new division, you would start from scratch with your own computer, your own applications, your own customer list, and so on. Suppose, however, that your company designed its database in the way we've been talking about. The first thing you find is that the customer list is available to you, not just as a set of mailing-list labels, but in more powerful ways.

In building your business plan, you combine data in interesting new ways, considering information, never used extensively before, about video customers' families and the number of children they have. This information is critical in a game-oriented business. As you start to conduct market research, learning more about those family structures, the data you collect becomes available to others as well, through the database, and as a result the marketing department starts special mailings focusing on children's movies. Later, as you build some special software to check out your games, you are still linked to the common customer tables so that when customers move, have privileges revoked, or enter complaints, your new system automatically is part of all that.

After getting your new division off the ground, you look back and find that, through the common database, application development proceeded more quickly, with fewer problems and more powerful results than any of your past experience would have predicted. A compelling picture -- if we can only do it.
Summarized here are the core principles of the database revolution:
Information as the center: Applications should be designed around the database, not the other way around. The data itself is a key corporate resource, common to all departments.
Eliminate redundancy: Information should always be stored in only a single place in the database. In that way, changes can be made only one time, conflicting copies of the same information are intrinsically impossible, and space is used in the most efficient fashion possible. When information of all kinds is stored in only a single place, all information can be linked or related to all other kinds of information, making the database a supremely flexible tool for supporting company-wide analysis and planning.
Source data capture: Information should be entered into the database as soon as it is generated. In that way, the information is in its single, nonredundant, correct place from the beginning, the information is immediately available organization-wide, and changes can automatically be reflected immediately in the (only) correct copy of the data. This means entering both new information and changes to information immediately at the source where the information first enters the organization. By implication, people who work with information should have a terminal or personal computer so that they can enter information into the computerized database as soon as they receive or handle that information.
The computer is the brain, the network is the nervous system, and the database is the memory of the organization. When everybody has a terminal, all information becomes completely up-to-date, available to all other users, and consistently organized. Easier said than done.
In an important sense, the database became a holy grail, ushering in a quasi-religious period in the development of computer systems that is still with us today. In the early '70s, computer professionals and business people alike committed themselves to the crusade of developing complete database systems for their organizations. The dream, the vision, still swimming in front of all our eyes, is of complete information, completely available, and completely consistent, corporate-wide. Why should this be so hard to do?
Database as a concept continues to be important; it is one of the keys to transforming personal computers and networks from personal appliances to empowerment engines for teams and organizations. At the same time, the concept of database, as understood for the last 20 years, has turned out in many ways to be a dead end. On both the technical and conceptual level, it is important for us to understand why it's a dead end and how it fits into the new big picture that will carry us through the next 20 years.
Imagine an experiment based on time travel. In this experiment, we recruit from large organizations a sample of 100 chief information officers (CIOs) who have held their jobs for the last 20 years. (As an aside, one reason we have to imagine this experiment is that the very job we're talking about -- CIO -- is very risky; the average tenure of a CIO today is only about 18 months, so finding 100 such people would be tough even if time travel weren't a problem.) Having found our 100 MIS chiefs, we interview them every 10 years. Over 20 years, this means three sets of interviews: one at the beginning, one after 10 years, and a final interview after 20 years. In our interview, we concentrate primarily on one topic: database. We ask how they feel about the concept, what they're doing to implement one, and how they feel about their progress.
Before starting the interviews, be clear about the fact that I'm talking about ideals here. In the early '70s, many companies were convinced that database was about database technology. If the company bought a database, an expensive proposition because the software cost a lot, then by definition it had obviously implemented database. Not so, the idealists would say. By definition, any company with more than one database, in fact, had no databases! The whole point, as we've already seen, is to store all information in a single database, with no information stored more than once, and with the help of the database technology, to create an environment where all information is consistent and nonredundant and can be combined with all other information. So it's that ideal that we're going to explore with our 100 CIOs.
1970: Everybody is excited. The Harvard Business Review has just described the database concept. Finally, a computer topic that senior management can get really excited about. After 20 years of constantly increasing computer expenditures, management is starting to question how long the organization can keep spending more on hardware, software, and application development. And suddenly, because database implies competitive advantage, it becomes critical to implement a corporate database, including the network required for source data capture. Best of all, management and users are pushing for aggressive use of the technology instead of simply focusing on potential cost reductions.
Ask any CIO in 1970, or in the first half of that decade, and there would have been no question that implementing a corporate database was at the very top of her priority list. Tough? Yes. But nobody would have questioned the ability to complete the task within about ten years. And after the task was done, all CIOs would have insisted, that pesky problem of the darn president getting conflicting answers from different vice presidents would finally go away. A pesky problem, but a nice litmus test too.
1980: Our 100 CIOs are a little more subdued, particularly when it comes to database. Not excited, but perhaps quietly optimistic. After ten years, no large organization has a corporate database in the sense I've described. Everybody still has data stored in lots of different places. Presidents still receive conflicting answers all the time. And combining data from different parts of the company in a way that produces meaningful answers is generally still as hard as ever. Furthermore, even in the more limited ways that databases have been implemented, the raw technology has created many challenges. The software is expensive, and programmers do have trouble learning how to use it. And worst of all, data in a database is not really any easier for users to get at than all that old data that was in separate files. So why any optimism at all?
First of all, everybody -- technologists, managers, users, and consultants -- agrees that database, the concept and vision, is more important than ever. Ten years of increasing computerization have demonstrated just how valuable information can be when it's up-to-date and available. And the more information that's stored in the computer, the more information kept on-line, the more important organizing that information in a flexible and usable fashion becomes. So management, although less idealistic, is even more anxious, and perhaps patiently prepared to do what it takes to end up with a usable database rather than a collection of computer files. So in 1980, CIOs are contrite but determined.
Second, having spent the better part of ten years chasing this dream, our CIOs can finally see a plan taking shape. That decade of experience has generated a lot of wisdom. Database technology has matured, and newer approaches, particularly relational databases (discussed in the next chapter), seem to offer a lot of hope. Finally, the importance of database design -- the very topic we've already spent so much time on -- has been recognized. New techniques for representing the design of a database -- which we'll look at next -- have been developed. Software tools -- CASE (computer-aided software engineering) -- have even been proposed to help database designers in their difficult task, so maybe the computer can even help in the very task of figuring out how to use it better. Put all these pieces together, and CIOs feel somewhat humbled by their task and quietly optimistic that the next decade is the one in which their corporate databases will finally be built. Good thing, too, because there's really not much choice if we're to both use computers effectively and manage our organizations well. In fact, the one scary thought in all of this is: what if somebody else gets there first?
1990: An uncomfortable place to be -- our interview room. Nobody wants to talk about database, not really. Has anybody built the corporate database we described? If you're talking about large organizations, in 1990, or even 1995, nobody has yet built a database that is completely consistent, duplicates no data, and answers all questions. Presidents still get conflicting answers every day in virtually every company of any size.
Perhaps building such a database is no longer important? Do our CIOs rank building the corporate database as one of their top ten priorities? Interesting question, but the answers are even more interesting. Nobody wants to admit that it's a top ten priority. Yet, nobody wants to say it's not! What do we get in putting together our list of top ten priorities? Nine straightforward goals and one goal that's not not a top ten priority. That's right, a double negative: a priority so important that none of our CIOs will leave it off the list, and yet a priority which elicits so much frustration that not one of the hundred will put it on the list. How can this be?
In most areas of computer technology, progress is constant and rapid, and, over time, almost all tricky problems seem to solve themselves if they're just left alone for long enough. Not so, apparently, in the case of database. Here's a case where we've had a common dream for over two decades and spent large amounts of time and money, with regular infusions of new religious icons to help light the way, and yet we're still as stubbornly far as ever from having unified, consistent, useful corporate databases. Is there an explanation?
It turns out that the answer is yes. There are three reasons that building our vision of the corporate database has proven so difficult:
Data modeling is hard: As I discuss in the next section, building a complete data model that represents all the information used by an entire large corporation is very hard, maybe even impossible.
Performance is a real problem: Paradoxically, the very things that are most appropriate for designing databases to be consistent, flexible, and accurate are the very things that make those same databases complex and slow to process. As a result, once MIS professionals understood the very techniques that would best lead to a highly consistent database, those same professionals were forced to recommend against using those techniques because of the performance problems they create.
The basic model is wrong: One of the main tenets of the database crusade was the creation of a single and therefore central database for the entire organization. The idea that it was a net plus to have a single database turns out to be exactly the problem. Both from a perspective of design feasibility and from a perspective of performance feasibility, what we want is not one database but many. And then we want a mechanism for integrating those databases, tying them together, and providing senior management with a corporate-wide view. Even more to the point, though, we need something more than just the database at the center of our world to pull all the pieces together. We need a second Copernican revolution -- a new concept that builds on top of databases and provides a framework for building large, distributed systems that function in an integrated fashion.
Cooperating components: As you'll see later in the book, that second revolution revolves around the concept of cooperating components. Large systems need to be decomposed into smaller, manageable systems called components. Each of these components, or sometimes sets of components, will have databases at their center. But the system as a whole will no longer have a single, large monolithic database at the center.
Two levels of databases are needed: If there is no single central database, how does senior management get an overall perspective on the business? It turns out that providing that corporate-wide view does not mean that every last piece of operational data has to be in the one universal repository. In the final ironic twist of fate, as we start to think about empowered employees and self-managed teams, the last thing we want is a monolithic database. Instead, what we need as a new vision is the idea of a federation of databases, each internally consistent and all tied together by yet another database, itself part of the federation, designed to meet the needs of senior management. The federation of databases fits in perfectly with the complementary vision of designing around cooperating components. Families of components will revolve around particular databases. The components in total form the overall business system, which is built on the basis of functional specialization and cooperation. And the databases managed by those various components form the data federation that feeds the higher-level database that is watching the business as a whole.
Finally, we're in a position to articulate a database vision for the '90s. In the '60s, there was no concept of database; the application was king. In the '70s and '80s, after realizing the central importance of data, building that corporate database became the central unifying vision for every organization big and small, at least when it came to computers. How did the concept of database become translated into the concept of central database?
A database serves two major purposes:
The database supports the operation of the company: it tracks orders, inventory, customer addresses, payments, invoices, shipments, and all the other millions of operational records that define the nervous system of the organization.
The database provides a model of the organization's health and functioning that supports the analysis and decision making that defines much of management.
The combination of these two functions creates the apparent need for a single central database.
How closely should these two functions be linked in the structure of the database?
One function, what we called the operational function, can also be defined in terms of transactional applications that operate with realtime data. That is, when checking the status of a shipment, we want to know where that shipment is now, in realtime. The other function, the one that supports analysis and decision making, is often called decision support. One key aspect of decision support applications is that they do not operate with realtime data. Suppose that you're doing a sensitivity analysis to determine whether to raise the price on a new product line. Often this type of analysis is called what-if because the question being answered is generally of the form what if... An intrinsic characteristic of what-if analysis is that the same question is asked several times, with a single factor (in this case, the price) being changed; then the results, perhaps sales volume, are compared to see if a higher price led to lower sales. Here's the catch: suppose that you used realtime data in your what-if analysis. The data would keep changing underneath you all the time. Obviously, this would invalidate the whole point of your decision support system. So decision support applications, as compared to operational applications, not only don't work with realtime data, they also can't work with realtime data.
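To make the point concrete, here is a minimal sketch in Python, with invented order data and illustrative names, contrasting a what-if comparison run against live data with one run against a frozen snapshot. Only the snapshot gives answers that stay comparable from one run to the next:

```python
import copy

# Hypothetical live operational data: new orders arrive while we analyze.
live_orders = [
    {"region": "East", "price": 2.50, "units": 100},
    {"region": "West", "price": 3.00, "units": 80},
]

def revenue(orders, region):
    """Sum revenue for one region -- our what-if comparison metric."""
    return sum(o["price"] * o["units"] for o in orders if o["region"] == region)

# Decision support: freeze a snapshot first, then compare as often as we like.
snapshot = copy.deepcopy(live_orders)

east_before = revenue(snapshot, "East")
# A new sale lands in the middle of our analysis...
live_orders.append({"region": "East", "price": 2.50, "units": 40})

# ...so against live data the second run silently disagrees with the first,
assert revenue(live_orders, "East") != east_before
# while the snapshot keeps every rerun comparable.
assert revenue(snapshot, "East") == east_before
```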
Historically, an important part of the database dream was the idea that analytic and transactional databases should be linked and should be one and the same. The idea behind this goal is simple, even if it's wrong. Analytic databases support managers and decision makers of all types. At the top, ideally, the president, asking that infamous question of her vice presidents, receives consistent answers and then makes decisions that affect the organization for years to come. And, of course, if the decisions are that important, they should be based on the best possible information. Without strong reasons to the contrary, it's only natural to assume that the freshest, the most up-to-date, the most realtime information is the best. The very term realtime implies that anything else is non-realtime and who would want information based on time that is somehow not real? So from the beginning of the database crusade, an important holy grail has been the construction of decision support systems, supporting the analytic needs of senior management, built in a way where the information in those decision support systems is based on constantly refreshed, always up-to-date information. To put this in particularly sharp focus, one of the metaphors frequently used to describe the resulting system has the chief executive of an organization sitting in an office that has a dashboard, just like the dashboard in a car, and by picking the right set of key indicators to display on that dashboard, he's able to run the company in a completely dynamic fashion.
What an appealing vision. Really, no sarcasm intended or implied. Picture the huge manufacturing company with factories world-wide and a sales force in every country. Nonetheless, the president of the company, in considering which products to build over the next ten years, is able to keep every sale, and every complaint, and every cost in mind all the time. Every decision is based on every piece of data known company wide, right up to the very second the decision is made. To use a biblical analogy: no sparrow falls without the senior decision maker knowing and taking that fall into account in his or her planning. Neat, but is it what we want, even if it could be built? The answer is no, it's the opposite of what we want, and the answer revolves around the role of time in running an organization.
Why does it even matter whether analytic and operational databases are the same or linked? Because the primary reason for wanting to have all the organization's data in a single large database is precisely to make this linkage possible. If you believe that eventually all decisions made by the chief executive should be related to the most up-to-date possible version of every piece of data company-wide, then the only way to do this is to have a single, consistent database. Even if the database is somehow geographically distributed, from a design perspective, the only way to link the executive's dashboard control panel to totally realtime operational data is to have all the parts of the database, company-wide, linked to each other. And it is this requirement that sets up those impossible design requirements. The conceptually impossible task of ratio-
249
nalizing the entire company-wide information structure and the operationally impossible task of building a computer system powerful enough to link all the information at one time both stem from this single ultimate need. If it turns out that this requirement is not only inappropriate, but even more ironically, the opposite of what we want, we can revisit the whole issue of how databases are built (and perhaps create a situation where CIOs will put database back on their top ten list in 2000 and even report their first successes). So let's see how operational and analytic databases compare.
Table 9-1 compares some of the characteristics of the two types of databases. First, a transactional database by definition deals with up-to-the-minute, even up-to-the-second data. Did the order ship? Is The African Queen still in, or did somebody rent it out? Did you receive my check? How much do we owe you? All these questions are meaningful only if the answers are based on completely up-to-date information. If you are told that The African Queen is still available, but it was rented out just a minute ago, what kind of a customer happiness situation does that create? Of course at some level a database can never be totally up-to-date. If a customer takes a videotape off the shelves, intending to rent it, the database can capture the rental only when the tape shows up at the checkout stand. We can't solve that problem, but we can guarantee that the instant the tape is registered, even before the customer's other tapes are entered, before he leaves the store, that tape is marked as unavailable. That way, operational decisions about availability, customer status, and so on, are made based on up-to-the-minute data.
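A minimal sketch of that operational rule, with an invented in-memory inventory standing in for the store's transactional database, shows the tape marked unavailable the instant it is registered, before the rest of the checkout completes:

```python
# Hypothetical video-store inventory; a real system would use a
# transactional database rather than a Python dictionary.
inventory = {"The African Queen": {"available": True}}

def register_rental(title):
    """Mark the tape unavailable the moment it is scanned at the checkout stand."""
    tape = inventory[title]
    if not tape["available"]:
        raise ValueError(f"{title} was rented out moments ago")
    tape["available"] = False  # visible to every availability query immediately

register_rental("The African Queen")
# Any subsequent operational query now sees up-to-the-minute truth:
assert inventory["The African Queen"]["available"] is False
```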
Table 9-1. Comparing the Two Types of Databases

Transactional          Analytic
Up-to-the-minute       Up to period end
Constantly changing    Static snapshots
Detailed               Summarized
Specific               General
Local                  Global
Recent                 Historical
Analytic data is never up-to-date. Up-to-date in a management context means up to the last month, the last quarter, or the last year. Sometimes in a sales management environment up-to-date may mean up to the week or even up to the day, but never up to the minute. Why? By definition, analytic data is used to analyze patterns. Patterns virtually always involve comparisons, and those comparisons are always based on periods of time -- a month, a quarter, a year, a week, or even a day. Focusing on unchanging periods of time allows comparisons to be made in a consistent fashion. Who is the most productive sales rep in each region? Compare sales figures to see who sold the most that month.
Imagine if you used up-to-date data to make the comparison. The sales rep whose sales you summed up last would always have an advantage; he or she would have a chance to have just one more sale counted in the total. Or suppose that you want to predict the effect of a price increase. You increase the price in one region but not in another and then compare sales for a quarter. Fine, once the quarter is over. Suppose that you did the analysis, though, during the middle of the quarter, using the constantly changing operational data. Suppose, in the process of doing the analysis, that you compare sales for several different product lines, looking at different customer types and different channels of distribution. Each different type of analysis involves a new report run. Because you're using live data, though, each new report run includes more and different sales and customer data than the one before. Obviously, the various reports simply can't be compared with each other, unless you wait until after the quarter is over and use the data after it's stopped changing. And, of course, that's the point: analytic databases are always based on periodic data. Up-to-date for an analytic database means last period's data. Of course, everybody wants last month's data the day after the month is over; still, once they get it, nobody wants it to change.
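A small sketch, using invented sales records and a hypothetical period key, shows what aggregating over a closed period looks like: every rep is measured against exactly the same frozen window of time, so the totals are genuinely comparable:

```python
from collections import defaultdict

# Invented sales records; periods are fixed (year, month) buckets.
sales = [
    {"rep": "Alice", "period": (1993, 3), "amount": 1200},
    {"rep": "Bob",   "period": (1993, 3), "amount": 900},
    {"rep": "Alice", "period": (1993, 4), "amount": 700},
]

def totals_for(period):
    """Aggregate by rep over one closed period, so no rep gets the
    advantage of having 'just one more sale' counted."""
    by_rep = defaultdict(int)
    for s in sales:
        if s["period"] == period:
            by_rep[s["rep"]] += s["amount"]
    return dict(by_rep)

print(totals_for((1993, 3)))  # {'Alice': 1200, 'Bob': 900} -- comparable totals
```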
All of this brings us to the next row in our comparison table. Transactional data is constantly changing. That's why customers keep calling to find out if that part has become available or if their payment was finally received. Analytic data never changes; it's based on snapshots of the database. Just like a photographic snapshot captures frozen moments of time, analytic databases capture frozen cross-sections of the organization's data that remain static after they're captured. So when we slice and dice our payments data to determine whether frequent customers pay more regularly than occasional ones, no matter how many times we crawl through the database, picking records, computing sums, and doing comparisons, the data we work with will always be the same. That way, an earlier result can be compared directly and meaningfully with a later one. In a transactional database, multiple results should be different; in an analytic database, multiple results must be the same.
When a customer has a question, it's his data he cares about, and usually at a pretty detailed level. Did a particular shipment leave; is a specific tape in stock; how much, exactly, will a given repair cost? When a question arises about a bill, the customer wants to discuss the charge, one line item at a time. As a result, operational databases can often be divided up on geographical, functional, and demographic lines. A warehouse needs the inventory counts for the stock it holds, the personnel records for the local drivers, and the bills of lading for shipments delivered by only those drivers. Finally, although operational data is quite detailed and specific, it is also generally quite recent. Of course, it is completely up-to-date -- that's the point -- but it also doesn't go very far back in history. For instance, a warehouse may retain shipment data for 30 to 60 days, but try tracking a shipment from two years ago. Generally, once the books are closed, which happens every month in most companies, transactional data can't be changed anymore, and most operational systems don't keep that data. Operational data is detailed, specific, local, and recent.
In contrast, analytic data is based on aggregates. How much did each salesperson sell last month? Which regions are growing and which are shrinking? When asking a question like this, nobody cares, or wants to know, about specific orders or line items. The information is first aggregated by region, sales rep, product, or customer. This immediately implies that the data is at least regional and generally global in scope. The whole point of an analytic database is to facilitate comparisons and this in turn implies keeping data from major parts of the company all in one place; no geographical partitioning here. In making comparisons, time is just as important a variable as space. Analytic databases frequently keep data going back months, quarters, years, and even decades. Obviously, to make it possible to deal with such large amounts of data, aggregation is critical: average sales per sales rep, growth rate by region, and so on. An analytic database is in many ways the opposite of a transactional one; the data is general, aggregated, summarized, and global, and it stretches far back into history.
Even at this level, you can see why operational and analytic requirements lead to distinctly different kinds of databases. The ideal operational database would be locally placed, tracking data by the second and providing a constantly up-to-date picture of its local world. The analytic database, on the other hand, would be at the center of the organization, providing product managers, analysts, and executives with a global picture based on world-wide data stretching back into history, replacing the realtime details of the operational database with the more general patterns that form the true basis for accurate decisions. All of this raises the possibility that the transactional and analytic databases may be separate entities. Looking deeper, however, shows that they must be separate.
What are the real differences between the transactional and analytic databases?
Space and time; but is it really both? Most transactions are localized in space, but it's not hard to imagine larger orders on a national basis that could just as easily not be very localized. So space is not a real differentiator. Time, though, is a very different matter.
In 1905, Albert Einstein proposed that at high speeds, objects experience time differently from the way they ordinarily would. As a result, modeling the universe requires a variety of corrections to handle the varying time metrics affecting objects traveling at various speeds. If time-related effects are not corrected for, our picture of the universe and all predictions based on that picture will be wrong. Time plays just as many tricks on organizations as it does on the physical universe. And the primary operational characteristic of an analytic database is its ability to correct for time-related effects.
Which advertising campaigns were the most effective at getting customers into our video store? Compare them all over the period of the past year, evaluating rental volumes in dollars against the amount spent on each ad. How do we account for changes in our rental prices while doing this? Starting March 1, weekend rentals increased from $2.50 to $3.00. If we ignore this event, any ad we ran at the end of February will look spectacularly effective; after all, revenues increased very quickly that first weekend of March. But the increase had nothing to do with our ad. True, but how do we account for that? One answer: keep the price constant.
Do price increases have an effect on rental volumes? Compare sales before and after each price increase. But what happens when we discover that our system, in order to let us measure advertising effectiveness, no longer allows us to see price changes? How do we even ask this question?
Which department in the store is most profitable: adult movies, children's films, or the general category? Suppose that we reorganized the departments halfway through last year; does that mean we can't analyze profitability for more than six months because the data from before the reorganization can't be compared with the data from after? Perhaps before the reorganization we had six departments and now we have nine? How do we account for this?
How about seasonal variations? Normally, a store does up to half of its business in the six weeks before Christmas. When analyzing a new sales commission plan, we want to correct for this seasonality, essentially subtracting the magnifying effects of the Christmas season so that any sales increases we see are caused by the new commission plan, not by the season. At the same time, if we are doing a straight dollar forecast, we want to add seasonality back in; otherwise, our forecast will come up short.
As products flow through the sales channel, the same units are often sold multiple times: first to a national distributor, then to a regional distributor, then to a local store, and finally to the final consumer. Depending on the type of analysis we do, special provisions are required to ensure that we don't either miss sales or count sales more than once.
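Taking the seasonality question as an example, here is a minimal sketch with invented seasonal indexes and illustrative function names. The same index is used to factor the season out when evaluating a commission plan and to factor it back in when producing a straight dollar forecast:

```python
# Invented seasonal indexes: December does roughly double an average month.
seasonal_index = {"Nov": 1.5, "Dec": 2.0, "Jan": 0.8}

def deseasonalize(amount, month):
    """Factor the season out, so a commission plan's effect isn't
    mistaken for the Christmas rush."""
    return amount / seasonal_index[month]

def forecast(baseline, month):
    """Factor the season back in for a straight dollar forecast."""
    return baseline * seasonal_index[month]

dec_sales = 200_000
underlying = deseasonalize(dec_sales, "Dec")  # 100,000: the season-free level
print(forecast(underlying, "Jan"))            # 80,000: January, season included
```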
All of these questions revolve around the effects of time. When we're dealing with time, two factors critically affect the design of the analytic database:
Accurately dealing with time-related effects is very expensive; therefore, we need to look for ways to build in those corrections in advance so that the corrections don't have to be applied every time a question is asked.
There is no one correct way of dealing with time. Instead, a good analytic database must provide a number of corrective models in advance, and the user must be provided with guidance so that he or she selects the world/time view that is correct for the type of problem being solved.
For example, seasonality, inflation, regional variations, and price histories (price changes, adjustments) are all time-related functions. A good analytic model will maintain different world views to deal with each of these effects. Correcting for inflation is relatively expensive because an accurate adjustment will vary according to both the commodity and the market segment. However, after an adjustment technique has been chosen, the aggregates stored in the analytic database can automatically be adjusted for any combination of inflation, seasonality, regional growth rates, and so on. Similarly, the database can be set up so that one world view masks price changes, allowing comparisons to be made independent of the actual price, and another world view reflects the actual prices, as they varied through time.
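A minimal sketch of the world-view idea, with invented rental data spanning the March 1 price increase: both views are built once, when the snapshot is loaded, rather than recomputed on every query, and the user simply picks the perspective that fits the question:

```python
# Invented line items spanning the March 1 price increase ($2.50 -> $3.00).
rentals = [
    {"weekend": "Feb-27", "price": 2.50, "count": 400},
    {"weekend": "Mar-06", "price": 3.00, "count": 420},
]

CONSTANT_PRICE = 2.50  # reference price for the price-masked world view

# World view 1: prices held constant, so ad effects show through cleanly.
price_masked = [
    {"weekend": r["weekend"], "revenue": CONSTANT_PRICE * r["count"]}
    for r in rentals
]

# World view 2: actual prices kept, so the price change itself can be studied.
actual_price = [
    {"weekend": r["weekend"], "revenue": r["price"] * r["count"]}
    for r in rentals
]

print(price_masked)   # volume effect only
print(actual_price)   # volume and price effects combined
```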
The reason this is so critical is that making these adjustments turns out to be hugely expensive. For example, tracking price changes is very complex indeed. For one thing, prices generally change in a very unpredictable fashion: some product lines have stable prices, and others are forced to react to market pressure frequently. Oil prices, for example, change by the minute, while electricity is regulated and therefore changes on an annual basis. Making things worse, a price change itself is hardly a discrete event. First, the change is announced in phases; big customers find out first, then resellers, and so on. Even after a price change is announced, various preexisting conditions cause it to dribble out in stages and phases. As a result, tracking the effects of price changes accurately, as larger aggregates are computed, can require that virtually every line item of every order be processed. When this factor is combined with changes in product line composition, inflation, and other factors, the cost of rebuilding an analytic database to deal with a new model for time-related effects can be huge.
Fortunately, all of this can be dealt with very effectively in the design of an analytic database. The key concepts in planning for this design are straightforward but call for careful planning and preparation:
provide another perspective on the same data where price changes are reflected in the data and where the price is directly visible as an analytic variable. In the same way, virtually every time-related factor has to be factored in so that it is both factored out and factored in, depending on the perspective chosen.
4. Analyze the types of problems to be solved to pick a manageable number of perspectives: This is the real key to developing a usable analytic database: understanding how the database will really be used. In theory, there are dozens of time-related variables, and users can combine them in hundreds of complex ways. In practice, a small number of carefully chosen perspectives -- often fewer than two dozen -- will provide users with the views they need without requiring them to even be aware of most of the time-related variables or distortions.
The correct recognition of the effect of time, and the ways in which it distorts our world views as we take longer and longer perspectives on organizational histories, turns out to be both a complicator and a simplifier. On the one hand, dealing with all the time-related effects forces us to design a sophisticated and elaborate analytic database to serve the needs of senior decision makers. Building such a database is very expensive. At the same time, once the need for such a database becomes clear, it also becomes quite clear that that database must be different from the operational database. The analytic database is derived from the operational data. Of course, where else would we get our raw data? At the same time, the transformations done to convert that raw data into usable form are so complex and expensive that we can't afford to do them on the fly. And even if we could afford it, the effort would be useless because, by definition, analytic databases are based on historical snapshots anyway, so dynamic derivation adds no value.
When we recognize that having a separate analytic database, far from being a necessary evil, is, au contraire, actually a virtue, all of a sudden the path to having a complete and viable database strategy becomes clear.
Historically, central databases have been necessary to serve the informational needs of the organization as a whole. At the same time, users with analytic requirements have resented those central databases because they've been slow, hard to use, and inflexible. Central databases, in a real sense, are equivalent to dictatorship: a controlled environment with high consistency but no individual freedom. The introduction of personal computers and their personal database models brought with them freedom and anarchy. Product managers could suddenly build their own databases and manipulate them with total abandon, but the data they worked with was typically limited, inaccurate, and intrinsically not shareable even within their own workgroup.
The official response to this situation, ever since 1970, has been the database crusade: if we can only finish our big central database, then everybody will have a consistent set of data to work with. From an operational perspective, the bigger organizations discovered pretty quickly that a single central database could not keep up with the load. And even in a small organization, a central database certainly can't keep up with the decision support needs of analysts, managers, and executives.
Until recently, nonetheless, nobody questioned the idea that the big central database was the eventual solution; it was just taking longer than anybody had thought possible. In the meantime, companies worldwide were dealing with the present by splitting their databases anyway. On the operational front, there are some compelling advantages to distributed processing; the issue right now is how to build such systems. Even more interesting, though, is the splitting that's been happening on the analytic front.
Almost from the beginning, most organizations have found that separating the analytic and operational databases was the better part of valor. Early operational databases such as IMS, IDMS, Total, and ADABAS were very fast and reliable but almost unusable for sophisticated and flexible analysis. Later, relational databases such as DB2, Oracle, Ingres, and Sybase were excellent at supporting flexible analysis but just couldn't keep up with the workload associated with the operational environment. So the answer was simple: split the data into at first two, later three layers. First, keep the operational data in transactionally oriented databases. Even today, with the incredible growth of Oracle and other relational database vendors, it is still true that over 75 percent of operational data in big companies is in the older, nonrelational databases. Second, keep an appropriately transformed copy of the data in a relational database to support the organization's analytic needs; that's how Oracle got to be a $1 billion company. And third, make it easy for individual users to tap directly into those relational databases and transform the data even more to meet their personal and team needs. Although this split has existed in fact for all 20 years of the database era, until recently it was viewed as a necessary evil. One day, so the story went, when either relational databases got fast enough, transactional databases became flexible enough, or something new came on the scene, we'd all reach the promised land: all data would be in a single database and all decisions would be based on totally up-to-the-minute data. All data would be consistent, all redundancy would be eliminated, and all CIOs would stop being depressed.
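A minimal sketch of that three-layer split, with invented records and an illustrative refresh function standing in for the transform step: layer 1 is the raw operational data, layer 2 is a frozen, aggregated copy taken at period end, and layer 3 is the user's own local slice of that copy:

```python
import copy
from collections import defaultdict

# Layer 1 (hypothetical): raw operational records, updated by the second.
operational = [
    {"order": 1, "region": "East", "amount": 500, "period": (1993, 6)},
    {"order": 2, "region": "West", "amount": 300, "period": (1993, 6)},
]

def refresh_analytic_layer(records, period):
    """Layer 2: a transformed, frozen copy taken at period end --
    aggregated by region, never updated in place."""
    frozen = copy.deepcopy([r for r in records if r["period"] == period])
    by_region = defaultdict(int)
    for r in frozen:
        by_region[r["region"]] += r["amount"]
    return dict(by_region)

# Layer 3: users pull from the analytic copy and slice it locally.
analytic = refresh_analytic_layer(operational, (1993, 6))
my_view = {k: v for k, v in analytic.items() if k == "East"}
print(my_view)  # {'East': 500}
```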
In the same historical saga, the main proponent of the central, dictatorial approach was IBM. In part, this is the reason that the company, their computers, and most of all, their databases have been referred to as Big Blue. About five years ago, though, IBM also recognized that, more than a necessary evil, separation of church and state may actually be a virtue. And thus, the Information Warehouse was born. An Information Warehouse is exactly the kind of analytic database described in the previous section. The term, as usual with IBM terms, is wonderful and terrible. Wonderful because in many ways it's immediately understandable. And terrible because it makes the problem of building one sound either too trivial or too simple. Still, count small blessings: IBM's announcement of the goodness of Information Warehouses made the concept of analytic databases official and common.
Analytic database, Information Warehouse, or whatever, the concept is important for much more than itself. It is important because it is the key to understanding how to make corporate databases real. It tells us how to eliminate the three killer problems that kept those CIOs so frustrated for so long:
Data modeling is hard: Building a data model for a single, all-knowing, all-encompassing database, operational and analytic all at once, is hard; in fact, it's impossible for big companies. Building data models for a series of smaller databases, though, is very doable. Operational databases can now be decoupled from each other as they are distributed out in support of the self-managed team. Analytic databases pull information together from the operational databases, but as we've seen, those analytic databases need highly different data models anyway. So keep the data models separate because you have to, and you'll have data models you can actually build.
Performance is a real problem: Performance is a problem in a centralized environment. It isn't a problem at all in a distributed environment. Both the operational systems and the analytic systems are distributed. Replication makes it possible to have enough departmental analytic systems to give everybody snappy performance. Everybody can run as many dynamic queries as they want. And because the workstations are now supported by the shared analytic databases, users pull data off the departmental systems to start with, but then slice and dice it in their own machines. Shared common data and unlimited horsepower, both in one solution. Because the analytic data is directly derived from the operational database and because the analytic database is carefully designed with consistency in mind, the president actually has a chance of getting common answers to common questions, even if she allows more than one vice president to come to the meeting.
The basic model is wrong: The answer is neither dictatorship nor anarchy, but rather federation. The local transactional systems can be modified by the self-managed teams -- freedom -- but only as long as common business rules are not violated -- control and coordination. The individual product manager can build his own database and sort it every three minutes -- freedom -- but the underlying data comes from a departmental analytic database where views are carefully constructed to ensure consistency and correct handling of time-related distortions -- control and coordination. In this database world of the future, we have not one database, but thousands. Even the central databases, which play the key coordination role, are not really central, only linked. And all that linking leads to a new concept: federated databases. Not dictatorship, not anarchy, but federation.
Database is what computers are all about. Not the only thing they're all about, but one of the things they are all about. A well-designed database is a model of the organization. The information contained in it allows a manager to assess the company's health, predict the future, and solve problems before they occur. If the network is the nervous system of an organization, the database is its memory. And, like any well-organized memory, a properly constructed database allows the organization to keep its promises: prompt service, minimized extra steps for customers and employees, and courtesy. It is no accident that politicians and salespeople develop trust by making a point of remembering birthdays, names, and promises. In the same way, an organization that can remember all the details of its interactions with people -- insiders and outsiders -- will be an organization that builds trust over time. Databases are the key then to both health and trust.
Until now all the components we've talked about -- clients, servers, networks, LANs, and mainframes -- have been components we buy and use. We build on them, but essentially, as components, we use them. The database is different. As a technology, we buy it. However, the real meaning of the term database implies something we build, not buy. Passing through this chapter means we've crossed the divide from buying components to starting to design systems that use those components. The database is one of the three central components we design to have a complete client/server system. We'll look at the other two -- processes and applications -- later. The point is that you are now firmly on the way to understanding what that design is all about.
The dream of a corporate-wide database has excited executives, users, and computer professionals alike for over 20 years. Until recently, that dream has been viewed as a holy crusade: essential but unattainable. The tension between the need for understandable data models and the need for a single, totally consistent global data model has been crushing. The parallel tension between the need for local tailoring and individual freedom, stacked up against the need to eliminate redundancy and have central control, appeared to have created a complete impasse. Neither extreme turns out to be correct. By recognizing the essential distinction between operational and analytic databases, we find a way to synthesize two apparently contradictory objectives and needs. The fundamental philosophical difference between the realtime nature of the transactional system and the extended time nature of the decision support system is the key. Because analytic databases are based on frozen snapshots of historical data anyway, there's no reason to have them dynamically computed from the realtime data on demand, even if that were possible to do. And the cost of such computation makes the linkage impossible in any case. As soon as this separation is recognized not just as a necessary evil but as a sign of maturity and an absolute virtue, the need for a central database goes away. In fact, the central database is then seen as leading to less, not more, consistency.
The database vision of the '90s is, then, the vision of federated databases. Consolidation and consistency, yes. Coordination and control, yes. Personal freedom and local tailoring, also yes. As Winston Churchill said, this is not the end, not even the beginning of the end, but perhaps the end of the beginning. After 20 years, hopefully, we have a plan for finally building a database that works. And the reason we can hope that it works is that we have not a bigger picture of database but, for the first time, a picture of how to build that bigger picture out of smaller pictures, each of which we know we can build.
From an even broader perspective, the database vision has driven the design of large applications for the last 25 years. We are now in the beginning stages of the next revolution, one as big in its own way as databases were in the early '70s. The new revolution asks us to think about applications as sets of cooperating components. In a way, it shifts the focus of design back from being highly data-centered to being much more functionally centered, without actually asking us to give up the benefits of the database view of the world. The beauty of the concept of federated databases is that it allows us to take everything we've worked so hard to accomplish in terms of database-centered design and transform it directly into the world of cooperating components. Federated databases and cooperating components -- two sides of the same coin.