You are furnished with design tasks and deliverables that can be incorporated into any project, regardless of architecture or methodology. Master the fundamentals of star schema design and slow change processing. Identify situations that call for multiple stars or cubes. Ensure compatibility across subject areas as your data warehouse grows. Accommodate repeating attributes, recursive hierarchies, and poor data quality. Support conflicting requirements for historic data. Handle variation within a business process and correlation of disparate activities. Boost performance using derived schemas and aggregates.
Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.
Dimension tables also contain key columns that uniquely identify something in an operational system. In the operational systems, such columns identify specific customers, products, and salespeople. These key columns are referred to as natural keys.
The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers.
The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.
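The separation of surrogate and natural keys can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the book's own schema: the table and column names (customer, customer_key, customer_id, order_facts), the key values, and the two states are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key assigned by the load process
    customer_id   TEXT NOT NULL,        -- natural key from the operational system
    customer_name TEXT NOT NULL,
    state         TEXT NOT NULL
);
CREATE TABLE order_facts (
    customer_key  INTEGER NOT NULL REFERENCES customer (customer_key),
    order_dollars REAL NOT NULL
);
-- Two versions of one operational customer share a natural key
-- but carry different surrogate keys (all values are made up).
INSERT INTO customer VALUES (101, 'C-9001', 'ABC Wholesalers', 'NY');
INSERT INTO customer VALUES (205, 'C-9001', 'ABC Wholesalers', 'CA');
INSERT INTO order_facts VALUES (101, 500.0);
INSERT INTO order_facts VALUES (101, 250.0);
INSERT INTO order_facts VALUES (205, 400.0);
""")

# Facts join to the dimension on the single surrogate key column,
# so each order stays tied to the version in effect when it was loaded.
rows = conn.execute("""
    SELECT c.customer_id, c.state, SUM(f.order_dollars)
    FROM order_facts AS f
    JOIN customer AS c ON c.customer_key = f.customer_key
    GROUP BY c.customer_id, c.state
    ORDER BY c.state
""").fetchall()

for row in rows:
    print(row)
```

Grouping by the natural key alone would instead combine both versions into a single history for the customer.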
The term slowly changing dimension refers to the manner in which a dimensional schema responds to changes in a source system. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. It also includes surrogate keys that refer to products, salespeople, customers, and order dates. Together, the foreign keys in a fact table are sometimes considered to identify a unique row in the fact table. This is certainly true in Figure , where each fact table row represents orders of a product sold by a salesperson to a customer on a given day.
In other cases, however, the foreign keys in a fact table are not sufficient to identify a unique row. As we will see in Chapter 3, sometimes a fact table row can be uniquely identified by a subset of its foreign keys, or even by using some nonkey attributes. Each row in the fact table stores facts at a specific level of detail.
The information held in fact tables may be consumed at a variety of different levels, however, by aggregating the facts. In some data warehouse architectures, it is critical that the star schema capture information at the lowest level of detail possible.
In other architectures, this is less important because a separate part of the data warehouse architecture is reserved for atomic data. One or more facts are requested, along with the dimensional attributes that provide the desired context. The facts will be summarized in accordance with the dimensions present in the query. Dimension values are also used to limit the scope of the query, serving as the basis for filters or constraints on the data to be fetched and aggregated.
A properly configured relational database is well equipped to respond to such a query, which is issued using Structured Query Language (SQL). Suppose that someone has asked to see a report showing order dollars by product category and product name during the month of January. The orders star schema from Figure can provide this information, even though order dollars is stored at a lower level of detail. The SQL query in Figure produces the required results, summarizing tens of thousands of fact table rows.
The SELECT clause of the query indicates the dimensions that should appear in the query results (category and product), the fact that is requested (order dollars), and the manner in which it will be aggregated (through the SQL sum operation).
The FROM clause specifies the star schema tables that are involved in the query. The WHERE clause serves two purposes. First, it filters the query results based on the values of specific dimension columns (month and year).
It also specifies the join relationships between tables in the query. In terms of processing time, joins are among the most expensive operations the database must perform; notice that in the case of a star schema, dimension attributes are always a maximum of one join away from facts.
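The query pattern just described can be sketched as follows, using Python's built-in sqlite3 module. The referenced figure is not reproduced here, so the table names (product, day, order_facts), the column names, the sample rows, and the year 2009 are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT, product TEXT);
CREATE TABLE day (day_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE order_facts (product_key INTEGER, day_key INTEGER, order_dollars REAL);

INSERT INTO product VALUES
    (1, 'Packing', 'Box 12x12'), (2, 'Packing', 'Bubble Wrap'), (3, 'Office', 'Stapler');
INSERT INTO day VALUES
    (10, '2009-01-05', 'January', 2009),
    (11, '2009-01-06', 'January', 2009),
    (12, '2009-02-01', 'February', 2009);
INSERT INTO order_facts VALUES
    (1, 10, 100.0), (1, 11, 50.0), (2, 10, 30.0), (3, 11, 20.0), (3, 12, 999.0);
""")

report = conn.execute("""
    SELECT p.category, p.product, SUM(f.order_dollars) AS order_dollars  -- dimensions, then the aggregated fact
    FROM order_facts f, product p, day d                                 -- the star's tables
    WHERE d.month = 'January' AND d.year = 2009                          -- filters on dimension columns
      AND f.product_key = p.product_key                                  -- join: fact to product
      AND f.day_key = d.day_key                                          -- join: fact to day
    GROUP BY p.category, p.product
    ORDER BY p.category, p.product
""").fetchall()

for row in report:
    print(row)
```

Each dimension table is joined directly to the fact table, so no dimension attribute is more than one join away from the facts. The same joins could be written with explicit JOIN ... ON syntax; the WHERE form is used here only because it matches the description above.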
For readers new to dimensional design, there are two key insights to take away. First, the star schema can be used in this manner with any combination of facts and dimensions.
This permits the star to answer questions that may not have been posed during the design process. Although facts are stored at a specific level of detail, they can be rolled up or summarized at various levels of detail.
The reporting possibilities increase dramatically as the richness of the dimension tables is increased. Second, note that the ability to report facts is primarily limited by the level of detail at which they are stored. While it is possible to aggregate the detailed fact table rows in accordance with any set of dimensions, it is not possible to produce a lower level of detail. If a fact table stores daily totals, for example, it cannot be used to look at an individual order.
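Both insights can be illustrated with a small sketch in Python and sqlite3. The rollup helper, the schema, and the sample data are hypothetical; the fact table is assumed to hold one row per product per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT, product TEXT);
CREATE TABLE day (day_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
-- Assumed grain of the fact table: one row per product per day.
CREATE TABLE order_facts (product_key INTEGER, day_key INTEGER, order_dollars REAL);
INSERT INTO product VALUES (1, 'Packing', 'Box'), (2, 'Packing', 'Tape'), (3, 'Office', 'Stapler');
INSERT INTO day VALUES (10, 'January', 2009), (11, 'January', 2009), (12, 'February', 2009);
INSERT INTO order_facts VALUES
    (1, 10, 100.0), (1, 11, 50.0), (2, 10, 30.0), (3, 11, 20.0), (3, 12, 40.0);
""")

# Whitelist of dimension attributes that may appear in a report.
DIMENSION_COLUMNS = {
    "category": "p.category",
    "product": "p.product",
    "month": "d.month",
    "year": "d.year",
}

def rollup(dimensions):
    """Sum order_dollars grouped by any combination of the listed attributes."""
    columns = [DIMENSION_COLUMNS[name] for name in dimensions]  # KeyError if unknown
    select_list = ", ".join(columns + ["SUM(f.order_dollars)"])
    sql = (
        f"SELECT {select_list} "
        "FROM order_facts f "
        "JOIN product p ON p.product_key = f.product_key "
        "JOIN day d ON d.day_key = f.day_key"
    )
    if columns:
        grouping = ", ".join(columns)
        sql += f" GROUP BY {grouping} ORDER BY {grouping}"
    return conn.execute(sql).fetchall()

by_category = rollup(["category"])
by_month_and_category = rollup(["month", "category"])
grand_total = rollup([])
print(by_category, by_month_and_category, grand_total)
```

The helper can summarize upward along any combination of attributes, but nothing in it can recover detail below the stored grain: with one row per product per day, an individual order is not retrievable.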
The importance of this limitation depends in part on your data warehouse architecture, as you will see in the next chapter. Of course, star schema queries can get much more complex than this example. Queries may build on this template in a number of ways. A very important type of report requires that we merge query result sets from more than one star. It is also possible that facts may be aggregated in other ways, perhaps by averaging them or simply counting them.
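One way to merge results from more than one star is sketched below in Python with sqlite3. It assumes two hypothetical fact tables (order_facts and shipment_facts) that share a product dimension. Each star is queried separately at a common level of detail, and the two result sets are then combined on the shared dimension value. This is an illustrative approach under those assumptions, not necessarily the procedure the book goes on to describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE order_facts (product_key INTEGER, quantity_ordered INTEGER);
CREATE TABLE shipment_facts (product_key INTEGER, quantity_shipped INTEGER);
INSERT INTO product VALUES (1, 'Box'), (2, 'Tape'), (3, 'Label');
INSERT INTO order_facts VALUES (1, 10), (1, 5), (2, 8);
INSERT INTO shipment_facts VALUES (1, 12), (3, 4);
""")

def totals_by_product(fact_table, fact_column):
    """Query one star on its own, summarized by product name."""
    # The identifiers come only from the fixed calls below, never from user input.
    sql = (
        f"SELECT p.product, SUM(f.{fact_column}) "
        f"FROM {fact_table} f "
        "JOIN product p ON p.product_key = f.product_key "
        "GROUP BY p.product"
    )
    return dict(conn.execute(sql).fetchall())

ordered = totals_by_product("order_facts", "quantity_ordered")
shipped = totals_by_product("shipment_facts", "quantity_shipped")

# Merge the two result sets on the shared dimension value, keeping
# products that appear in only one of the stars.
report = [
    (name, ordered.get(name, 0), shipped.get(name, 0))
    for name in sorted(ordered.keys() | shipped.keys())
]
print(report)
```

Joining both fact tables to the dimension in a single SELECT would pair every order row with every shipment row for the same product and inflate the sums, which is why each star is aggregated first in this sketch.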
An important feature of the star schema is how it is actually used. Understanding the basic usage pattern of the star schema allows the dimensional designer to make intelligent choices. Browsing is the act of exploring the data within a dimension. The results of browse queries appear as reference data, and may make useful reports. A browse activity may also be an exploratory precursor to a larger query against the fact table.
Instead, queries may browse for distinct combinations of attribute values. Figure shows some queries that browse the product dimension. The first browse in Figure simply fetches a list of product categories. The second browse seeks the list of products within a specific category. Browse queries may return many attributes from within a dimension; some tools support browsing in a grid-like interface. The browse query is important in several respects.
It may serve as the basis for the selection of query predicates, or filters, for a query that involves a fact table. A browse query may also allow users to explore the relationship between dimension values.
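Browse queries of this kind can be sketched as follows, again with Python's sqlite3 module. The product table, its columns, and its rows are hypothetical; note that no fact table is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_key INTEGER PRIMARY KEY,
    sku         TEXT,
    product     TEXT,
    category    TEXT
);
-- Keys 1 and 2 are two versions of the same product, which is
-- why the browse queries below ask for DISTINCT values.
INSERT INTO product VALUES
    (1, 'B-12', 'Box', 'Packing'),
    (2, 'B-12', 'Box', 'Packing'),
    (3, 'T-01', 'Tape', 'Packing'),
    (4, 'S-77', 'Stapler', 'Office');
""")

# Browse 1: the list of product categories.
categories = [
    row[0]
    for row in conn.execute("SELECT DISTINCT category FROM product ORDER BY category")
]

# Browse 2: the products within one category, as a user might do
# before choosing a filter for a fact table query.
products_in_packing = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT product FROM product WHERE category = ? ORDER BY product",
        ("Packing",),
    )
]

# Browse 3: distinct combinations of attribute values, showing how
# categories and products relate to one another.
combinations = conn.execute(
    "SELECT DISTINCT category, product FROM product ORDER BY category, product"
).fetchall()

print(categories, products_in_packing, combinations)
```

The third query is an example of exploring the relationship between dimension values.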
This kind of browsing may be considered when making decisions about how to group attributes into dimensions, as discussed in Chapter 6.

Guiding Principles

The remainder of this book covers a wealth of dimensional design techniques which you can use to describe any business process.
Sometimes it will be useful to understand the reason some of these techniques have been developed. Two simple guiding principles drive these decisions: accuracy and performance. It may seem obvious, but it is important to consider the accuracy of any given design.
The questions that will be asked of an operational system can be determined in advance, and remain consistent over time, but analytic questions always lead to new questions. They will change over time, sometimes dramatically so. Designers must pay close attention to how a dimensional schema represents facts.
Is it possible that they will be aggregated in ways that do not make sense? Is there a design alternative that can prevent such a situation? Of equal importance is the performance of the schema. An analytic design may offer little value over an operational design if it cannot produce timely results. Dimensional designs are very good at providing a rapid response to a wide range of unanticipated questions.
There will be times, however, when a basic design may not be able to serve important business needs efficiently. The performance profile of a solution may drive the decision to provide information in more than one format, as will be seen throughout this book.

Summary

Dimensional modeling is a design approach optimized for analytic systems. A dimensional model captures how a process is measured. Data elements that represent measurements are called facts.
Data elements that provide context for measurements are called dimensions. These elements are grouped into dimension tables and fact tables. Implemented in a relational database, the design is called a star schema. The dimension tables in a star schema employ surrogate keys, enabling the analytic system to respond to changes in operational data in its own way. The granular facts in a star schema can be queried at various levels of detail, aggregated according to desired dimensional context.
Exploring the details within a dimension is referred to as browsing. This chapter has only begun to introduce the fundamentals of dimensional design. After a discussion of architectures in Chapter 2, Chapter 3 will return to the basics of dimensional design. This book fully explains the principles of normalization used to support transaction processing in a relational database management system. A wealth of information is available on the differences between operational and analytic systems.
For more information on separating facts from dimensions, you can consult any book on dimensional design. These books also cover the prototypical query pattern for a star schema; the browse query is discussed in The Data Warehouse Toolkit. That may overstate things, but everyone will agree to this: data warehouse architectures vary widely.
One of the ways in which data warehouse architectures diverge is in their use of dimensional design. Some architectures place a heavier emphasis on the star schema, while others use it in a limited capacity.
The principles of dimensional design are the same, wherever they are put to use. This book is concerned with these principles. With a diversity of architectures, however, comes confusion. The same terms are used to describe different things. Different terms are used to describe the same thing.
Characteristics of one approach are misinterpreted to apply in other situations. In order to understand dimensional design, it is important to clear up this confusion.
To do so requires a brief look at data warehouse architecture. This chapter groups data warehouse architecture into three categories.
The first two are often called enterprise data warehouse architectures, and are closely associated with W. H. Inmon and Ralph Kimball, respectively. The third does not have a well-known figurehead but is equally common.
While these architectures differ in fundamental ways, there is a place for the star schema in each of them. By understanding these approaches, we can avoid misunderstandings in terminology and develop a clear understanding of the capability of the star schema. There is no discussion of pros and cons.
Nor will you find comprehensive specifications for each architecture. Instead, the objectives for this chapter are simple:

1. To understand each approach at a high level
2. To understand the place of the star schema in each

Each real-world implementation is different. Yours may contain elements from one or more of these architectures. You should make an effort to understand the alternatives, however. This will give you a better grasp of what is and what is not true about dimensional design.
Those words were written by W. H. Inmon, in an article that appeared in DM Review magazine. Bill Inmon is a prolific writer and contributor to the data warehousing community. Through hundreds of articles and dozens of books, he has developed and shared an approach to data warehousing that he calls the Corporate Information Factory. This hub-and-spoke architecture is common, even in IT shops that do not attribute their architecture to Inmon.
A highly simplified depiction of the Corporate Information Factory appears in Figure. Some liberties have been taken, removing numerous components that are not relevant to this discussion and using some generic terminology. To understand this architecture, start by looking at the left side of the diagram. There, you will find the operational systems, or transaction systems, that support the business.
The data stores associated with these systems may take a number of different forms, including hierarchical data, relational data, and even simple spreadsheets. For the sake of simplicity, only four operational systems are depicted. This processing step is nontrivial. It may require accessing information in a variety of different formats, resolving differing representations of similar things, and significant restructuring of data.
Some organizations refer to this process as data integration. It may be a batch process that runs periodically or a transaction-based process that occurs in near real time. The final result is the same: the enterprise data warehouse. The enterprise data warehouse is the hub of the corporate information factory. It is an integrated repository of atomic data. Integrated from the various operational systems, it contains a definitive and consistent representation of business activities in a single place.
Atomic in nature, the data in this repository is captured at the lowest level of detail possible. Instead, its purpose is to feed additional data stores dedicated to a variety of analytic systems. The enterprise data warehouse is usually stored in a relational database management system, and Inmon advocates the use of third normal form database design. Surrounding the enterprise data warehouse are numerous other components. Of interest here are the data marts, which appear along the top of the diagram.
These are databases that support a departmental view of information. With a subject area focus, each data mart takes information from the enterprise data warehouse and readies it for analysis.
As the earlier quotation suggests, Inmon advocates the use of dimensional design for these data marts. The data marts may aggregate data from the atomic representation in the enterprise data warehouse. Note that Inmon reserves the term ETL for the movement of data from the operational systems into the enterprise data warehouse. The data marts serve as the focus for analytic activities, which may include queries, reports, and a number of other activities. These activities are enabled by a variety of different tools, including some that are commonly referred to as business intelligence tools and reporting tools.
This book will collectively refer to these tools as business intelligence tools. Note, though, that Inmon reserves this term for a particular application in the Corporate Information Factory.

First, in the s, Kimball was largely responsible for popularizing star schema design. Through his writings, he synthesized and systematized a series of techniques that had been in use as early as the s.
He explained how dimensional design provided an understandable and powerful way to develop analytic databases, and he gave us the terminology that is used throughout this book. Second, Kimball developed an enterprise architecture for the data warehouse, built on the concept of dimensional design.
It allows for an integrated repository of atomic data and relies on dimensional design to support analytics. Because he is so closely associated with the star schema, he is often assigned blame for shortcomings associated with any implementation that utilizes a star, regardless of its architecture.
Other times, the star schema itself is assigned blame. Again, the diagram is somewhat simplified. Though the diagram in Figure appears quite different from that in Figure, the two architectures actually share many characteristics. Like the Corporate Information Factory, this architecture begins by assuming a separation of the operational and analytic systems. As before, operational systems appear on the far left of the diagram. Again, these may incorporate data stores that are relational and nonrelational, and are likely to be numerous.
Moving to the right, an ETL process consolidates information from the various operational systems, integrates it, and loads it into a single repository. If that sounds familiar, it should. The Corporate Information Factory has an analogous process. The dimensional data warehouse in the center of Figure is the end result of the ETL process. It is an integrated repository for atomic data. Again, that should sound familiar. It contains a single view of business activities, as drawn from throughout the enterprise.
It stores that information in a highly granular, or atomic, format. The dimensional data warehouse differs from the enterprise data warehouse in two important ways. First, it is designed according to the principles of dimensional modeling. This contrasts with the Inmon approach, where the enterprise data warehouse is designed using the principles of ER modeling. Second, the dimensional data warehouse may be accessed directly by analytic systems. Although it is not required, this is explicitly permitted by the architecture.
The concept of a data mart becomes a logical distinction; the data mart is a subject area within the data warehouse. In Figure , this is represented by the box that highlights a subset of the tables in the dimensional data warehouse.
These two key differences are often tempered by accepted variations in the architecture. The construction of a dimensional design from a variety of operational data sources can be challenging, and ETL developers often find it useful to design a multi-step process. Sometimes, a set of tables in third normal form is an intermediate step in this process.
Kimball considers this an acceptable feature of a dimensional data warehouse, provided that these staging tables are not accessed directly by any processes other than the ETL process.
When such a set of tables is in place, the dimensional data warehouse comes to resemble the Corporate Information Factory more closely. Both contain a normalized repository of data not accessed by applications, and dimensional representations that are accessed by applications.
In another accepted variation in the architecture, architects choose to insulate the dimensional data warehouse from direct access by analytic applications. In such cases, new data marts may be constructed by extracting data from the dimensional data warehouse. These data marts may aggregate the dimensional data, or even reorganize it into new dimensional structures. Again, this variation increases the resemblance to the Corporate Information Factory, where data marts are seen as separate entities from the integrated repository of atomic data.
In fact, the dimensional data warehouse may be a single logical repository, distributed among numerous physical databases. As you will learn in Chapter 5, this concept does not benefit the Kimball architecture exclusively.
In the case of the dimensional data warehouse, it is a central principle. As previously mentioned, this book will use the term dimensional data warehouse to refer to this architecture. The term ETL will be used in the broad sense, referring to any activity that moves data from one database to another. Likewise, tools and applications that access analytic data, including packaged business intelligence tools, reporting tools, and analytic applications, will be lumped together under the term business intelligence tools.
Stand-Alone Data Marts

The final architecture to be discussed in this chapter is the stand-alone data mart. Unlike the architectures described previously, stand-alone data marts are not closely associated with any well-known advocate.
There is good reason for this. While stand-alone data marts may achieve rapid and inexpensive results in the short term, they can give rise to long-term costs and inefficiencies. These shortcomings are not always reason enough to eschew the stand-alone data mart, but they have contributed to confusion over the capabilities of the star schema.
The stand-alone data mart is an analytic data store that has not been designed in an enterprise context. It is focused exclusively on a subject area. One or more operational systems feed a database called a data mart. The data mart may employ dimensional design, an entity-relationship model, or some other form of design. Analytic tools or applications query it directly, bringing information to end users.
This simple architecture is illustrated in Figure. Development of a stand-alone data mart is often the most expedient path to visible results. Because it does not require cross-functional analysis, the data mart can be put into production quickly. No time must be spent constructing a consolidated view of product or customer, for example.
Instead, the implementation takes a direct route from subject area requirements to implementation. Because results are rapid and less expensive, stand-alone data marts find their way into many organizations. They are not always built from scratch. A stand-alone data mart may become part of the application portfolio when purchased as a packaged application, which provides a prebuilt solution in a subject area.
Packaged data marts may also be available as add-ons to packaged operational applications. Prebuilt solutions like these can further increase the savings in time and cost.
Even in organizations committed to an enterprise data warehouse architecture, stand-alone data marts can be found. Sometimes, they are present as legacy systems, in place before the commitment to the enterprise architecture.
In other cases, they may be built within user organizations, entirely outside the domain of the IT department. Mergers and acquisitions can bring with them new analytic data stores that have not been integrated into the preexisting architecture.
For all these reasons, the stand-alone data mart is a reality for many businesses and organizations. Yet it is almost universally maligned. While often considered a short-term success, the stand-alone data mart frequently becomes a long-term headache.
To understand why, it helps to look at what happens when more than one subject area is supported via stand-alone data marts. Figure depicts the proliferation of stand-alone data marts across multiple subject areas. While a single stand-alone data mart may appear to be the most efficient path to results, the presence of multiple data marts exposes inefficiencies. In Figure , multiple ETL processes are loading data from the same source systems.
The data marts themselves may be based on different technologies, and the user audiences may be relying on separate query and reporting infrastructures. They compound the cost of the total solution, requiring the maintenance of redundant technologies, processes, and skill sets.
Even when these technical inefficiencies are minimized, a more serious deficiency may be lurking in the data itself. If each data mart is built to address a narrow set of needs, what happens when these needs expand?
Lacking a repository for granular data, a data mart may fail to answer a future question that requires more detail than originally anticipated. If these subject areas do not share consistent definitions of common entities (such as products, departments, or customers), then it may be impossible to compare the information. Worst of all, redundant load processes may apply different rules to source data, leading to systems that provide contradictory results.
These issues cause stand-alone data marts to become islands of information. Developed to satisfy a narrow set of needs, they fail to support cross-functional analysis. Extensive rework may be required to adapt them to a deeper or wider set of demands.
Short-term savings give way to long-term costs. These deficiencies should not necessarily preclude the implementation of a stand-alone data mart. As long as there is a shared understanding of the potential future cost, a subject area focus may make sense.
It keeps costs low and minimizes activities that precede the delivery of some initial capability. Too often, though, the easy route is taken without buy-in from all parts of the business. Stand-alone data marts often employ dimensional design.
This is so common, in fact, that the shortcomings of stand-alone data marts are sometimes blamed on the star schema. It has become a common misconception that the star schema is for aggregated data, or that the use of the star schema leads to stovepipes.
By now it should be clear that these failures are not the result of the use of dimensional design. Stand-alone data marts may contain aggregated data, and they are likely to exhibit incompatibilities with one another, but this is not a failure of the star schema. Rather, it is a shortcoming of the narrow scope of the stand-alone data mart.

Architecture and Dimensional Design

All of these architectures are successfully put to use by businesses and organizations throughout the world.
Your data warehouse architecture may closely match one of these paradigms, or you may find it incorporates elements of each. A high-level comparison of these approaches allows you to cut through the noise and confusion that surround the star schema. The three architectural paradigms discussed in this chapter are summarized in Figure. They aim to support analytic needs across a business or organization.
The star schema gets its name from the physical model's resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star's points.