Tải bản đầy đủ (.pdf) (10 trang)

Hands-On Microsoft SQL Server 2008 Integration Services part 47 pptx

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (341.41 KB, 10 trang )

438 Hands-On Microsoft SQL Server 2008 Integration Services
in auditing. This transformation uses the Advanced Editor to expose its properties.
To store the count of rows into a variable, you need to define a variable before you can
use this task. The variable must be in the scope of the Data Flow task to which the
transformation belongs. To count the rows in the data flow and update the variable
value, you simply select the variable in the VariableName field of the Component
Properties tab. You can actually have multiple RowCount transformations writing
to the same variable, for instance, if you are updating auditing information about
loading of data rows you might choose to use same variable in multiple RowCount
transformations. However, you need be aware of the fact that the multiple RowCount
transformations using same the variable could update variable value multiple times, and
hence your auditing could get wrong values, as the variable will keep the last value that
it is updated with.
Figure 10-17 Configuring Audit transformation
Chapter 10: Data Flow Transformations 439
Business Intelligence Transformations
The Business Intelligence transformations enable you to maintain a Slowly Changing
Dimension (SCD); perform data cleaning, data standardization, and text mining
operations; and run data mining prediction queries against data mining models. The
Slowly Changing Dimension (SCD) transformation is slightly complex to understand,
but it provides a configuration wizard that makes the task of loading a dimension table
much easier. You will work through a Hands-On exercise to use an SCD transformation,
and you will perform another Hands-On exercise to remove duplicates from data using
Lookup, Fuzzy Lookup, and Fuzzy Grouping transformations.
Slowly Changing Dimension Transformation
SCD helps you manage slowly changing data attributes in a data warehouse. To describe
this transformation clearly and for the benefit of those who are new in data warehousing,
let’s start with some conceptual details. When a database for a data warehouse is built
using the dimensional modeling design, the data is sliced across many dimensions, such
as time, location, and product, and is stored in two types of tables called the fact table
and the dimension table. A dimension table stores records relevant for that particular


dimension, and a fact table contains measures for the dimensions. For example, a product
dimension table contains information about products such as product category, model
details, and product features. A sales fact table contains measures of sales across different
dimensions, such as total sales of a product, total sales in a month, and total sales at a
location. The advantage of dimensional data modeling is that you can aggregate data
across multiple dimensions—for example, total sales of a product at a location in a year.
To learn more about dimensional modeling, refer to Chapter 12.
One of the issues you will face while maintaining a data warehouse is how to handle
changes to the dimensional data across time. For example, contacts may change their
addresses, products can change over time, new products may be introduced, some old
products may be removed, or your company may decide to reorganize its sales regions.
These changes may pose some challenges when you’re tasked with maintaining the
history of these changes. In order to create complex business reports that contain
aggregations over a period of time, maintaining history in a data warehouse or online
analytical processing (OLAP) system is much more important compared to online
transaction processing (OLTP) systems, which are generally designed to represent the
current state. For example, if a sales representative is allocated a new sales region, the
commission for the sales she has made in the previous region should still go to her; this
makes it necessary to keep a history of changes to her sales region.
To maintain a history for changing data while dealing with restrictions imposed by
data complexity and storage limitations, information engineers have been adopting
440 Hands-On Microsoft SQL Server 2008 Integration Services
different approaches to different types of data. When maintaining a history for a
particular data type is not important, an existing record simply gets overwritten by a
new record. This is the simplest form of handling a slowly changing dimension and
is classified as Type 1. When preserving the history of changing attributes of data is
critical to business, the new record is added to the dimension along with the existing
original record. This is classified as a Type 2 slowly changing dimension. This is a more
commonly used form that allows you to maintain accurate historical data; however, if
the dimension sizes become too big and the number of rows is too high, you may have

to work on storage and performance issues. In a Type 3 slowly changing dimension,
the issue of additional rows with each change is addressed by adding new columns for
the current value and when the current value became effective. Hence, the dimension
contains the original value and the current value, irrespective of how many changes
have occurred in between, thus limiting the ability to register changes accurately; this
does overcome the limitations of Type 2, however. In Integration Services, the slowly
changing dimension transformation doesn’t support Type 3 changes.
The SCD transformation supports up to six different outputs to output records
for different types of processing before loading them into the dimension table. These
outputs have been assigned different names on the basis of different Change Type
records they carry. The outputs are discussed in more detail later in this chapter, but
here are brief descriptions for you:
Changing Attributes Updates Output
c Used for the rows where attributes of
dimension members are changing.
Fixed Attribute Output
c Used for the rows where the attributes of dimension
members are not allowed to change—i.e., the attributes are fixed.
Historical Attributes Inserts Output
c Used for the rows where the attributes of
dimension members are required to keep history.
Inferred Member Updates Output
c Used for the rows where the attributes of
the dimension members are to be updated as the row was earlier inserted as an
inferred member with most of the attributes as blank.
New Output
c Used for the new member rows and for the current member rows
that are to be created as new where the existing member has been updated as
expired to keep history.
Unchanged Output

c Used for the existing members in the dimension table
where no attribute has changed and no work is performed by the SCD; such rows
are diverted to unchanged output and no data flow is created by default; hence,
this acts as a sink for unused rows. However, if you want to capture these rows,
you can collect them by creating a data flow path connected to this output.
Chapter 10: Data Flow Transformations 441
You can use the Slowly Changing Dimension transformation to update and
insert records in data warehouse dimension tables for different types of changes.
This transformation contains a wizard that helps you configure various branches
relevant to the change types. You set various change types on the columns and select
attribute options in a wizard to configure the SCD transformation. When you finish
configurations in the end, the wizard configures the SCD transformation to send the
different types of records based on their change type settings to different outputs. But
the wizard does not stop at simply outputting the records; it also creates downstream
data flow to each used output by adding the required components to load data to the
SCD table. The great thing about this is that you have complete flexibility to modify
the data flow created by the SCD Wizard.
When you connect an input to this transformation in the data flow and double-click
it, you invoke the Slowly Changing Dimension Wizard. The SCD Wizard allows
you to select a dimension table and specify business keys to map transformation input
columns to the dimension table columns. Then you can manage changes in the various
columns by specifying a change type for each dimension column. The following change
types are available while using the SCD Wizard:
Fixed Attribute
c e Fixed attribute change type for a column indicates that the
value in the column is not expected to change and any changes on that column
should be treated as errors. At run time, the SCD transformation performs a
lookup for the incoming row against the dimension table to check for a match.
If SCD doesn’t find any match for a particular row, it diverts that row to the
New Output; however, if a match is found and the matching key contains the

columns with Fixed attribute change types, the SCD transformation diverts the
row to the Fixed Attribute Output. As no change is expected in case of Fixed
attribute change type columns, such changes are not expected to be applied to
the dimension table, hence the SCD Wizard creates no data flow for this output.
However, if you want to capture these rows, you can create a data flow and connect
to the Fixed Attribute Output of the SCD transformation. You can also specify to
fail the transformation if changes are detected in a Fixed attribute in the wizard.
Changing Attribute
c e columns that you select for the changing attribute
change type are treated as Type 1 changes, and the changed rows will overwrite
the existing rows. At run time, the SCD transformation performs a lookup for the
incoming row against the dimension table and checks for the match. If a match is
found and the matching key has the specified columns with changed values, the
transformation directs the row to the Changing Attributes Updates Output. e
SCD Wizard also adds an OLE DB Command transformation connected to this
output to perform the UPDATE operation on the dimension table. If the SCD
442 Hands-On Microsoft SQL Server 2008 Integration Services
transformation finds a matching row during the lookup operation but the row
doesn’t contain any change, such rows are diverted to the Unchanged Output and
the SCD Wizard doesn’t create any data flow for this output by default. However,
if you want to capture such rows, you can create a downstream data flow and
attach it to this output. Your data warehouse may contain multiple records for
a business key, as it keeps the history depending on what kind of changes it has
been through. So when an SCD transformation gets a Changing attribute type
record, it may match against multiple records. For this, you can select an option in
the wizard to indicate whether the task should change all the matching records,
including outdated records, when a change is detected in a Changing attribute.
Historical Attribute
c e Historical attribute change type on a column creates
a new record and marks existing record as expired by changing the date fields

or setting a flag in an indicator column. ese changes are treated as Type 2
changes. At run time, when the SCD transformation finds a matching row that
has changes in a Historical attribute, it diverts the row to a Historical Attribute
Inserts Output. e SCD Wizard creates a data flow, which first updates the
existing record and then inserts the incoming record as a new record with the
same business key in the dimension table. In the data flow that the SCD Wizard
creates, it adds a Derived Column transformation as the first component that is
connected to the Historical Attribute Inserts Output. is component adds a new
column to the incoming record as an expiry indicator (e.g., IsCurrent = False).
e second component added in this data flow branch is an OLE DB Command
transformation that updates the existing record in the dimension table with the
newly added column to mark it as expired (Set IsCurrent = False where IsCurrent
= True). After updating the existing record, it then directs all the columns sans the
newly added indicator column of the incoming record to the New Output data
flow path using the Union All transformation so that a new record is added to the
dimension table. Refer to Figure 10-23, later in the chapter, for clarity on the data
flow it creates. If you find this difficult to understand now, complete the Hands-
On exercise and it will make much more sense. e New Output also gets all the
new records for which SCD did not find any match within the dimension table.
e Union All transformation combines new records and the records coming
from the OLE DB Command component of the Historical Attribute Inserts
Output data flow path for the New Output. After Union All, another Derived
Column transformation is added that adds a new column and the value to indicate
that the record is current (IsCurrent = True to indicate the current state of the
record) before being added to the dimension table.
In addition to the outputs specified in these Change Type descriptions, SCD
includes one more, called Inferred Member Updates Output. This output gets all the
Chapter 10: Data Flow Transformations 443
inferred members’ records. An inferred member record is created in the dimension
table with the minimal data, anticipating that more details of the record will arrive

when a later loading process runs. An inferred member is created because the fact
table contains foreign keys for the dimension tables, and sometimes the loading of the
fact table fails because fact data arrives when a key doesn’t yet exist in the dimension
table. To avoid the failure of a fact table loading process, you should create a record
in the dimension table with minimal data so that when the fact arrives earlier than its
dimension member, it can still be loaded. This inferred member record is updated later
when the attribute data arrives. Support for uploading inferred members details when
they arrive is inbuilt in the SCD transformation, and the SCD Wizard allows you to
choose one of the following methods (see Figure 10-22, later in the chapter) to identify
an inferred member while uploading:
All columns with a change type are null.
c
Use a Boolean column to indicate whether the current record is an inferred member. c
You will use the first option if you have created a minimal record in the dimension
table while loading the fact table, when all the columns with a change type have null
values. Alternatively, you can use a Boolean column to indicate that the incoming record
is an inferred member record. After you specify the inferred member identification and
finish the SCD Wizard, you will see that the wizard has created a data flow with lot of
data flow components connected to different outputs to load the dimension table.
Well, it is time to work through a Hands-On exercise to see how you can load a
slowly changing dimension using the SCD Wizard. Later in the exercise you will see
that the SCD adds all the downstream components for you that you can change if
required. This flexibility takes out the complexity from the development of a package
containing SCD and hence makes it quite easy.
Hands-On: Loading a Slowly Changing Dimension
You are tasked with loading the DimCustomer dimension using a SCD transformation
from the Customer table. The challenge is that the customers were registered using
a variety of methods, and not all customers were allocated a customer number, so the
Customer table contains some records that do not have a Customer ID assigned to
them. You need to identify those customers as well. The other business requirements

are that changes to phone numbers are allowed and you don’t need to keep a history for
these changes; however, a complete history is to be retained for change of address. As
an Email attribute is required for logging on to the customer’s portal, this attribute is
not allowed to be changed.
444 Hands-On Microsoft SQL Server 2008 Integration Services
Method
The Campaign database contains a Customer table that you will use to load the
DimCustomer dimension table. In real life, DimCustomer would be in a different
database, but to keep things simple for this exercise, the DimCustomer table has been
created in the Campaign database. The exercise is divided in four parts to work with
Slowly Changing Dimension.
Exercise (Add Data Flow Task and
OLE DB Source to a New Package)
In the first part you will create a new package and add starting components to the package.
1. Open the Data Flow transformations project in BIDS and add a New SSIS Package.
Rename this new package as Loading Slowly Changing Dimension.dtsx.
2. Drop a Data Flow task from the Toolbox onto the Control Flow surface and
rename it Loading SCD.
3. Double-click the Loading SCD to go to the Data Flow panel. Drop an OLE
DB source from the Toolbox onto the Data Flow surface. Rename the OLE DB
source Customer.
4. Double-click Customer to open the OLE DB Source Editor. Click the New
button next to the OLE DB Connection Manager field. Choose localhost.
Campaign in the Configure OLE DB Connection Manager dialog box and click
OK. Leave Table Or View selected in the Data Access Mode field. Select the
[dbo].[Customer] table from the drop-down list in the “Name of the table or the
view” field. Go to the Columns page to map the External Columns to the Output
Columns. Click OK to close the OLE DB Source Editor.
Exercise (Identify Customers that Do Not Have a Customer ID)
An SCD transformation needs to have a business key so that it can do a lookup with

the dimension table using this key. As a requirement, the SCD transformation doesn’t
allow this key to contain null values. In this part, you will filter out all the records
coming in the pipeline that have a Customer ID equal to Null.
5. From the Toolbox, drag and drop the Conditional Split transformation below
the Customer source. Connect both the components by dragging the green
arrow from the Customer and dropping it on the Conditional Split component.
Rename the Conditional Split component Customers Filter.
6. Double-click the Customers Filter component to open the Conditional Split
Transformation Editor. The interface of this component is basically an Expression
Builder UI. Expand the NULL Functions node in the top-right section of the
Chapter 10: Data Flow Transformations 445
dialog box and drag and drop ISNULL( <<expression>> ) in the Condition
column in the grid. When you click outside the condition field, the expression will
turn red, indicating that there is an error with the expression syntax. This will add
an output with the name Case 1 to this component.
7. Expand the Columns node in the top-left box and drag the CustomerID on
the <<expression>> part of the ISNULL function so that the Condition field
contains ISNULL( [CustomerID] ). When you click outside the Condition field,
the expression turns black, indicating that the expression syntax is correct. The
ISNULL function will be true for the null values of CustomerID column and the
component will divert rows that evaluate true to expression to the Case 1 output.
8. Click in the Output Name column and change Case 1 to Customers without
CustomerID. Click in the Default Output Name field and rename it Customers
with CustomerID, as shown in Figure 10-18. This component has been configured
to output records to two outputs on the basis of whether the CustomerID is a null
or a non-null value. Click OK to close the editor.
9. Drag an Excel destination onto the Data Flow surface below the Customers
Filter. Click the Customers Filter and drag the green arrow onto the Excel
Destination. Select Customers without CustomerID in the Output field of the
Input Output Selection dialog box. Click OK to return to the Designer.

10. Double-click the Excel destination to open the editor. Click the New button next
to OLE DB Connection Manager field. Type C:\SSIS\RawFiles\Customers
without CustomerID.xls in the Excel File Path field in the Excel Connection
Manager and click OK to return. The specified file will be created at run time.
11. Click the New button next to “Name of the Excel sheet” field and replace
Excel Destination written after CREATE TABLE with Customers without
CustomerID in the Create Table dialog box; then click OK twice. Select the
Customers_without_CustomerID in this field. Go to the Mappings page to create
mappings between the input and destination columns. Click OK to close the editor.
Exercise (Configure the Slowly Changing
Dimension Transformation)
Finally, you will configure SCD transformation using the SCD Wizard in this part.
12. Drag the Slowly Changing Dimension transformation from the Toolbox onto
the Data Flow surface just below the Customers Filter component. Connect
Customers Filter by dragging the green arrow connector to the Slowly Changing
Dimension transformation. Rename Slowly Changing Dimension to Loading
DimCustomer.
13. Double-click the Loading DimCustomer transformation. This will start the SCD
Wizard. Click Next to go to the Select A Dimension Table And Keys screen.
446 Hands-On Microsoft SQL Server 2008 Integration Services
14. Make sure that localhost.Campaign is selected in the Connection Manager field.
Select [dbo].[DimCustomer] from the drop-down list in the Table Or View
field. As you select the dimension table, you will see that the Input Columns are
mapped to the Dimension Columns in the grid. Click in the Key Type column of
the CustomerID field and select Business Key from the drop-down list, as shown
in Figure 10-19. Click Next to move on.
Figure 10-18 Conditionally splitting customers on the basis of the CustomerID column
Chapter 10: Data Flow Transformations 447
15. Specify the change types for various columns. As per the requirements set out
in the initial statement of this exercise, you are to select all the address fields as

the
Historical attribute, e-mail as the Fixed attribute, and phone numbers as the
Changing attribute. Click in the first row of the Dimension Columns and select
Address Line 1 from the drop-down list; then click in the Change Type field to
set it as a Historical attribute from the drop-down list. Set the different columns
to different change types as per the following table.
Dimension Columns Change Type
Address Line 1 Historical attribute
Address Line 2 Historical attribute
Address Line 3 Historical attribute
City Historical attribute
State Historical attribute
Figure 10-19 Selecting Business Key in the SCD Wizard
(continued)

×