Hands-On Microsoft SQL Server 2008 Integration Services, Part 42

In the Component Properties tab, you specify common and custom properties for
this component. The properties you can specify are:
- The number of seconds before the command times out, in the CommandTimeout field. By default, this field has a value of 0, which indicates an infinite timeout.
- The code page value, in the DefaultCodePage field, used when the data source is unable to provide the code page information.
- The LocaleID, which can be changed from the default value to any other Windows LocaleID.
- The ValidateExternalMetadata value, which can be changed from the default value of True to False if you do not want your component to be validated during the validation phase.
- The SQLCommand field, where you can type in the SQL statement that this transformation runs for each row of the input data. You can use parameters in your SQL statement and map these parameters to the input columns so that the SQL statement modifies the data, made available by the OLE DB Connection Manager, on the basis of values in the input columns. By default, these parameters are named Param_0, Param_1, and so forth, and you cannot change these names. However, if you use a stored procedure with SQL variables in it, you can use the variable names, which makes the mapping of parameters easier. This option is explained later in the chapter (refer to Figure 10-9).
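As a minimal sketch (the table and column names here are hypothetical), a statement typed into the SQLCommand field might look like the following; the ? placeholders are the OLE DB parameter markers that surface as Param_0 and Param_1 in the Column Mappings tab:

UPDATE dbo.Customer
SET    EmailAddress = ?   -- mapped to Param_0
WHERE  CustomerID   = ?   -- mapped to Param_1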
The mappings between parameters used in SQL statements and input columns are
defined in the Column Mappings tab. In this tab, after typing the SQL statement in
the SQLCommand field, you will see the Available Destination Columns populated
with parameters for you. (You may have to click Refresh to see the parameters.)
The automatic population of parameters depends on the capabilities of the OLE DB provider you specified earlier. For some third-party OLE DB providers that do not support deriving parameter information from the SQL statement, you will need to create the parameter columns manually in the External Columns node of OLE DB Command Input by going to the Input And Output Properties tab, assigning them names such as Param_0, Param_1, and so on, and specifying a value of 1 for the DBParamInfoFlags custom property of each column. You can then manually create mappings between the Available Input Columns and the Available Destination Columns by using the drag-and-drop technique.
Split and Join Transformations
These transformations can create multiple copies of input data, split input data into one
or more outputs, merge multiple inputs, or add columns to the pipeline by looking up
exact matches in the reference table. After reading through the descriptions of these split and join transformations, you will complete the first Hands-On exercise of this chapter, which also covers some of the Row transformations.
Conditional Split Transformation
Sometimes you will need to work with a particular data set, i.e., a subset, separately from the rest of the data coming through the pipeline. For example, you may want to apply different business rules to different types of customers. In such cases, you will need to split the data into multiple data sets that match specific criteria, using conditions defined in the Conditional Split transformation. This transformation allows you to create more than one output and assign a condition (filter) to each output for the type of data that can pass through it. Among the outputs, one has to be a default output for the rows that meet no criteria. When an input data row hits the Conditional Split transformation, it evaluates the row against the set of conditions one by one and routes it to the first output whose criteria it matches. Each row can be diverted to only one output, and this output has to be the first one for which the condition evaluates to True. This transformation has one input, one error output, and, as you can make out, multiple outputs.

The user interface of the Conditional Split Transformation Editor is similar to
that of a property expression. You can select variables and columns from the top-left
section and functions and operators from the top-right section of the editor window.
You can build an expression using variables, columns, functions, and operators in the
Condition field for an output. As you add a condition, an output is created with an
order number specified in the Order column and a name is assigned to it in the Output
Name field. The order number plays a vital role in routing the rows to outputs, as the
row is matched to these conditions in an ascending order and the row is diverted to
the output for which the condition becomes true first. Once the row has been diverted,
the rest of the conditions are ignored, as a row is always sent only to one output. If a
row doesn’t meet any condition, in the end it will be sent to the default output of the
transformation. The Output Name can be changed to a more appropriate name that
suits your data set. This transformation has a mandatory default output built in for you.
The syntax you use to build an expression for a condition uses the expression
grammar. For example, if you want to split customers on the basis of country (say, you want to separate out rows for the UK, the USA, and the rest of the world), your Conditional Split transformation will have three outputs with the following configurations:
Order   Output Name   Condition
1       Case 1        [Country] == "UK"
2       Case 2        [Country] == "USA"
The default output will collect all the rows for which Country is neither UK nor USA.
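To make the routing concrete, here is a rough T-SQL analogue of what each output receives; the staging table name is hypothetical, and in the transformation itself each row is simply tested against the conditions in order and sent to the first output that matches:

-- Output "Case 1"
SELECT * FROM dbo.CustomerStaging WHERE Country = 'UK';

-- Output "Case 2"
SELECT * FROM dbo.CustomerStaging WHERE Country = 'USA';

-- Default output: rows that matched neither condition
SELECT * FROM dbo.CustomerStaging
WHERE Country NOT IN ('UK', 'USA') OR Country IS NULL;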
Multicast Transformation
The Conditional Split transformation enables you to divert a row to a specific output
and split a data set. It doesn’t allow you to divert a row to more than one output and
hence does not create a copy of the data set. However, sometimes you may need
to create copies of a data set during run time so that you can apply multiple sets of
transformations to the same data. For example, working within the same ETL, you
may want to treat your sales data differently when preparing data for balance sheets than when using it to calculate commissions paid to the salespersons. Another example could be when you want to load data to your relational database as well as to an analytical reporting data mart, with both having different data models and obviously different
loading requirements. In this case, you will create two sets of data, apply different
transformations to the data to bring them in line with the data model requirements,
and then load the data to the destination database. This transformation has one input
and supports multiple outputs to perform the multicast operation.
The Multicast transformation creates copies of a data set by sending every row to
every output. It does not apply any other transformation to the data other than diverting
a data set to more than one output. The transformation is simple to use and doesn’t
require much to configure. In fact, when you open the Multicast Transformation
Editor, you will see nothing to configure, just two blank panes aligned side by side.
This is because you configure this transformation by connecting its outputs to inputs
of multiple components and not by setting properties on the attributes. On the Data
Flow surface, when you click the Multicast transformation after connecting its first
output to a downstream component, you will see that it provides another output (a green
arrow emerging from the transformation) for you to connect to another downstream
component. If you double-click the Multicast transformation after connecting the
second output to another downstream component, you will see two outputs listed in the
left pane of the editor window, and the right pane shows the properties of the output
selected in the left pane. You can modify the name of the output and write a description
for it if you want and that’s it.
Union All Transformation
This component works similarly to the UNION ALL command of T-SQL and combines two
or more inputs into a single output. To support this functionality, this transformation has
multiple inputs and one output and does not support an error output. During run time,
the transformation picks up all the rows from one input, sends them to the output, then
picks up all the rows from the second input, sends them to the output after combining
them with the rows of the first input, and this goes on until all the inputs have been
combined. Note that this transformation can consume the inputs in any order, based on when data becomes available at each input. Also, it does not sort the records in any way. All
the rows from a particular input will be output together.
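Conceptually, the result is the same as a T-SQL UNION ALL over the inputs (the table names below are only illustrative), except that Integration Services combines the rows in the pipeline and makes no guarantee about the order in which the inputs are read:

SELECT CustomerID, FirstName, LastName FROM dbo.Customers_UK
UNION ALL
SELECT CustomerID, FirstName, LastName FROM dbo.Customers_US
UNION ALL
SELECT CustomerID, FirstName, LastName FROM dbo.Customers_Web;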
On the Data Flow Designer, when you connect an input to this transformation, the transformation copies the metadata from this first input to its output, so that the columns of the first input are mapped to the output it creates. Columns with matching metadata in any input you connect after the first are also mapped to the corresponding output columns. Any column that does not exist
in the first input, but is a part of subsequent input, will not be copied in the output
columns by the transformation. You must create and map this column manually in the
Output Column Name field of the output with the corresponding Input column. If
you don’t create and map a column that exists in subsequent inputs but not in the first
input, that column will be ignored in the output. You can also change mappings in the
editor by clicking in the Input field and choosing an option from the drop-down list.
Merge Transformation
As its name suggests, the Merge transformation combines two inputs into a single
output but requires that the inputs be sorted and the columns have matching metadata.
To perform merge functionality, this transformation has two inputs and one output. It
does not have an error output. The Merge transformation can even provide you with sorted output by interleaving rows from the inputs based on their key values. This transformation is similar to the Union All transformation, with some key differences:
- The Merge transformation requires sorted inputs, whereas the Union All transformation does not.
- The Merge transformation can combine only two inputs, whereas Union All can combine more than two inputs.
- The Merge transformation can provide sorted output, whereas Union All cannot.
When you add a Merge transformation into a package’s data flow, you won’t be able
to open its editor until you have attached two sorted inputs that have properly defined sort-key positions for the columns on which they are sorted. The input data sets must be sorted (for example, by using a Sort transformation), which must be indicated by the IsSorted
property on the outputs and the SortKeyPosition property of the output columns of the
upstream components. These properties are discussed in Chapter 15, where you can refer
to Figure 15-4 and Figure 15-5 to see how these properties are used.
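As a sketch of one common alternative to adding a Sort transformation (the table below is hypothetical), you can let the source database do the sorting and then mark the source output accordingly in the Advanced Editor:

-- Source query for an OLE DB Source feeding the Merge transformation.
-- After this, set IsSorted = True on the source output and
-- SortKeyPosition = 1 on the CustomerID output column in the Advanced Editor.
SELECT CustomerID, FirstName, LastName
FROM   dbo.Customers
ORDER  BY CustomerID;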
Once the sorted inputs are connected to the inputs of this transformation and you are
able to open the editor, you will see the output columns mapped to the Input columns
of both the inputs. This transformation creates the output by copying the metadata of
the first input you connect to it. You can make changes to the mappings. However, the
metadata of the mapped columns must match. If the second input has a column that is
not in the Output columns list, you can add and map this column manually.
Merge Join Transformation
The Merge Join transformation joins two sorted inputs into one output using a full,
left, or inner join. For merging two inputs, this transformation supports two inputs
and one output and does not support an error output. The joins work exactly as they do in T-SQL. For the benefit of those who haven't worked with joins, here is a quick refresher.
Inner Join
An inner join returns the rows that have a join key present in both data sets. For example, suppose you have a table of all your customers, a table of all the products your company has sold in the last ten years, and a date table. An inner join will return a list of only those customers who purchased a product in the year 2009, along with the products they bought.
Left Outer Join
A left outer join returns all the rows from the first (left side of the join) table and the rows from the second table for the matching key. For the rows that don't have a matching
key in the second table, the corresponding output columns are filled with nulls. For
the customers and products example, a left outer join will list all the customers with
the products listed against the customers who have made purchases in 2009 and nulls
against the customers who didn’t make a purchase in 2009.

Full Outer Join
A full outer join returns all the rows from both tables joined by the key—simply put,
it is a list of all the data. If the rows from the first table don't have a matching key in the
second table, the corresponding second table columns are filled with nulls. Similarly,
if the rows from the second table don’t have a matching key in the first table, the
corresponding first table columns are filled with nulls.
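Using hypothetical customer and purchase tables along the lines of the example above, the three join types supported by the Merge Join transformation correspond roughly to the following T-SQL:

-- Inner join: only customers who made a purchase in 2009
SELECT c.CustomerID, c.CustomerName, p.ProductName
FROM   dbo.Customers c
INNER  JOIN dbo.Purchases2009 p ON p.CustomerID = c.CustomerID;

-- Left outer join: all customers; purchase columns are NULL
-- for customers with no purchase in 2009
SELECT c.CustomerID, c.CustomerName, p.ProductName
FROM   dbo.Customers c
LEFT   JOIN dbo.Purchases2009 p ON p.CustomerID = c.CustomerID;

-- Full outer join: all rows from both tables, with NULLs
-- wherever the other side has no matching key
SELECT c.CustomerID, c.CustomerName, p.ProductName
FROM   dbo.Customers c
FULL   JOIN dbo.Purchases2009 p ON p.CustomerID = c.CustomerID;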
When you add a Merge Join transformation into a data flow, you won't be able to
open the editor until you have attached two sorted inputs—one to the merge join left
input and the other to the merge join right input. The input data sets must be sorted
physically—for example, using a sort transformation—which must be indicated by
the IsSorted property on the outputs and the SortKeyPosition property of the output
columns of the upstream components. These properties are discussed in Chapter 15,
where you can refer to Figure 15-4 and Figure 15-5 to see how these properties are used.
Once the sorted inputs have been connected to the left and right inputs of this
transformation, you can open the editor. You can choose among inner join, left outer
join, or full outer join types in the Join Type field. You can also swap your inputs by
clicking the Swap Inputs button for a Left Outer Join. After selecting the join type, you
can specify the join keys if they have not already been picked up by the transformation.
The joined columns must have the matching metadata, and the join key must be in
the same order as specified by the sort key. Next, you can select the output columns
by selecting the check boxes or using the drop-down lists in the Input columns in the
lower section of the Transformation Editor. You will also be able to assign an alias to
the Output column. Once you have selected the required columns for the outputs, you
can close the editor, as there is no other configuration in the editor; however, there are a couple of important properties that you may want to configure for this transformation. You can access these special properties in the Properties window by pressing the F4 key while the transformation is selected.
The first performance-related property is the MaxBuffersPerInput property,
which lets you suggest an integer value for the number of buffers to hold for each input. Merge Join is a semi-blocking transformation; that is, it needs a set of rows before it can start outputting any records. You can well imagine this, as the transformation needs a good set of rows, or buffers, for each input to be able
to join them and start spitting out the resultant rows. The second reason for this
transformation to collect large sets of data is to avoid deadlocking between the threads
that are processing data. If you are working with really large data sets, the Merge Join
transformation can cache huge amounts of data, putting high pressure on memory
available on the computer. This could affect performance of other applications on the
server. To avoid such a situation, you can throttle the memory requirements using
the MaxBuffersPerInput property. While this property allows you some control, it is
just a suggestive value and not a hard value that forces the Merge Join to use only the
specified number of buffers. If the transformation feels that there is a risk of thread
deadlocking or it doesn’t have enough data to perform, it will increase the number of
buffers required to some other value than what has been specified. However, it does
stay within specified limits if no such risk exists. The default value for this property is
five buffers per input, which works well with most of the requirements. You can specify
a larger value to improve performance if you are working with large data sets and have
enough free memory on the server. Alternatively, you can decrease the value from the
default value if the server is already struggling with memory pressure. But as you can
imagine, reducing this value to 0 will disable all throttling and can adversely affect
performance. You need to be a bit more cautious when configuring this transformation and choose a realistic setting, as memory usage can go well off the mark when the value is set too low for a large data set. Another issue has been observed
when both the input data streams hit this transformation at wildly different times.
Testing your data flow is the best recommended approach for finding the right balance
with this transformation.
Last, you can specify whether to treat null values as equal by using the TreatNullsAsEqual property of the transformation; the default for this property is True. If you decide not to treat nulls as equal, the transformation treats nulls the same way the database engine does.

Cache Transform
The Lookup transformation in Integration Services 2008 can use an in-memory cache
or a cache file to load reference data to perform lookups on the data in pipeline. The
in-memory cache or the cache file needs to be created for the lookup transformation
before it starts execution. This means you can create a reference data cache either
before the package starts running or during the same package execution, but before
the lookup transformation. Using a Cache Transform, you can write distinct data rows to a Cache Connection Manager that, depending upon how you have configured it, keeps this data in an in-memory cache or also persists it to a file on the hard disk. The user interface of the Cache Transformation Editor is simple and has
only two pages. In the Connection Manager page, you specify the Cache Connection
Manager, and in the Mappings page, you map Input columns to the destination
columns being written to a Cache Connection Manager. Here, you have to map all
the input columns to the destination columns; otherwise, the component will throw an
error. If you change the data type of a column later—e.g., you increase the length of
an input column—the metadata can be corrected in the Cache Connection Manager,
which displays the columns and their metadata in the Columns tab. While configuring
the Cache Connection Manager, you will also need to configure an Index Position for
the columns. By default, the index position is 0 for all the columns, indicating that
these columns are non-index columns. You can assign positive integer values such as 1, 2, 3, and so on to the columns that participate in indexing. The number
assigned to an index position on a column indicates the order in which the Lookup
transformation compares rows in the reference data set to rows in the input data
source. The selection of the columns as index columns depends upon which columns
you need to look up while configuring a Lookup transformation. For example, if you
are matching for car details based on the manufacturer, the model description, and the
body style, you will assign the index positions as shown in this table:
Column Name          Index Position   Description
Manufacturer         1                This column participates in the lookup operation and should be
                                      mapped to the input column in a Lookup transformation. This is
                                      the first column on which the comparison will be made.
Model Description    2                This column participates in the lookup operation and should be
                                      mapped to the input column in a Lookup transformation. This is
                                      the second column on which the comparison will be made.
Body Style           3                This column participates in the lookup operation and should be
                                      mapped to the input column in a Lookup transformation. This is
                                      the third column on which the comparison will be made.
All other columns    0                These columns do not participate in the lookup operation
                                      mappings; however, they can be selected in order to add them
                                      into the data flow.
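As a hedged illustration (the table and column names are hypothetical), the source query feeding such a Cache Transform might select just the distinct reference rows the lookup will need:

SELECT DISTINCT Manufacturer, ModelDescription, BodyStyle, ListPrice
FROM   dbo.CarCatalog;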
As a Cache Transform writes data to memory, the Cache Connection Manager gets tied to that cache. Hence, you cannot use multiple Cache Transforms to write to the same Cache Connection Manager. The Cache Transform that gets called first during the package execution writes data to the Cache Connection Manager, while all subsequent Cache Transforms fail. As
mentioned earlier, you can create a cache file using a cache transform in a separate package
or in the same package that runs before the Lookup transformation. So, once the reference
data is persisted to a cache file, which is a raw file, you can use this cache file among
multiple data flow tasks within the same package, between multiple packages on the same
server, or between multiple packages on different servers. The only consideration you have
to keep in mind is that the data in the cache file will be current only as of the time it was loaded into the cache file. If your reference data doesn't change that often, you can use this technique
to perform quite fast lookup operations.
Lookup Transformation
With the ever-increasing use of the Web to capture data, lookup operations have
become quite important. Web page designers tend to ask users to fill in the most
critical data in a web form, and the form fills in the rest of the information for them. For example, you may ask a visitor on your site to fill in a street address and postal code during registration, and based on these two pieces of information, you can fill in the city and state address fields. This is done by looking up a database table that contains all the postal information, keyed on postal codes. So, you simply look for the row that contains the particular postal code and you will be able to complete the address fields. You can perform such lookup operations in Integration Services as well, using
the Lookup transformation. While loading a data warehouse, you will frequently use a
Lookup transformation to look up surrogate keys in a dimension table by matching the
business key.
This transformation lets you perform lookups by joining the input columns in the
data flow with columns in a reference data set. The resulting values can be included in the data flow as new columns, or you can replace existing column values. For example, when the value in the postal code column of a data flow row is equal to a value in the postal code column of the reference data set, the Lookup transformation will be able to get data for all the other address columns. This
equality operation makes this join an equijoin, requiring that all the values in the
data flow match at least one value in the reference data set. In a complex lookup
operation, you may be joining multiple columns in the input to the multiple columns
in the reference data set.
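In relational terms, and assuming a hypothetical reference table of postal data, the equijoin performed by the Lookup transformation is roughly equivalent to:

SELECT s.CustomerID, s.StreetAddress, s.PostalCode,
       r.City, r.State
FROM   dbo.RegistrationInput s
INNER  JOIN dbo.PostalReference r
       ON r.PostalCode = s.PostalCode;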
This transformation has undergone a major upgrade since its predecessor in Integration
Services 2005. The user interface has also changed and provides more flexibility and
performance improvements. The Lookup Transformation Editor provides five pages
to configure its properties. When you open the editor, you will choose the options in
the General page to decide how you want to configure this transformation. Let’s start
with the simplest one—the Lookup transformation allows you to configure its outputs
on the basis of how you want to handle the non-matching rows. The “Specify how to
handle rows with no matching entries” option in the General page provides you with four
options to choose from:
- Ignore failure
- Redirect rows to error output
- Fail component (default)
- Redirect rows to no match output

When the transformation looks up an input data key against the reference data set,
you can either have a match or no match. When the input row key is found in the
reference data set, the row is called a match row and is sent to the match output. If a
key in the input row is not found in the reference data set, the row is called a no-
match row and this option allows you to decide how you want to handle such rows.
In Integration Services 2005, no-match rows are treated as errors and are sent to
error output, whereas in Integration Services 2008, you can choose to redirect non-
matching rows to a no-match output. Choosing an option other than “Redirect rows
to no match output” will treat no-match rows as error rows and will send them to
error output.
Next you can select how you would like to cache your reference data set. The Cache
Mode section provides three options to choose from:
- Full cache: Selecting this option will force the reference set to be prepared and loaded into the cache memory before the lookup operation is performed. For example, if you are looking for postal addresses, all the rows with postal information will be loaded into the memory before the actual lookup is performed against this reference data set. As you can imagine, lookup against a reference set that is held in memory will be faster compared to the lookup operation where the component has to make a connection to the outside data store to get the lookup values. Full cache mode is the best-performing option if you have enough physical memory on the server to hold the complete reference data set. When you choose this option, two things happen in the user interface:
  - The Advanced page is disabled. The options in this page are not used when using full cache mode.
  - The Cache Connection Manager option becomes available. The Lookup transformation can use either an OLE DB Connection Manager or a Cache Connection Manager in the case of full cache mode. This feature is a new addition in this version. The Cache Connection Manager can read the reference data from a cache file that can be prepared separately from the lookup process. You can actually prepare the reference data set separately, in the same Data Flow task before the Lookup transformation, in an earlier Data Flow task in the same package, or even in a separate package using a Cache transform. When the data is prepared and persisted to a cache file (.caw), the Lookup transformation can load the data from the cache file faster than from the relational sources using a Cache Connection Manager. So, by utilizing full cache mode and sharing a cache file between multiple lookups within the same package or in multiple packages, you can achieve high levels of performance, especially when the reference data set is not small. The only caveat here is that your reference data set must not change. If it changes with each data load, you have to be careful while using the Cache transform and the Cache Connection Manager in your package.
- Partial cache: Selecting this option will enable you to apply limits on the use of memory for caching reference data. When you select the partial cache mode, the Advanced page options become available, but the Cache Connection Manager option is disabled, as it is used only in the full cache mode. So, you can only use the OLE DB Connection Manager to collect reference data from a relational source. In the Advanced page, you can limit the cache size to a value that you can
