a randomly selected 15 percent of the total records in the segment. Another example could be that you need to create a training data set and a test data set for your data mining models and want to divide your data set based on a defined percentage. In such a case, you can use the Percentage Sampling transformation to create a sample data set from the input rows using the specified percentage. This transformation helps you create a representative, much smaller data set that you can use for a variety of purposes, such as testing your packages in a development environment or using the sample data set for surveys and marketing purposes.
The Percentage Sampling transformation has a simple and intuitive user interface with just four fields. The first field, Percentage Of Rows, allows you to specify the sampling percentage. The Percentage Sampling transformation uses an algorithm to select rows at random according to the specified percentage. However, the number of rows this transformation selects does not precisely match the percentage calculation; the output may contain slightly more or slightly fewer rows. The next two fields allow you to specify the names for the selected sample output rows and the remaining unselected output rows. As this transformation selects rows for sampling, it sends those rows to its first output and the remaining unselected rows to its second output. This transformation supports one input and two outputs, one each for the selected and unselected data sets, and has no error output.
The last field is a check box and value pair: select the check box if you want to specify a sampling seed, and type a value in the field provided. If you specify a sampling seed and reuse the same seed in a later run, the transformation will produce the same sample output no matter how many times you run the package with the same data set. This is helpful when testing packages. Alternatively, if you don’t specify a sampling seed, this component will generate a random number using the tick count of the operating system. Hence, each time you run the package, a different random number is generated and a different data set is sampled.
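To make the seed behavior concrete, here is a minimal conceptual sketch in Python. It is not the transformation’s actual implementation, and the function and variable names (percentage_sample and so on) are illustrative only; it simply shows why a fixed seed reproduces the same approximate-percentage sample while a clock-derived seed gives a different sample on every run.

# Conceptual sketch (Python, not SSIS): effect of a sampling seed on an
# approximate percentage sample. All names here are illustrative.
import random
import time

def percentage_sample(rows, percentage, seed=None):
    # No seed supplied: mimic the transformation by deriving one from the
    # system clock, so every run selects a different subset.
    rng = random.Random(seed if seed is not None else time.monotonic_ns())
    selected, unselected = [], []
    for row in rows:
        # Each row is kept with the given probability, so the selected
        # count only approximates the exact percentage.
        (selected if rng.random() < percentage / 100.0 else unselected).append(row)
    return selected, unselected

rows = list(range(1000))
s1, _ = percentage_sample(rows, 15, seed=42)
s2, _ = percentage_sample(rows, 15, seed=42)
assert s1 == s2     # same seed, same data: identical sample
print(len(s1))      # roughly 150, rarely exactly 150
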
Row Sampling Transformation
The Row Sampling transformation works quite similarly to the Percentage Sampling transformation to sample a data set. However, the Row Sampling transformation outputs the exact number of rows specified in the transformation. This random selection of a precise number of rows is sometimes very useful. An example of such a scenario is allocating gifts to a random selection of people. Suppose you’re running a campaign to introduce your new product to different segments of your customers and prospects by sending them an e-mail every week. To promote readers’ interest, you decide to award gifts to a random selection of 50 persons who show interest in your product by evaluating it every week. You can easily build this package by bringing into the data flow the records for the persons who evaluated the product in the current week and then applying a Row Sampling transformation to select 50 persons out of these records.
In the Row Sampling Transformation Editor, you specify the number of rows
you want to output. This transformation supports two outputs—one for extracting
the selected records and the other for unselected records. You can type in the names
for both the outputs in the user interface. It is not necessary for you to configure a
downstream data flow to capture the unselected records. You can simply ignore this
output, and the records appearing on this output will not be included in the data flow.
You can choose to specify a random seed for the selection of records by selecting the “Use the following random seed” check box. This transformation selects random records on the basis of an algorithm that uses the random seed. If you specify the same random seed, the algorithm will select the same random records for the same input data. When you check this option, a message will pop up to tell you that using the same random seed on the same input data always generates the same sample, and that specifying a random seed is recommended only during the development and testing of a package. When you don’t specify a random seed, the transformation uses the tick count of the operating system to create the random number, which is different each time you run the package; hence, the selected random records will be different even for the same input data.
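The same idea can be sketched conceptually in Python. This is not SSIS code; row_sample and the other names below are made up for illustration. The point is simply that an exact number of rows is drawn, and that passing a fixed seed makes the draw repeatable for testing.

# Conceptual sketch (Python, not SSIS): drawing an exact number of rows,
# with an optional seed for repeatable selection during testing.
import random
import time

def row_sample(rows, n, seed=None):
    rng = random.Random(seed if seed is not None else time.monotonic_ns())
    chosen = set(rng.sample(range(len(rows)), n))   # exactly n distinct row indexes
    selected = [r for i, r in enumerate(rows) if i in chosen]
    unselected = [r for i, r in enumerate(rows) if i not in chosen]
    return selected, unselected

evaluators = [f"person_{i}" for i in range(500)]
winners, others = row_sample(evaluators, 50, seed=7)
print(len(winners))   # always exactly 50
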
Pivot Transformation
Relational databases are modeled to store normalized data. This normalization changes the data view in a way that sometimes may not be as intuitive as businesses desire. The process used to convert data from a normalized form to a denormalized form is called pivoting. To understand how you can use this transformation, you will work through a Hands-On exercise later; but for now, let’s discuss what normalized data is by looking at the data we are going to use.
The data you will use in an exercise later is in an Excel spreadsheet that keeps sales order details in three columns: SalesOrderID, ProductName, and OrderQuantity. To keep the data in a normalized form, the table contains multiple entries, or rows, for the same SalesOrderID. For example, if three products have been purchased under a single SalesOrderID, the normalized data is represented by listing three rows for the same SalesOrderID with a different ProductName in each row to show the purchase quantity for that product. But the sales manager may prefer to see the sales order details with the products and the quantity for each of them listed against the SalesOrderID in the same row. This is when you need to use the pivot function to denormalize the data.
Before SSIS made it available, the pivot function was available in Microsoft Excel or
third-party tools, or you had to write custom code to accomplish the task. Integration
Services now provides both a Pivot transformation and an UnPivot transformation to
provide different data views or forms. The Pivot transformation converts a normalized data set into a less normalized form by pivoting the input data on a column value. The role a column performs in pivoting is defined by its PivotUsage property, which takes a value from 0 to 3. The column that the data is pivoted around forms the set key for pivoting, and this column is assigned a value of 1 for the PivotUsage property. Assigning a value of 1 to the PivotUsage property of an input column indicates that it is part of the set key of a single-row or multirow set. In our example, the SalesOrderID column will act as the set key for pivoting, and the multiple rows with the same SalesOrderID will be combined into one row.
When the data is pivoted, the values in a column, called pivot key values, are pivoted to columns in the output. To specify the values to be pivoted to output columns, you assign a value of 2 to the PivotUsage property of that input column. In our example, the ProductName column has 10 different values for the products. When this data is pivoted, 10 columns are created on the basis of the 10 different values in the ProductName column; e.g., the value Mountain-100 of the ProductName column becomes the Mountain-100 column in the pivoted output. These newly created columns get their values from a third column, which you mark by assigning a value of 3 to its PivotUsage property. All other input columns that don’t participate in the pivoting process are assigned a PivotUsage value of 0; for a set of input rows that has the same set key, the first input value in such a column is copied to the output column.
The Pivot transformation pivots the data on the basis of the set key column value. For the same value of the set key, the Pivot transformation merges multiple rows into a single row and pivots the input rows into columns. This implies that if the data is not sorted so that rows with the same set key value appear together, this transformation will output the same key values multiple times. In our example, to get only one record for a SalesOrderID, the data must be sorted on SalesOrderID. However, if the data is not sorted on SalesOrderID, this transformation will generate multiple records for the same SalesOrderID, as it will pivot the rows to columns each time the value of SalesOrderID changes.
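The following is a minimal conceptual sketch in Python, not the transformation’s internal code, that mimics this behavior: it merges consecutive rows sharing the set key (PivotUsage 1), turns pivot key values (PivotUsage 2) into column names, and fills those columns from the value column (PivotUsage 3). The sample order IDs and quantities are made up for illustration.

# Conceptual sketch (Python, not SSIS): pivot by merging consecutive rows
# that share the set key, the same way the Pivot transformation does.
from itertools import groupby

rows = [  # already sorted on SalesOrderID, as the transformation requires
    {"SalesOrderID": 43659, "ProductName": "Mountain-100", "OrderQuantity": 1},
    {"SalesOrderID": 43659, "ProductName": "Road-150", "OrderQuantity": 3},
    {"SalesOrderID": 43660, "ProductName": "Road-150", "OrderQuantity": 1},
]

pivoted = []
for order_id, group in groupby(rows, key=lambda r: r["SalesOrderID"]):
    out = {"SalesOrderID": order_id}          # set key (PivotUsage 1)
    for r in group:
        # pivot key value (PivotUsage 2) becomes the output column name;
        # the value column (PivotUsage 3) supplies its contents
        out[r["ProductName"]] = r["OrderQuantity"]
    pivoted.append(out)

print(pivoted)
# [{'SalesOrderID': 43659, 'Mountain-100': 1, 'Road-150': 3},
#  {'SalesOrderID': 43660, 'Road-150': 1}]
# If the input were not sorted, groupby would emit the same SalesOrderID
# more than once, which is exactly the behavior described above.
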
All this may appear quite complex, but it is not that complicated when it comes to
configuring the Pivot transformation. Let’s see how to use this transformation to pivot
data from an Excel worksheet.
Hands-On: Pivoting Sales Order Records in an Excel Worksheet
The records exported from the Sales order database to an Excel worksheet are in the
normalized form—i.e., one sales order number appears in multiple rows to store details
for the products ordered against it. The sales manager wants to see details of all the
products ordered against each sales order in a single row.
Before starting this exercise, open the C:\SSIS\RawFiles\SalesOrders.xls file to
verify that the file has only one worksheet labeled Normalized. This exercise adds
another worksheet to this file; if it already has two worksheets, delete the second
worksheet and then start this exercise. Also, if you are using the provided package code,
you may get a validation error, as the Excel Destination used in the package looks for
the worksheets during this exercise. In this case, leave the worksheets as is.
Method
In this exercise, you will use a Pivot transformation to transform the given data to the required format and will put the pivoted data in a new worksheet. As the Pivot transformation pivots the data every time the set key column value changes, you will need to sort the data before the Pivot transformation.
Exercise (Add Connection Manager and Data Flow Task)
You will start this exercise by adding a new package to the Data Flow transformations project, and then adding an Excel Connection Manager to it.
1. Open the Data Flow transformations project in BIDS. Right-click the SSIS
Packages in the Solution Explorer and choose New SSIS Package. This will add
a new SSIS package called Package1.dtsx.
2. Rename the Package1.dtsx package to Pivoting SalesOrders.dtsx.
3. Right-click in the Connection Managers area and choose New Connection from the context menu. Select the Excel Connection Manager type from the list in the Add SSIS Connection Manager dialog box and click Add. Next, type C:\SSIS\RawFiles\SalesOrders.xls in the Excel file path field in the Excel Connection Manager dialog box. Leave the Excel Version selected as Microsoft Excel 97-2003 and make sure the “First row has column names” check box is checked. Click OK to add the Excel Connection Manager in the Connection Managers area. Rename it SalesOrders Connection Manager.
4. Drag the Data Flow Task from the Toolbox and drop it onto the Control Flow
Designer surface. Rename this task Transforming SalesOrders. Double-click it
to open the Data Flow tab and configure this task.
Exercise (Configure the Data Flow Task)
To configure a data flow for pivoting SalesOrders data, you will add an Excel source to extract data from the Normalized worksheet of the SalesOrders.xls file and then sort this data on SalesOrderID, as a Pivot transformation requires all the rows having the same set key to be adjacent in a sorted set before it can merge them into a single row. After sorting the data, you will configure the Pivot transformation to pump the pivoted data through to an Excel destination that will write the pivoted output to a new Excel worksheet.
5. From the Toolbox, drag and drop the Excel source onto the Data Flow Designer surface. Rename this adapter Normalized Data Source. Double-click the Normalized Data Source to open the Excel Source Editor. The SalesOrders Connection Manager will be listed in the OLE DB Connection Manager field for you. Select Normalized$ in the “Name of the Excel sheet” field.
6. Go to the Columns page and verify that all three fields have been selected from the Available External Columns. Click OK to close the editor.
7. Drop the Sort transformation from the Toolbox onto the Data Flow Designer surface just below the Normalized Data Source and join the two components using the green connector. Rename it Sort on SalesOrderID and double-click it to open the Sort Transformation Editor. Select the check box before the SalesOrderID column. This column will appear in the lower half of the dialog box with Sort Type set to ascending and Sort Order equal to 1. Leave these settings as is and click OK to close the editor.
8. Drop the Pivot from the Toolbox onto the Designer surface just below the Sort
on SalesOrderID. Connect the two transformations using the green arrow.
Double-click the Pivot to open the Advanced Editor for Pivot, as Pivot doesn’t
have a custom UI. In the Component Properties tab, change the Name field to
Pivot on ProductName.
9. Go to the Input Columns tab and select all three columns.
10. Move on to the Input and Output Properties tab. Expand the Input Columns
under Pivot Default Input on the left side of the dialog box to reveal the three
columns you have selected in the Input Columns tab. Click the SalesOrderID
to list the properties of this column on the right pane of the dialog box. Scroll
down in the properties to locate the PivotUsage property. Assign a value of 1 to
PivotUsage, indicating that this field will be treated as a set key. Now, click the
ProductName column and assign a value of 2 to its PivotUsage property to indicate
that this field is a pivot key field and the distinct values in this field will create
corresponding columns in the output. Next, click the OrderQuantity column and
assign a value of 3 to its PivotUsage property to indicate that the values from this
field will be populated in the columns generated by the pivot key column.
11. Expand Pivot Default Output and then click the Output Columns. You will
notice that no output column appears in this transformation yet. Click Add
Column and rename the newly added column SalesOrderID. This output
column needs to be linked to an input column. In the properties of this column,
locate the SourceColumn property. This property holds the lineage identifier
of an input column and tells the Pivot transformation to populate the output
column using values from the specified input column. To specify the source in
this column, click the SalesOrderID column under Input Columns and note the
LineageID (not the ID). Specify this value of LineageID in the SourceColumn
property of the SalesOrderID output field. The value 32 shown in the SourceColumn property in Figure 10-10 is the LineageID of the SalesOrderID input column.
Figure 10-10 Setting LineageID on the SourceColumn property
12. As you did in Step 11, add 10 more columns in the Output Columns and
rename them as per the following table. Also, assign values to the PivotKeyValue
and SourceColumn properties of the new output columns. Remember that
PivotKeyValue is one of the distinct values of the pivot key column, and SourceColumn indicates which input column supplies the data for the corresponding output column. The new columns added here will be populated by the OrderQuantity input column, so the LineageID of OrderQuantity will be used to populate all of them.
Output Column Renamed to PivotKeyValue SourceColumn
Column Mountain-100 Mountain-100 LineageID Value of OrderQuantity column
Column1 Mountain-200 Mountain-200 LineageID Value of OrderQuantity column
Column2 Mountain-300 Mountain-300 LineageID Value of OrderQuantity column
Column3 Road-150 Road-150 LineageID Value of OrderQuantity column
Column4 Road-250 Road-250 LineageID Value of OrderQuantity column
Column5 Road-350 Road-350 LineageID Value of OrderQuantity column
Column6 Touring-1000 Touring-1000 LineageID Value of OrderQuantity column
Column7 Touring-2000 Touring-2000 LineageID Value of OrderQuantity column
Column8 Touring-3000 Touring-3000 LineageID Value of OrderQuantity column
Column9 Sport-100 Sport-100 LineageID Value of OrderQuantity column
Refer to Figure 10-11 to see how this will look. In the figure, the value 42 in the SourceColumn property is the LineageID of the OrderQuantity input column. Click OK to close this transformation.
Figure 10-11 Configuring output columns for the Pivot transformation
13. Drop the Excel destination from the Toolbox just below the Pivot on ProductName
and connect the two transformations using the green arrow. Rename it as Pivoted
Data Destination. Double-click the Pivoted Data destination to open the Excel
Destination Editor. You will use the same connection manager and the same Excel
workbook. But you will add a new worksheet called Pivoted in the Excel workbook
to store the pivoted data. Click the New button opposite the “Name of the Excel
sheet” field and you will see a script to create a new worksheet in the Excel workbook.
Change the name Pivoted Data Destination that appears just after the CREATE TABLE keywords to Pivoted. This will create a new worksheet named Pivoted in the SalesOrders.xls file. Click OK and select the Pivoted sheet in the “Name of the Excel sheet” field.
14. Go to the Mappings page. As you click Mappings, you should see all the mappings
created for you automatically. Click OK to close the editor for this component.
Exercise (Add Data Viewers and Execute the Package)
After having configured all the data flow components, you are ready to execute the
package. However, in this exercise, you will also add data viewers before and after the
Pivot transformation to see how the data has been pivoted.
15. Double-click the data flow path connecting Sort on SalesOrderID and Pivot
on ProductName. In the Data Flow Path Editor, click the Data Viewers page,
and then click Add and add a grid type data viewer. Click OK twice to close the
editor window.
16. Similarly, add another grid type data viewer on the data flow path between Pivot
on ProductName and Pivoted Data Destination.
17. Press F5 to execute the package. As the Normalized Data Source extracts data and passes it to the Sort on SalesOrderID, you can see the Normalized Data Source turn green, indicating a successful extraction of data, with a total of 2,187 records extracted. The next two components will appear in yellow, as the data viewer is holding the execution process for you to check the data. Click the Detach button on the first data viewer to let the execution proceed. As you detach the first data viewer, you will see
the second data viewer pop up with populated data and the Pivot on ProductName
outputting 1,302 rows. As the data passes the Sort on SalesOrderID, it will turn
green to indicate a successful sort operation (see Figure 10-12). When you are
done checking the data, click the Detach button on the second data viewer and let
the package complete successfully.
Figure 10-12 Data viewers showing data before and after Pivot transformation
18. Stop debugging by pressing SHIFT-F5. Save and close the project. Open the
SalesOrders.xls file and check out the Pivoted worksheet to see how the data has
been pivoted.
Review
You’ve used a Pivot transformation to convert normalized data to a less normalized form. During this exercise, you sorted the data before sending it to the Pivot transformation and created output columns for the pivoted data. Finally, you created a new Excel worksheet using an Excel destination. Note that you’ve used only one connection manager to connect to the Excel file, both for extracting data with a source adapter and for loading data with a destination adapter.
Unpivot Transformation
This transformation works in the opposite way to the Pivot transformation and converts a denormalized data set into a more normalized version; i.e., one row may be broken down into multiple atomic rows so that they can be stored in a relational database. We will use the data you derived in the last exercise, in which each row lists multiple products against a single SalesOrderID. When you run this transformation, the records will be broken up into multiple rows containing the same value for SalesOrderID, but having only one product in each row. To support its functions, this transformation uses one input, one output, and an error output.
This transformation has a custom user interface that is much simpler than that of the
Pivot transformation. To get an idea of how to configure this transformation, consider
the pivoted data that you created in the preceding exercise. To get the data formatted
in the normalized form—i.e., from where you started in the last Hands-On—you will
configure the Unpivot transformation, as shown in Figure 10-13.
Following is the step-by-step method you will use to get the pivoted data back to
normalized form:
1. Use the Excel source to bring the Pivoted worksheet data in the data flow.
2. Add an Unpivot transformation and configure it. Let the set key column pass
through the transformation as is, which is SalesOrderID in this case. Then
select all the columns that you want to unpivot. As you select the check boxes for
Available Input Columns, the Input Column and the Pivot Key Value columns
will be filled in using the column name selected. After that, manually fill in
the Destination Column name where you want the values of input columns
to be populated, which is OrderQuantity in this example. Last, specify the
column name in which you would like the pivot key values to be populated. These settings indicate that the input column names will be converted into the values specified in Pivot Key Value and written to the pivot key value column (ProductName in this case), while the input column values will be written to the new column specified in Destination Column (OrderQuantity in this example); a conceptual sketch of this mapping follows these steps.
3. Finally, add an Excel destination to collect the normalized data in an Excel
worksheet.
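For readers who find it easier to see the mapping in code, the following is a minimal conceptual sketch in Python rather than SSIS; the row values are made up for illustration. It shows the same mapping the Unpivot transformation performs: each pivoted column name becomes a ProductName value, and the cell value lands in OrderQuantity.

# Conceptual sketch (Python, not SSIS): unpivoting each row back into one
# row per product, mirroring the Unpivot transformation's configuration.
pivoted = [
    {"SalesOrderID": 43659, "Mountain-100": 1, "Road-150": 3},
    {"SalesOrderID": 43660, "Road-150": 1},
]

normalized = []
for row in pivoted:
    for column, quantity in row.items():
        if column == "SalesOrderID" or quantity is None:
            continue  # pass the set key through; skip empty product columns
        # the former column name becomes the pivot key value (ProductName),
        # and its cell value goes to the destination column (OrderQuantity)
        normalized.append({"SalesOrderID": row["SalesOrderID"],
                           "ProductName": column,
                           "OrderQuantity": quantity})

print(normalized)
# [{'SalesOrderID': 43659, 'ProductName': 'Mountain-100', 'OrderQuantity': 1}, ...]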