Hands-On Microsoft SQL Server 2008 Integration Services, Part 37

data into the Integration Services data type. During design time, when you drop the
data flow source onto the Data Flow Designer surface and configure it to connect
to an external data source from which data will be read, the data flow source copies
the metadata—i.e., the schema of the data—to its external columns metadata. At
run time, when the data is being pulled in, the incoming data columns are parsed
into the Integration Services data types using the external columns metadata. In
the example, the names Steve and Sarah are parsed into the DT_STR data type; Sex, which is indicated using 1 and 0, has been assigned DT_BOOL; House_Number has been assigned DT_I4; and the Moved_in data has been assigned the DT_DBTIMESTAMP data type. Integration Services provides 29 different data
types to cover various types of data, such as character data, numerical data, Boolean,
dates, text, and image fields. Integration Services is particular about how these data types are used; hence, if data arrives with a data type that does not match the data types available in Integration Services, an error occurs.
We will explore many data types along the way as we progress with our Hands-On
exercises; however, to know more about each data type, refer to Microsoft SQL
Server 2008 Books Online.
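To see how these Integration Services data types surface in code, here is a minimal sketch of a Script Component configured as a source (scripting sources is covered in Chapter 11). It assumes you have defined an output with columns Name (DT_STR, length 50), Sex (DT_BOOL), HouseNumber (DT_I4), and MovedIn (DT_DBTIMESTAMP) on the Inputs and Outputs page; those names mirror the example's Name, Sex, House_Number, and Moved_in columns but are otherwise placeholders.

```csharp
using System;

// ScriptMain is the class the Script Component designer generates;
// UserComponent is its designer-generated base class exposing Output0Buffer.
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // DT_STR surfaces as String, DT_BOOL as Boolean,
        // DT_I4 as Int32, and DT_DBTIMESTAMP as DateTime.
        Output0Buffer.AddRow();
        Output0Buffer.Name = "Steve";
        Output0Buffer.Sex = true;                         // 1 in the source data
        Output0Buffer.HouseNumber = 250;
        Output0Buffer.MovedIn = new DateTime(2005, 11, 7);

        Output0Buffer.AddRow();
        Output0Buffer.Name = "Sarah";
        Output0Buffer.Sex = false;                        // 0 in the source data
        Output0Buffer.HouseNumber = 130;
        // A value such as 25-21-2004 could never be parsed into DT_DBTIMESTAMP;
        // a source reading it from a file would raise a data type error instead.
        Output0Buffer.MovedIn = new DateTime(2004, 12, 25);
    }
}
```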
The process that converts the source data into Integration Services data types is
called data parsing. Data flow components can be configured to use either fast parsing
or standard parsing. Fast parsing supports the most commonly used data types with
a simple set of routines and can be configured at the column level. Because it uses
simple routines and does not check for many other data types, it can achieve high levels
of performance.

Figure 9-1  A data flow source extracting and parsing data. (The figure shows the source reading the Name, Sex, House_Number, and Moved_in columns, typed varchar(50), bit, int, and datetime, from the external Person table; parsing them into DT_STR, DT_BOOL, DT_I4, and DT_DBDATE; and redirecting Sarah's row, which carries the invalid date 25-21-2004, to the error output with ErrorCode -1071607694 and ErrorColumn 46.)

However, it is not available for most of the data flow components
and can parse only a narrow range of data types. For example, fast parsing does not
support locale-specific parsing, special currency symbols, and date formats other than
year-month-date. Also, fast parsing can be used only with a Flat File source, a Data Conversion or Derived Column transformation, and a Flat File destination, because these are the only components that convert data between string and binary data types. You may use fast parsing for performance reasons when your data meets these requirements, but on all other occasions you will be using standard parsing. Before using fast parsing, check Books Online to find out the data types supported by fast parsing.
Standard parsing uses a rich set of parsing routines that are equivalent to OLE DB
parsing APIs and supports all the data type conversions provided by the automation
data type conversion APIs available in Oleaut32.dll and Ole2disp.dll. For example,
standard parsing provides support for locale-sensitive parsing and international data
type conversions.
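Neither parsing routine is something you call yourself, but the distinction is easy to picture with ordinary .NET date parsing: fast parsing behaves like a fixed, culture-neutral format string, while standard parsing behaves like a locale-aware conversion. The following C# fragment is only an analogy, not SSIS code:

```csharp
using System;
using System.Globalization;

class ParsingAnalogy
{
    static void Main()
    {
        // Fast-parse style: one narrow, culture-neutral format (year-month-day only).
        DateTime fastStyle = DateTime.ParseExact(
            "2005-11-07", "yyyy-MM-dd", CultureInfo.InvariantCulture);

        // Standard-parse style: locale-aware, so a day-month-year string is
        // accepted when the locale (here en-GB) says that is the expected order.
        DateTime standardStyle = DateTime.Parse("07/11/2005", new CultureInfo("en-GB"));

        Console.WriteLine("{0:yyyy-MM-dd} {1:yyyy-MM-dd}", fastStyle, standardStyle);
    }
}
```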
Returning to the data flow, as the data rows arrive and are parsed, they are validated
against the external columns metadata. The data that passes this validation check at
run time is copied to the output columns; the data that doesn’t pass is treated slightly
differently. You can define the action the component can take when a failure occurs.
Typical errors you can expect at the validation stage include data type mismatches
or a data length that exceeds the length defined for the column. Integration Services
handles these as two different types of errors—data type errors and data length or data
truncation errors. You can specify the action you want the data flow component to take
for each type of error from the three options—fail the component, ignore the error, or
redirect the failing row to error output fields. You can specify one of these actions on
all columns or different actions for different columns. If you redirect the failing rows to
the error output fields and link the error output to a data flow destination, the failing rows will be written to the data flow destination you specified.
The error output contains all the output columns and two additional columns for the
failing rows, ErrorCode and ErrorColumn, which indicate the type of error and the failing
column. In Figure 9-1, note that the record holding data for Sarah has a wrong date
specified and hence fails during the extract process. As the source was configured to redirect
rows, the failing row data is sent to the Error Output. Also, note that the two rightmost
columns indicate the type of error and the column number. Every output field is assigned
an ID automatically, and the number shown in the ErrorColumn is the ID number of the
column failing the extract process. If two columns fail on the same row, the column that
fails first is captured and reported. So, in the case of multiple failures on a row, you might not learn about the remaining failures until the package fails again after you have fixed the first error.
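The ErrorCode value is a raw numeric code (for example, -1071607694 in Figure 9-1). One common way to make it readable, sketched here under the assumption that you attach a Script Component transformation to the error output and add a string output column named ErrorDescription, is to translate the code with GetErrorDescription:

```csharp
// Inside the ScriptMain class generated for a Script Component (transformation)
// whose input is connected to an upstream component's error output.
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // ErrorCode flows in with the redirected row; GetErrorDescription
        // returns the corresponding error message text.
        Row.ErrorDescription =
            this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);
    }
}
```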
After pulling in the data rows, a data flow source passes the output rows to the next
data flow component—generally a transformation—and the failing rows are passed to
another data flow component—which could be a data flow destination, if configured
to redirect the failing rows. You can also redirect the failing rows to an alternate branch in
the data flow via a transformation in which you can apply logic to correct the failing rows
and bring them back into the main data flow after corrections. This is a highly useful
capability that, if used properly, can reduce wastage and improve data quality. When the
output of a data flow source is connected to the input of a data flow transformation,
the data flows from the input of the transformation, through the transformation's processing logic, and then to its output. Based on the transformation logic, some rows may fail the process and be sent to the error output. The main difference to note in
comparison to a data flow source is that the transformations do not have external columns
and have input columns instead. Figure 9-2 shows the functional layout of a data flow
transformation.
Figure 9-2  Data flow transformation showing Input, Output, and Error Output columns

Finally, after passing through the data flow source and the data flow transformations, the data will reach a data flow destination so that it can be written to an external data store. Like data flow sources, a data flow destination can also read the schema information from
the external data store and copy the metadata to the external columns. When the data
flows through the data flow destination, it gets validated against this metadata before being
written to the external data store. If configured to redirect the failing rows, the failing rows may be sent to the error output columns while the successful rows are written to
the external data store. Note that a data flow destination does not have output columns.
Figure 9-3 shows the data flow through a data flow destination.
At this point it is worth mentioning that a pipeline or a data flow path usually
terminates in a data flow destination; however, that is not necessary. Sometimes, you
may prefer to terminate a data flow path in a transformation. For example, while testing
and debugging, you can break and terminate the data flow path by adding a Row Count
transformation to see what happens at a particular point in the data flow path while ignoring the rest of the pipeline. With the preceding description of how data is extracted from a data source and how it flows through the transformations and destinations before being written to an external data store, let's now study these components in more detail and see how many types of data flow sources and destinations are available.
Figure 9-3  Data flow destination showing the flow of data to the external data store. (The figure shows the destination's external columns metadata for the Person table, typed varchar(10), bit, int, and datetime and mapped to DT_STR, DT_BOOL, DT_I4, and DT_DBDATE; one row for Steve passing validation; and one row for Christopher being redirected to the error output with an ErrorCode and ErrorColumn.)
Data Flow Sources
While building a data flow for a package using BIDS, your first objective is to bring the
data inside the Integration Services data flow so that you can then modify the data using
data flow transformations. Data flow sources are designed to bring data from the external
sources into the Integration Services data flow. A data flow source reads the external data
source, such as a flat file or a table in relational database, and brings in the data to the
SSIS data flow by passing this data through the output columns on to the downstream
component, usually a data flow transformation. During design time, the data flow source
keeps a snapshot of the metadata of external columns and can refer to it at run time. If
the data in certain rows doesn’t match with this schema at run time, the data sources can
be configured to redirect those rows to the error output columns that can be dealt with
separately. Integration Services provides six preconfigured data flow sources to read data
from a variety of data sources, plus a script component that can also be scripted as a data
flow source. If existing data flow sources do not meet your requirements, you can always

build yourself a custom data flow source using the Integration Services object model.
Scripting a data flow source has been discussed in Chapter 11.
The following two tables list the preconfigured data flow source adapters and the
interfaces they have.
Source             Description
ADO NET Source     Extracts data from .NET Framework data providers, such as SQL Server 2008, using an ADO.NET Connection Manager.
Excel Source       Extracts data from an Excel worksheet using the Excel Connection Manager.
Flat File Source   Reads data from a text file using the Flat File Connection Manager.
OLE DB Source      Extracts data from OLE DB–compliant relational databases using the OLE DB Connection Manager.
Raw File Source    Extracts data from a raw file using a direct connection.
Script Component   Hosts and runs a script that can be used to extract, transform, or load data. Though not shown under data flow sources in the Toolbox, the Script Component can also be used as a data flow source. This component is covered in Chapter 11.
XML Source         Reads data from an XML data source by specifying the location of the XML file or a variable.
Data Flow Source   Input   Output   Error Output   Custom UI   Connection Manager
ADO NET source     No      1        1              Yes         ADO.NET Connection Manager
Excel source       No      1        1              Yes         Excel Connection Manager
Flat File source   No      1        1              Yes         Flat File Connection Manager
OLE DB source      No      1        1              Yes         OLE DB Connection Manager
Raw File source    No      1        No             Yes         Not required
XML source         No      1        1              Yes         Not required
From the preceding table, you can see that not all data flow sources have an Error Output, and not all data flow sources require a connection manager to connect to the external data source; some, such as the XML source, can connect directly to the data source. But the important thing to understand here is that data flow sources don't have an input. They use the external columns interface to get the data and use output columns to pass the data to downstream components. We will study each of the data flow sources in more detail in the following topics.

ADO NET Source
When you need to access data from a .NET provider, you use an ADO.NET Connection
Manager to connect to the data source and then use the ADO NET source to bring the data inside the data flow pipeline. You can configure this source to use either a table or a view, or an SQL command, to extract rows from the external .NET source. The ADO NET source has one regular output and one error output. The ADO NET source has a custom UI, though you can also use the Advanced Editor to configure some of the properties that are not exposed in the custom UI. The custom UI has three pages to configure: Connection Manager, Columns, and Error Output.
- Connection Manager  Specify the ADO.NET connection manager that the ADO NET source uses to connect to the data source. Select one of the ADO.NET connection managers already configured in the package from the drop-down list provided under the ADO.NET Connection Manager field, or use the New button to add a new ADO.NET connection manager.
- Columns  This page shows you the list of columns read from the external .NET data source and cached into the External Columns interface. It also shows the corresponding Output Columns, which by default have the same names as the cached schema in the External Columns. You can change the Output Column names here if you want to call them something different inside your pipeline.
- Error Output  When outputting the columns read from the external data source, some rows may fail due to wrong data coming through. These failures can be categorized as errors or truncations. Errors can be data conversion errors or expression evaluation errors; the data may be of the wrong type (for example, alphabetical characters arriving in an integer field), causing errors to be generated. Truncation failures may not be as critical as errors; in fact, sometimes they are desirable. Truncation lets the data through but cuts off the characters beyond the specified length. For example, if you specify the city column as VARCHAR(10), all characters after the first ten are truncated whenever a value exceeds the ten-character length. You can configure this component, per column and separately for data type errors and truncations, to fail the component, ignore the error, or redirect the failing row to the error output. See Figure 9-4 for an explanation.
Figure 9-4 Setting error conditions for data type mismatches and truncation errors
Fail Component c is is the default option for both errors and truncations
and will fail the data flow component, ADO NET Source in this case, when
an error or a truncation occurs.
Ignore Failure
c You can ignore the error or the truncation and carry on
outputting the data from this component.
Redirect Row
c You can configure the ADO NET Source to redirect the
failing row to the error output of the source adapter, which will be handled by
the components capturing the error output rows.
If you have many columns to configure with different settings, you may find it easier to use the Set This Value To Selected Cells field to apply a value to all of your selected cells.
Excel Source
When you need to work with data in Excel files, you can either use the OLE DB
Source with Microsoft Jet 4.0 OLE DB Provider or simply use an Excel source to
get the data into the pipeline. You will configure the Excel Connection Manager to
connect to an Excel workbook and then use that connection manager inside an Excel
source to extract the data from a worksheet and bring it into the pipeline. You can treat an Excel workbook as a database and its worksheets as tables while configuring the Excel source. A range in the Excel workbook can also be treated as a table or a view in a database. The Excel source adapter has one regular output and one error output.
This component has its own user interface, though you can also use the Advanced Editor to configure its properties. When you open the Excel Source Editor in the data flow designer, you will see that this source adapter also has three pages to configure.

- Connection Manager  The Connection Manager page opens by default, where you can select the connection manager from the drop-down list in the OLE DB Connection Manager field. The Data Access Mode drop-down list provides four options. Depending upon your choice in the Data Access Mode field, the interface changes the fields to ask for the relevant information.
- Table or view  This option treats the Excel worksheet as a table or a view and extracts data from the worksheet specified in the Name Of The Excel Sheet field.
- Table name or view name variable  When you select this option, the subsequent field changes to the Variable Name field. This option works similarly to the Table Or View option, except that instead of expecting the name of the worksheet or Excel range to be specified directly, it lets you specify the name of a variable in the Variable Name field, from which it reads the name of the Excel range or the worksheet.
- SQL Command  Selecting this option in the Data Access Mode field changes the interface to let you provide an SQL statement in the SQL Command Text field to access data from an Excel workbook. You can either type the SQL directly or use the provided Build Query button to build an SQL query. (I recommend you use this query builder to access data from the Excel sheet, even if you know how to write complex SQL queries for SQL Server, because there are some lesser-known issues with accessing data from Excel workbooks using SQL.) You can also use a parameterized SQL query, for which you can specify parameter mappings using the Parameters button; when you click Parameters, you get an interface that lets you map a parameter to a variable. A sketch of this kind of worksheet query appears at the end of this Excel source discussion.
- SQL Command From Variable  This option works like the SQL Command option, except that it reads the SQL statement from the variable specified in the Variable Name field.
- Columns  You can map external columns to output columns on this page and can change the names of the output columns.
- Error Output  As explained for the ADO NET source, you can configure this page to fail the component, ignore the error, or redirect the row in case an error occurs in a data column.
While the Excel Source Editor allows you to configure the properties for the Excel source, you may need to use the Advanced Editor to configure the properties not exposed by the Excel Source Editor. These properties include assigning a name and description to the component, specifying a timeout value for the SQL query, or, most important, changing the data type of a column. While working with the Advanced Editor, get acquainted with the various options available in its interface.
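To give a feel for the SQL the Excel source accepts, the following C# sketch runs a worksheet query through the same Microsoft Jet 4.0 OLE DB provider that the Excel Connection Manager uses. It is only a way to try out query text outside the package; the file path, sheet name, and column names are hypothetical, and the ? placeholder shows how a parameterized query is written.

```csharp
using System;
using System.Data.OleDb;

class ExcelQueryTest
{
    static void Main()
    {
        // Hypothetical workbook path; "Excel 8.0" matches the .xls format
        // that the Jet 4.0 provider reads.
        var connectionString =
            "Provider=Microsoft.Jet.OLEDB.4.0;" +
            @"Data Source=C:\SSIS\RawFiles\Persons.xls;" +
            "Extended Properties=\"Excel 8.0;HDR=YES\"";

        // A worksheet is addressed as [SheetName$]; ? marks a parameter.
        var sql = "SELECT Name, House_Number, Moved_in FROM [Sheet1$] WHERE Moved_in >= ?";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@movedIn", new DateTime(2005, 1, 1));
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}\t{1}\t{2}", reader[0], reader[1], reader[2]);
            }
        }
    }
}
```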
Flat File Source
The Flat File source lets your package read data from a text file and bring that data into
the pipeline. You configure a Flat File Connection Manager to connect to a text file
and specify how the file is formatted. You will also specify the data type and length of each column in the Flat File Connection Manager, which sets guidelines for how the Flat File source handles the data. The Flat File source can read a delimited, fixed
width, or ragged right–formatted flat file. To know more about these file types, refer to
Chapter 3.
The Flat File source has a custom user interface that you can use to configure its properties. Also, as with the Excel source adapter, its properties can be configured using the Advanced Editor. When you open the Flat File Source Editor, the Connection Manager page opens by default. You can select the connection manager from the drop-down list provided in the Flat File Connection Manager field. Flat files contain nothing for null values, and if you want to keep these null values, check the "Retain null values from the source as null values in the data flow" box. By default, this check box is unchecked, which means the Flat File source will not keep null values in the data but will replace them with the appropriate default values for each column type, for example, empty strings for string columns and zero for numeric columns. Note that this option is relevant only for files in a delimited format. Fixed width and ragged right-format files do not contain blank spaces; their fields are padded with a padding character to the maximum width, so the data cannot be treated as null values by the Flat File source adapter. These format settings for the flat file are done in the Flat File Connection Manager.
The Columns and Error Output pages can be configured as described for the Excel source adapter. You can use the Columns page of the Flat File source to map external columns to the output columns, and the Error Output page to configure the error and truncation behavior when mismatched data comes along, using the three options of Fail Component, Redirect Row, or Ignore Failure.
This source has two important custom properties—FastParse and UseBinaryFormat—
that are exposed in the Advanced Editor. The configurations for both these properties
are done in the Input and Output Properties tab of the Advanced Editor for the Flat
File source. Depending on the data you are dealing with, you can set the FastParse
option for each of the columns by selecting the column in the Output Columns section
and then going to the Custom Properties category of the column properties. By default,
the FastParse option is set to False (see Figure 9-5), which means the standard parsing
technique will be used. (Remember that standard parsing is a rich set of routines that can provide extensive data type conversions, whereas fast parsing is a relatively simple set of parsing routines that supports only the most commonly used date and time formats, without any locale-specific data type conversions.)
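FastParse is normally switched in the Advanced Editor, but if you build or maintain packages programmatically you can set the same column-level property through the pipeline object model. The sketch below assumes an existing package whose data flow task, Flat File source, and date column carry the names shown; all of those names, and the file path, are placeholders.

```csharp
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class FastParseSetter
{
    static void Main()
    {
        // Load an existing package from its .dtsx file (hypothetical path).
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\SSIS\Packages\ImportPersons.dtsx", null);

        // Reach the pipeline inside the data flow task.
        TaskHost dataFlowTask = (TaskHost)package.Executables["Data Flow Task"];
        MainPipe pipeline = (MainPipe)dataFlowTask.InnerObject;

        // Locate the Flat File source by the name it has in the designer.
        IDTSComponentMetaData100 flatFileSource =
            pipeline.ComponentMetaDataCollection["Flat File Source"];

        // Turn on FastParse for a single output column, here a date column.
        IDTSOutputColumn100 movedIn =
            flatFileSource.OutputCollection[0].OutputColumnCollection["Moved_in"];
        movedIn.CustomPropertyCollection["FastParse"].Value = true;

        app.SaveToXml(@"C:\SSIS\Packages\ImportPersons.dtsx", package, null);
    }
}
```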
The second property, UseBinaryFormat (also shown in Figure 9-5), allows you to let
the binary data in the input column pass through to the output column without parsing.
Sometimes you have to deal with binary data, such as data with the packed decimal
format, especially when you’re receiving data from a mainframe system or the data source
is storing data in the COBOL binary format—for example, you might be dealing with

IBM EBCDIC–formatted data. In such cases, you may prefer that the Flat File source not parse the data; rather, you would parse it separately using special rules built around how the data was packed into the column in the first place. By default, UseBinaryFormat is set to False, which means the data in the column will be parsed using the Flat File source parsing techniques. To use this property, set UseBinaryFormat to True and the data type of the output column to DT_BYTES, to let the binary data be passed on to the output column as is.
