Tải bản đầy đủ (.pdf) (10 trang)

Hands-On Microsoft SQL Server 2008 Integration Services part 50 docx

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (712.53 KB, 10 trang )

468 Hands-On Microsoft SQL Server 2008 Integration Services
Hands-On: Removing Duplicates from Owners Data
The Fuzzy Lookup and Fuzzy Grouping transformations are the data flow components
that you can use for data cleaning purposes. One of the main issues with data quality is
to remove duplication in data, whether it occurs at loading time or it already exists in
the data that you want to cleanse. Until now, you’ve worked on a couple of instances to
remove duplicates in the earlier Hands-On exercises, but those instances were dealing
with exact duplicates. In this exercise, you will deal with fuzzy duplication of data. You
will work with these components to remove exact as well as fuzzy duplicates from the
input data.
The scenario is that you are maintaining an Owner table that contains contact
details for the owners of your products. You regularly receive an Owner’s data feed that
sometimes contains duplicate data. This duplicate data is not consistent, as users tend
to provide their details differently at different occasions. You need to make sure no
duplicate record is added to the Owner table.
Method
You have the Owner table in the Campaign database and receive OwnersFeed.xls files
regularly that contain owner records. This Excel file can contain duplicate records for
the same person. The complication, however, is that these records may not be exact
duplicates, as persons provide their contact details differently at different occasions.
Our sample OwnersFeed.xls file contains 13 records, of which 3 are unique records and
the other 2 records have five variants each, with different name spellings and address
details; one of these 2 records already exists in the Owner table. Open the OwnersFeed
.xls file to have a look at the incoming data (see Figure 10-25).
Figure 10-25 Incoming data contains variants of duplicate records
Chapter 10: Data Flow Transformations 469
The owner with first name Johnathon already exists in the table and has five
different variants of contact details. The other owner with first name Kathrine is a
new owner, but has five different variants of contact details in the feed. You want to
load only four records into the Owner table consisting of three unique records plus
an owner record for Kathrine, while all the duplicate variants and the owner records


for Johnathon should be removed. You will remove the exact duplicates from the
OwnersFeed file using the Sort and Lookup transformation. After removing exact and
matched duplicates, you will use the Fuzzy Lookup transformation to remove fuzzy
matched duplicates. At this stage only the duplicate rows that exist in variant forms in
the pipeline would be left. You will remove these duplicate variants using the Fuzzy
Grouping transformation.
Exercise (Create Removing Duplicate Owners Package)
Start by creating a new package, adding the Data Flow task into this package, and then
adding an Excel source to get the data in the data flow.
1. Open the Data Flow transformations project in BIDS. Add a New SSIS Package
in the project and then rename it as Removing Duplicates.dtsx.
2. Drop a Data Flow task from the Toolbox onto the Control Flow surface and
rename it Removing Duplicates from OwnersFeed.
3. Double-click the Removing Duplicates from OwnersFeed task to go to the Data
Flow panel. Drop an Excel source from the Toolbox onto the Data Flow surface.
Rename the Excel source OwnersFeed.
4. Double-click OwnersFeed to open the Excel Source Editor. Click the New
button next to the OLE DB Connection Manager field. Type C:\SSIS\RawFiles\
OwnersFeed.xls in the Excel File Path field. Leave the “First row has column
names” option checked. Click OK to add an Excel Connection Manager and
return to the Excel Source Editor. Leave Table Or View selected in the Data
Access Mode field. Click in the Name of the Excel sheet field and select Owners$
from the drop-down list.
5. Go to the Columns page and verify that all the columns have been selected. Click
OK to close the Excel Source Editor.
6. Take an opportunity to rename the Excel Connection Manager as OwnersFeed
Connection.
7. Right-click anywhere on a blank surface in the Connection Managers area and
choose New Connection from the context menu. Select Excel from the Connection
Manager Type list and click Add. This will open the Excel Connection Manager

dialog box. Type C:\SSIS\RawFiles\DuplicateOwners.xls in the Excel File Path
field. Leave the First Row Has Column Names check box selected and click OK
to add this connection manager. Once it is added, rename this connection manager
DuplicateOwners Connection.
470 Hands-On Microsoft SQL Server 2008 Integration Services
Exercise (Remove Exact Duplicates)
If you look at the data in the Figure 10-25, you will see that Johnathon Skinner
and Kathrine Morris have duplicate records in the OwnersFeed. You will use a Sort
transformation to remove these duplicates. Then you will check the OwnersFeed
against the Owner table in the database to see whether any of the record exists in the
Owner table. Johnathon Skinner also exists in the Owner table and will be removed
from the input data flow using a Lookup transformation. The other variants of these
owners will remain unaffected and will have to be dealt with separately.
8. Drop the Sort transformation onto the Data Flow surface and join OwnersFeed
with it using the green path.
9. Double-click the Sort transformation to open the editor. Select all the Available
Input Columns starting with FirstName, as shown in Figure 10-26. Select the
check box for “Remove rows with duplicate sort values” and click OK to close the
editor. Rename the Sort transformation Removing Exact Duplicates by Sort Key.
10. Add a Lookup transformation to the data flow and join the Sort transformation
by dragging the green arrow onto it.
11. Double-click the Lookup transformation to open the Lookup Transformation
Editor. Select “Redirect rows to no match output” option in the “Specify how to
handle rows with no matching entries” field. Click the Connection page link in
the pane.
12. Add the localhost.Campaign connection manager using the New button next to
the OLE DB connection manager field. Then select [dbo].[owner] table from the
drop-down list in the “Use a table or a view” field.
13. Go to the Columns tab, and map all the Available Input Columns to the Available
Lookup Columns for the matching names as shown in Figure 10-27. Click to

select the OwnerID column in the Available Lookup Columns to add it as a new
column in the output. In this transformation, you are matching input records
with the records in the Owner table, and for the matching record, you are adding
OwnerID in the output to capture the ID of the record with which a match has
been found. Click OK to close this editor. Rename the Lookup transformation
Removing Exact Duplicates by Lookup.
14. Drop an Excel destination just below Removing Exact Duplicates By Lookup
and join it to the Excel Destination. You will be asked to select an Output in the
Input Output Selection dialog box. Select Lookup Match Output to connect to
Excel Destination. Rename this Excel Destination Exact Duplicates.
15. Double-click the Excel Destination to configure it. Choose DuplicateOwners
Connection in the OLE DB Connection Manager field if it’s not already selected.
Chapter 10: Data Flow Transformations 471
16. Click the New button next to the Name of the Excel sheet field and verify the
CREATE TABLE statement that it is creating an Exact Duplicates table—i.e.,
a worksheet in Excel. Click OK to accept. Select Exact_Duplicates in this field.
Go to the Mappings page to create the necessary column mappings that happen
automatically. Click OK to close this editor.
Figure 10-26 Removing exact duplicates using a Sort transformation
472 Hands-On Microsoft SQL Server 2008 Integration Services
Exercise (Remove Fuzzy Duplicates)
You’ve extracted exact duplicates from the Lookup transformation to an Excel destination.
Now you will redirect the remaining rows to a Fuzzy Lookup and extract the duplicates
using fuzzy match logic. You will be removing all the variants of Johnathon Skinner.
17. Drop a Fuzzy Lookup transformation onto the Data Flow surface to the right
of the Removing Exact Duplicates by Lookup component. Click Removing
Exact Duplicates by Lookup and drag and drop the second green arrow, which is
Figure 10-27 Removing exact duplicates using the Lookup transformation
Chapter 10: Data Flow Transformations 473
actually the Lookup No Match Output, onto the Fuzzy Lookup transformation.

You have redirected all the remaining rows that did not match in the exact lookup
operation to the Fuzzy Lookup transformation.
18. Double-click the Fuzzy Lookup transformation and make sure the Generate New
Index radio button is selected. Select the [dbo].[Owner] table from the drop-down
list in the Reference Table Name field. Go to the Columns tab.
19. Select the check box of the OwnerID column in the Available Lookup Columns
to add it in the fuzzy matched records, as shown in Figure 10-28.
20. Go to the Advanced page and set the Maximum number of matches to output
per lookup equal to 3. Slide the Similarity Threshold slider to 0.30. Click OK
Figure 10-28 Identifying fuzzy matched records with Fuzzy Lookup
474 Hands-On Microsoft SQL Server 2008 Integration Services
to close the Transformation Editor. Rename this transformation as Identifying
Duplicates by Fuzzy Lookup.
21. Drop the Conditional Split transformation just below the Identifying Duplicates
by Fuzzy Lookup transformation and connect both the components using a green
data flow path. Double-click the Conditional Split to edit it. You need to identify
the duplicates using _Similarity and _Confidence columns added by the Fuzzy
Lookup transformation in the pipeline. You may have to run your package several
times before you can get workable values for this identification. Here, I’ve already
done the work for this data to identify these values; add the following in the
Condition field:
_Similarity >= 0.60 && _Confidence >= 0.50
Type Fuzzy Lookup Matches in the Output Name field and Remaining Owners
in the Default Output Name field, as shown in Figure 10-29. Click OK to close
the editor. Rename this transformation Splitting Fuzzy Matched Duplicates.
22. Drop an Excel Destination below the Conditional Split transformation and join
both of these components using a green arrow. Select Fuzzy Lookup Matches in
the Output field of the Input Output Selection dialog box. Rename this Excel
Destination Fuzzy Matched Duplicates.
23. Double-click the Excel Destination to configure it. Select DuplicateOwners

Connection in the OLE DB Connection Manager field if it is not already selected.
24. Click the New button next to the Name of the Excel sheet field, verify in the
CREATE TABLE statement that is creating the Fuzzy Matched Duplicates
table—i.e., the worksheet in Excel—click OK to accept, and then select the
Fuzzy_Matched_Duplicates in this field.
25. Go to the Mappings page to create necessary column mappings automatically.
Note that in addition to the _Similarity and _Confidence columns, the Fuzzy
Lookup transformation has added one _Similarity_ColumnName column for
each column that participated in the fuzzy match. Click OK to close this editor.
Exercise (Remove Duplicates by Fuzzy Grouping)
Here you will remove duplicate variants of the Kathrine Morris. These records do not
exist in the reference table but appear multiple times in the input file in variant forms.
You will capture the Remaining Owners from Split transformation and use a fuzzy
grouping to identify canonical row and the likely duplicates.
26. Drop Fuzzy Grouping in the pipeline to the right of the Splitting Fuzzy Duplicates
component. Click Splitting Fuzzy Duplicates and then drag the available green
arrow from the Splitting Fuzzy Duplicates and drop it on the Fuzzy Grouping
transformation.
Chapter 10: Data Flow Transformations 475
27. Double-click the Fuzzy Grouping transformation and verify that the localhost.
Campaign is selected in the OLE DB Connection Manager field. Go to the
Columns tab. Note that all the Available Input Columns have Pass Through check
boxes selected. Uncheck all of them. Select only the columns that are coming from
OwnersFeed—i.e., from FirstName to Country, as shown in Figure 10-30. As you
select these columns, a line corresponding to each selected column will be added
in the grid. Type 0.25 in the Minimum Similarity column for the FirstName and
LastName rows.
Figure 10-29 Splitting fuzzy matched records from the Data Flow.
476 Hands-On Microsoft SQL Server 2008 Integration Services
28. Go to the Advanced tab and set the Similarity threshold to 0.50 using the slider.

Click OK to close this editor. Rename the Fuzzy Grouping transformation
Identifying Duplicates by Fuzzy Grouping.
29. Drop the Conditional Split transformation just below the Identifying Duplicates
by Fuzzy Grouping transformation and connect both the components using a
green data flow path. Double-click the Conditional Split to edit it. You need
to split the duplicates using _key_in and _key_out columns added by the Fuzzy
Figure 10-30 Identifying and grouping duplicates using the Fuzzy Grouping transformation
Chapter 10: Data Flow Transformations 477
Grouping transformation in the pipeline. Add the two rows in the grid by adding
the following in the Condition field:
_key_in == _key_out
_key_in != _key_out
Type Canonical Row in the first row and Fuzzy Grouped Matches in the second
row of Output Name field, as shown in Figure 10-31. Click OK to close the
editor. Rename this transformation Splitting Fuzzy Grouped Duplicates.
Figure 10-31 Splitting fuzzy grouped and unique records in the data flow

×