Summary
Writing dynamic SQL is definitely taking your game to a higher level. Code that creates code. Cool.
Always be aware of whether the parameters can be used to enable SQL injection.
A few key points from this chapter:
■ Build up the @SQLStr variable from the inside out: start with the dynamic list and then append the SELECT prolog.
■ If the WHERE clause is dynamic, chances are good the FROM clause will be also.
■ When writing dynamic SQL, add a PRINT statement to output the @SQLStr variable during development. It makes debugging much easier.
■ Use sp_executesql with parameters rather than concatenating user input (see the sketch after this list).
■ If there's a different way to make the code flexible, such as the parse and join method mentioned in the best practice, do that instead of dynamic SQL.
■ All the standard database integrity features (e.g., foreign keys) help defend against SQL injection.
■ Always think like a hacker: where could an SQL injection string be used to alter the intention of the code?
■ Dynamic SQL is not necessarily ad-hoc SQL. Never permit ad-hoc SQL to your database.
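As a quick reminder of the parameterized pattern, here's a minimal sketch; the table and column names are illustrative, not drawn from the chapter's examples:

-- Minimal sp_executesql sketch (table and column names are illustrative)
DECLARE @SQLStr NVARCHAR(1000),
        @City   VARCHAR(50) = 'Denver';

-- The user-supplied value stays a parameter, never concatenated into the string
SET @SQLStr = N'SELECT AddressID, City FROM dbo.Address WHERE City = @CityParm;';

PRINT @SQLStr;   -- echo the statement during development

EXEC sp_executesql
     @SQLStr,
     N'@CityParm VARCHAR(50)',
     @CityParm = @City;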
This concludes a ten-chapter discussion on T-SQL development that began with "what is a batch" and progressed to the point of code-generating batches. If you're up to writing code-generating code in T-SQL, you're doing well as a SQL Server database developer. I congratulate you.
Part V: Data Connectivity
IN THIS PART
Chapter 30
Bulk Operations
Chapter 31
Executing Distributed Queries
Chapter 32
Programming with ADO.NET
Chapter 33
Sync Framework
Chapter 34
LINQ
Chapter 35
Asynchronous Messaging with Service Broker
Chapter 36
Replicating Data
Chapter 37
Performing ETL with Integration Services
Chapter 38
Access as a Front End to SQL Server
As much as I'd like to think that Management Studio is the ultimate UI and there's no need for any other interface to SQL Server, the truth is that SQL Server needs to connect to nearly any possible data conduit.
Other than Chapter 5, ‘‘Client Connectivity,’’ all the code so far has
occurred inside SQL Server. Part V focuses on myriad ways that data can be
brought into and synchronized with SQL Server.
Some of the connectivity technologies are familiar: the simple but mighty bulk insert, distributed queries and linked servers, ADO.NET, replication, and Microsoft Access.
Other connectivity technologies are newer. Integration Services replaced DTS with SQL Server 2005. Service Broker was also introduced with SQL Server 2005. LINQ and the Sync Framework are new with SQL Server 2008.
If SQL Server is the box, then this part busts out of the box and pumps
data in and out of SQL Server.
Bulk Operations
IN THIS CHAPTER
Mass inserts from comma-delimited files
T-SQL ETL processing
Bulk insert
BCP
Performing bulk operations
Often, DBAs need to load copious amounts of data quickly, whether it's a nightly data load or a conversion from comma-delimited text files. When a few hundred megabytes of data need to get into SQL Server in a limited time frame, a bulk operation is the way to get the heavy lifting done.
XML’s popularity may be growing, but its file sizes seem to be growing even
faster. XML’s data tags add significant bloat to a data file, sometimes quadrupling
the file size or more. For very large files, IT organizations are sticking with CSV
(also known as comma-delimited) files. For these old standby files, the best way
to insert that data is a bulk operation.
In SQL Server, bulk operations pump data directly to the data file; how much is written to the transaction log depends on the recovery model:
■ Simple recovery model: No problem with recovery; the transaction log is used for current transactions only.
■ Bulk-logged recovery model: No problem with recovery; the bulk operation is minimally logged (only the extent allocations are recorded), but a subsequent log backup still copies every page from the newly allocated extents, which results in a large transaction log backup. One complication with bulk-logged recovery is that if bulk operations are undertaken, point-in-time recovery is not possible for the time period covered by that transaction log. To regain point-in-time recovery, the log has to be backed up.
■ Full recovery model: The minimal-logging optimization is not applied; the engine fully logs the inserts. To restart the transaction log recoverability chain following a bulk operation, perform a complete backup and then resume the transaction log backups. A minimal sketch of managing the recovery model around a bulk load follows this list.
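Here's a minimal sketch of that pattern; the database name, file paths, and staging table are illustrative assumptions, not the chapter's examples:

-- Switch to the bulk-logged model so the load is minimally logged
-- (database, table, and paths are illustrative assumptions)
ALTER DATABASE AdventureWorks SET RECOVERY BULK_LOGGED;

BULK INSERT dbo.AddressStaging
   FROM 'C:\ETL\Address.csv'
   WITH (ROWTERMINATOR = '\n');

-- Return to full recovery and back up the log to restore
-- point-in-time recoverability going forward
ALTER DATABASE AdventureWorks SET RECOVERY FULL;
BACKUP LOG AdventureWorks TO DISK = 'C:\Backup\AdventureWorks_Log.trn';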
For more details on recovery models and how to set them, see Chapter 41, "Recovery Planning." Details on the transaction log are covered in Chapter 66, "Managing Transactions, Locking, and Blocking."
Technically, the SELECT INTO syntax is also a bulk-logged operation, and it too is minimally logged rather than fully logged. SELECT INTO creates a table from the results of a SELECT statement; it is discussed in Chapter 15, "Modifying Data."
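For instance, a minimal sketch (assumes the AdventureWorks sample database; the destination table name is illustrative):

-- SELECT INTO creates and populates a new table in one minimally logged operation
SELECT AddressID, AddressLine1, City
  INTO dbo.AddressArchive
  FROM Person.Address;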
Bulk insert operations are normally one step of an ETL (extract-transform-load) nightly process. While
developing these ETL processes in T-SQL is perfectly acceptable, Integration Services is a strong alterna-
tive, and it includes bulk operations. For more details about developing Integration Services solutions,
see Chapter 37, ‘‘Performing ETL with Integration Services.’’
Bulk insert is extremely fast, and I’ve had good success using it in production environ-
ments. My one word of caution is that the data must be clean. Variations in data type,
irregular columns, and missing columns will cause trouble.
Bulk operations can be performed with a command prompt using BCP (a command-prompt utility
to copy data to and from SQL Server), within T-SQL using the
BULK INSERT command, or using
Integration Services.
Bulk Insert
The BULK INSERT command can be used within any T-SQL script or stored procedure to import data
into SQL Server. The parameters of the command specify the table receiving the data, the location of the
source file, and the options.
To test the
BULK INSERT command, use the Address.csv file that’s part of the build script to
load the
AdventureWorks sample database. It’s probably already on your hard drive or it can
be downloaded from MSDN. The 4MB file has 19,614 rows of address data — that’s small by
ETL norms.
The following batch bulk inserts from the Address.csv file in the AdventureWorks directory into the AWAddressStaging table:
Use Tempdb;
CREATE TABLE AWAddressStaging (
ID INT,
Address VARCHAR(500),
City VARCHAR(500),
Region VARCHAR(500),
PostalCode VARCHAR(500),
GUID VARCHAR(500),
Updated DATETIME
);
BULK INSERT AWAddressStaging
   FROM 'C:\Program Files\Microsoft SQL Server\90\Tools\Samples\AdventureWorks OLTP\Address.csv'
   WITH (FIRSTROW = 1, ROWTERMINATOR = '\n');
On my Dell notebook, the BULK INSERT completes in less than a half-second.
The first thing to understand about
BULK INSERT is that every column from the source table is sim-
ply inserted directly into the destination table using a one-to-one mapping. The first column from the
source file is dumped into the first column of the destination table. Each column lines up. If there are too many columns in the destination table, then the bulk insert will fail. However, if there are not enough columns in the destination table, then BULK INSERT will work, as the extra data is placed into the bit bucket and simply discarded.
The BULK INSERT command won’t accept a string concatenation or variable in the FROM
parameter, so if you’re assembling the string of the file location and the filename, then
you need to assemble a dynamic SQL statement to execute the
BULK INSERT. Building and executing
dynamic SQL is covered in Chapter 29, ‘‘Dynamic SQL and Code Generation,’’ which contains many
examples.
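A minimal sketch of that workaround, assuming the file path arrives in a variable (the variable names and path are illustrative):

-- The file path can't be a variable in BULK INSERT itself,
-- so build the whole statement as a string and execute it
DECLARE @FilePath NVARCHAR(260) = N'C:\ETL\Address.csv',
        @SQLStr   NVARCHAR(1000);

SET @SQLStr = N'BULK INSERT dbo.AWAddressStaging FROM '''
            + @FilePath
            + N''' WITH (ROWTERMINATOR = ''\n'');';

PRINT @SQLStr;               -- echo the statement during development
EXEC sp_executesql @SQLStr;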
Best Practice
Because BULK INSERT is dependent on the column position of both the source file and the destination table, it is best practice to use a view as an abstraction layer between the BULK INSERT command and the table. If the structure of either the source file or the destination table is altered, then modifying the view can keep the BULK INSERT running without having to change the other object's structure. A sketch of this pattern follows this box.
Another best practice is to BULK INSERT the data into a staging table, check the data, and then perform the rest of the transformations as you merge the data into the permanent tables. As long as you don't mind copying the data twice, this works well.
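Here's a minimal sketch of the view approach, continuing with the staging table created earlier; the view name and file path are illustrative assumptions:

-- A view exposing only the columns (and order) that the source file supplies
CREATE VIEW dbo.vAWAddressLoad
AS
SELECT ID, Address, City, Region, PostalCode, GUID, Updated
  FROM dbo.AWAddressStaging;
GO

-- Bulk insert through the view; if the table changes, adjust the view instead
BULK INSERT dbo.vAWAddressLoad
   FROM 'C:\ETL\Address.csv'
   WITH (ROWTERMINATOR = '\n');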
Bulk Insert Options
In practice, I’ve always needed to use some options when using BULK INSERT:
■ Field Terminator specifies the character used to delimit or separate columns in the source
file. The default, of course, is a comma, but I’ve also seen the pipe character (|) used in
production.
■ Row Terminator specifies the character that ends a row in the source file. '\n' means end of row and is the typical setting. However, files from mainframes or other systems sometimes don't use a clean end of line. In these cases, use a hex editor to view the actual end-of-line characters and specify the row terminator in hex. For example, a hex value of 0A is coded as follows:
ROWTERMINATOR = '0x0A'
■ FirstRow is useful when the incoming file has column headers: set it to 2 to indicate that the data actually begins on the second row of the file.
■ TabLock places an exclusive lock on the entire table and saves SQL Server the trouble of
having to lock the table’s data pages being created. This option can dramatically improve
performance, but at the cost of blocking data readers during the bulk insert. If the bulk insert
is part of an ETL into a staging table, then there’s no problem, but if it’s a bulk insert into
a production system with potential users selecting data, then this might not be a good idea.
Multiple bulk-import streams can potentially block each other. To prevent this, SQL Server provides a special internal lock, called a bulk-update (BU) lock. To get a BU lock, specify the TABLOCK option on each bulk-import stream; the streams can then load the table in parallel without blocking one another.
■ Rows per Batch tells SQL Server to insert n number of rows in a single batch, rather than
the entire file. Tweaking the batch size can improve performance. I’ve found that beginning
with 100 and then experimenting to find the best size for the particular set of data works best.
This helps performance because the logging is done less often. Too many rows, however, often
exceed memory cache and may create waits. In my experience, 2,000 rows is often the best
number.
■ Max Errors specifies how many rows can fail before the bulk insert fails. Depending on the business requirement for the data, you may need to set this to zero.
■ The Errorfile option points to a file that will collect any rows not accepted by the bulk insert
operation. This is a great idea and should be used with every
BULK INSERT command in
production.
Other options, which I've never needed in production, include Check_Constraints, CodePage, DataFileType, Fire_Triggers, KeepIdentity, KeepNulls, Kilobytes_per_batch, and Order. (The best practice of bulk inserting into a staging table and then performing the ETL merge into the permanent tables makes these options less useful.) A sketch combining the more common options follows.
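Here's a minimal sketch pulling together the options discussed above; the pipe delimiter, batch size, and file paths are illustrative assumptions:

-- A BULK INSERT using the commonly needed options
-- (delimiter, paths, and batch size are illustrative assumptions)
BULK INSERT dbo.AWAddressStaging
   FROM 'C:\ETL\Address.psv'
   WITH (
      FIELDTERMINATOR = '|',       -- pipe-delimited source file
      ROWTERMINATOR   = '\n',
      FIRSTROW        = 2,         -- skip the column-header row
      TABLOCK,                     -- BU lock for faster, parallel loads
      ROWS_PER_BATCH  = 2000,
      MAXERRORS       = 0,         -- fail on the first bad row
      ERRORFILE       = 'C:\ETL\Address_errors.log'
   );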
Best Practice
BULK INSERT handles columns in the order they appear in the source comma-delimited file, and the
columns must be in the same order in the receiving SQL table. Bulk inserting into a view provides a data
abstraction layer so that any changes in column order won’t break the BULK INSERT code.
When developing a BULK INSERT statement, it’s generally useful to open the source file using Excel
and examine the data. Excel often reformats data, so it’s best not to save files in Excel. Sorting the data
by the columns can help find data formatting anomalies.
BCP
BCP, short for bulk copy program (or bulk copy Porsche — a reference among DBAs to its speed), is a
command-line variation of bulk operations. BCP differs from
BULK INSERT in that it is command-line
executed and can import or export data. It uses many of the same options as BULK INSERT. The basic syntax is as follows:
BCP destination table direction datafile options
For the destination, use the server name along with the complete three-part name (server and
database.schema.object). For a complete listing of the syntax, just type BCP at the command prompt.
Because this is an external program, it needs authorization to connect to SQL Server. You have two
options: Use the
-P password option and hard-code your password into the batch file script, or omit
the
-P, in which case it will prompt for a password. Neither is a very good option. You can also use
integrated security, which is usually considered the best practice.
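For example, here's a minimal command-line sketch of an export and a reload using integrated security; the server names and file path are illustrative assumptions (-T uses a trusted connection, -c uses character format, -S names the server):

REM Export the Person.Address table to a character-format data file
bcp AdventureWorks.Person.Address out C:\ETL\Address.dat -S MyServer -T -c

REM Import the same file into a copy of the table on a second server
bcp AdventureWorks.Person.Address in C:\ETL\Address.dat -S MyServer2 -T -c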
For straightforward ETL operations, I prefer using T-SQL and BULK INSERT. For complex ETL loads, Integration Services is great. To be frank, I have little use for automating ETL processes using DOS batch scripts and BCP, although PowerShell may make a believer of me yet.
Summary
This chapter explained a specific T-SQL command: BULK INSERT. Bulk operations provide the additional horsepower needed to import massive amounts of data by minimizing transaction logging and pumping the data directly to the table. The downside is that they complicate the recovery plan. The best way to perform a bulk operation is with the BULK INSERT T-SQL command.
The next chapter continues the theme of moving data with a personal favorite of mine, distributed
queries — when one SQL Server instance executes a query against a second SQL Server instance.
Executing Distributed Queries
IN THIS CHAPTER
Understanding distributed queries
Making the connection with remote data sources
T-SQL distributed queries
Pass-through queries
Two-phase commits and distributed transactions
Data is seldom in one place. In today's distributed world, most new
projects enhance, or at least connect to, existing data. That’s not a
problem; SQL Server can read and write data to most other data
sources. Heterogeneous joins can even merge SQL Server data with an Excel
spreadsheet.
SQL Server offers several methods for accessing data external to the current database. From simply referencing another local database to executing pass-through queries that engage an Oracle server, SQL Server can handle it.
Distributed Query Concepts
Linking to an external data source is nothing more than configuring the name of
the linked server, along with the necessary location and login information, so that
SQL Server can access data on the linked server.
Linking is a one-way configuration, as illustrated in Figure 31-1. If Server A
links to Server B, then it means that Server A knows how to access and log into
Server B. As far as Server B is concerned, Server A is just another user.
If linking a server is a new concept to you, then it could easily be confused with
registering a server in Management Studio. As illustrated in Figure 31-1, Management Studio is only communicating with the servers as a client application.
Linking the servers enables SQL Server instance A to communicate directly with
SQL Server instance B.
Links can be established in Management Studio or with T-SQL code (which could, of course, be created by configuring the link in Management Studio and then generating a script). The latter has the advantage of repeatability in case a rebuild is necessary, although building the links in code requires more steps.
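As a rough sketch of the T-SQL route, which the rest of this chapter covers in detail (the server and instance names are illustrative assumptions):

-- Create a link from the local instance (Server A) to a remote instance (Server B)
-- using the SQL Server Native Client provider; names are illustrative
EXEC sp_addlinkedserver
     @server     = 'ServerB',
     @srvproduct = '',
     @provider   = 'SQLNCLI',
     @datasrc    = 'ServerB\SQL2008';

-- Map local logins to the remote server using their own (integrated) credentials
EXEC sp_addlinkedsrvlogin
     @rmtsrvname = 'ServerB',
     @useself    = 'true';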