Tải bản đầy đủ (.pdf) (10 trang)

Joe Celko s SQL for Smarties - Advanced SQL Programming P14 pps

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (149.61 KB, 10 trang )


102 CHAPTER 3: NUMERIC DATA IN SQL

Numbers in SQL are classified as either exact or approximate. An
exact numeric value has a precision,

p

, and a scale,

s

. The precision is a
positive integer that determines the number of significant digits in a
particular radix. The Standard says the radix can be either binary or
decimal, so you need to know what your implementation does. The scale
is a nonnegative integer that tells you how many decimal places the
number has.
Today, there are not that many base-ten platforms, so you probably
have a binary machine. However, a number can have one of many binary
representations—twos-complement; ones-complement; high-end, or
low-end—and various word sizes. The proper mental model of numbers
in SQL is not to worry about the “bits and bytes” level of the physical
representation, but to think in abstract terms.
The data types

NUMERIC

,

DECIMAL



,

INTEGER

,

BIGINT

, and

SMALLINT

are exact numeric types. An integer has a scale of zero, but
the syntax simply uses the word

INTEGER

or the abbreviation

INT

.

SMALLINT

has a scale of zero, but the range of values it can hold is less
than or equal to the range that

INTEGER


can hold in the
implementation. Likewise,

BIGINT

has a scale of zero, but the range of
values it can hold is greater than or equal to the range that

INTEGER

can
hold in the implementation.

BIGINT

was added in SQL-99, but had
been common in products before then.

DECIMAL(p,s)

can also be written

DEC(p,s)

. For example,

DECIMAL(8,2)

could be used to hold the number 123456.78, which

has eight significant digits and two decimal places.
The difference between

NUMERIC

and

DECIMAL

is subtle.

NUMERIC


specifies the exact precision and scale to be used.

DECIMAL

specifies the
exact scale, but the precision is implementation-defined to be equal to or
greater than the specified value.
Mainframe COBOL programmers can think of

NUMERIC

as a COBOL

PICTURE

numeric type, whereas


DECIMAL

is like a BCD. Personal
computer programmers these days probably have not seen anything like
this. You may find that many small-machine SQLs do not support

NUMERIC

or

DECIMAL

, because the programmers do not want to have to
have COBOL-style math routines that operate on character strings or an
internal decimal representation.
An approximate numeric value consists of a mantissa and an
exponent. The mantissa is a signed numeric value; the exponent is a
signed integer that specifies the magnitude of the mantissa. An
approximate numeric value has a precision. The precision is a positive
integer that specifies the number of significant binary digits in the

3.1 Numeric Types 103

mantissa. The value of an approximate numeric value is the mantissa
multiplied by 10 to the exponent.

FLOAT(p)

,


REAL

, and

DOUBLE
PRECISION

are the approximate numeric types. There is a subtle
difference between

FLOAT(p)

, which has a binary precision equal to or
greater than the value given, and

REAL

, which has an implementation-
defined precision.
The IEEE Standard 754 for floating point numbers is the most
common representation today. It is binary and uses 32 bits for single
precision and 64 bits for double precision, which is just right for Intel-
based PCs, Macintoshes, and most UNIX platforms. Its math functions
are available burned into processor chips, so they will run faster than a
software implementation.
The range for single precision numbers is approximately ± 10^-44.85
to 10^38.53, and for double precision, approximately ± 10^-323.3 to
10^308.3. However, there are some special values in the Standard.
Zero cannot be directly represented in this format, so it is modeled as

a special value denoted with an exponent field of zero and a fraction field
of zero. The sign field can make this either



0 and +0, which are distinct
values that compare as equal.
If the exponent is all zeroes, but the fraction is nonzero (else it would
be interpreted as zero), then the value is a denormalized number, which
is not assumed to have a leading 1 before the binary point. Thus, this
represents a number (-

s

* 0.

f

* 2



126), where

s

is the sign bit and

f


is
the fraction. For double precision, denormalized numbers are of the
form (-

s

* 0.

f

* 2



1022). You can interpret zero as a special type of
denormalized number.
The two values “+infinity” and “



infinity” are denoted with an
exponent of all ones and a fraction of all zeroes. The sign bit
distinguishes between negative infinity and positive infinity. Being able
to denote infinity as a specific value is useful, because it allows
operations to continue past overflow situations. Operations with infinite
values are well defined in the IEEE floating point.
The value NaN (Not a Number) is used to represent a bit
configuration that does not represent a number. NaNs are represented
by a bit pattern with an exponent of all ones and a nonzero fraction.
There are two categories of NaN: QNaN (Quiet NaN) and SNaN

(Signalling NaN).
A QNaN is a NaN with the most significant fraction bit set. QNaNs
propagate freely through most arithmetic operations. These values pop
out of an operation when the result is not mathematically defined, like
division by zero.

104 CHAPTER 3: NUMERIC DATA IN SQL

An SNaN is a NaN with the most significant fraction bit clear. It is
used to signal an exception when used in operations. SNaNs can be
handy to assign to uninitialized variables to trap premature usage.
Semantically, QNaNs denote indeterminate operations, while SNaNs
denote invalid operations.
SQL has not accepted the IEEE model for mathematics for several
reasons. Much of the SQL Standard allows implementation-defined
rounding, truncation, and precision to avoid limiting the language to
particular hardware platforms. If the IEEE rules for math were allowed in
SQL, then we would need type conversion rules for infinite and a way to
represent an infinite exact numeric value after the conversion. People
have enough trouble with

NULL

s, so let’s not go there.

3.1.1 BIT, BYTE, and BOOLEAN Data Types

Machine-level things like a

BIT


or

BYTE

data type have no place in SQL.
SQL is a high-level language; it is abstract and defined without regard to

physical

implementation. This basic principle of data modeling is called
data abstraction. Bits and bytes are the

lowest

units of hardware-specific
physical implementation you can get. Are you on a high-end or low-end
machine? Does the machine have 8-, 16-, 32-, 64-, or 128-bit words?
Twos-complement or ones-complement math? Hey, the SQL Standard
allows decimal machines, so bits do not exist at all!
What about

NULL

s in this data type? To be an SQL data type, you
have to have

NULL

s, so what is a


NULL

bit? By definition, a bit is in one
of two states, on or off, and has no

NULL

. If your vendor adds

NULLs

to
bit, how are the bit-wise operations defined? Oh, what a tangled web we
weave when first we mix logical and physical models.
What does the implementation of the host languages do with bits?
Did you know that +1, +0,



0 and



1 are all used for

BOOLEAN

s, but not
consistently? In C#, Boolean values are 0/1 for


FALSE

/

TRUE

, while
VB.NET has Boolean values of 0/-1 for

FALSE

/

TRUE

—and they are
proprietary languages from the same vendor. That means

all

the host
languages—present, future, and not-yet-defined—can be different.



There are usually two situations in practice. Either the bits are
individual attributes, or they are used as a vector to represent a single
attribute. In the case of a single attribute, the encoding is limited to two
values, which do not port to host languages or other SQLs, cannot be

easily understood by an end user, and cannot be expanded.
In the second case, what some newbies, who are still thinking in
terms of second- and third-generation programming languages or even
punch cards, do is build a vector for a series of “yes/no” status codes,

3.2 Numeric Type Conversion 105

failing to see the status vector as a single attribute. Did you ever play the
children’s game “20 Questions” when you were young?
Imagine you have six components for a loan approval, so you allocate
bits in your second-generation model of the world. You have 64 possible
vectors, but only 5 of them are valid (i.e., you cannot be rejected for
bankruptcy and still have good credit). For your data integrity, you can:
1. Ignore the problem. This is actually what

most newbies do.
2. Write elaborate
CHECK() constraints with user-defined
functions or proprietary bit-level library functions that cannot
port and that run like cold glue.
Now we add a seventh condition to the vector—which end does it go
on? Why? How did you get it in the right place on all the possible
hardware that it will ever use? Did all the code that references a bit in a
word by its position do it right after the change?
You need to sit down and think about how to design an encoding of
the data that is high-level, general enough to expand, abstract, and
portable. For example, is that loan approval a hierarchical code?
Concatenation code? Vector code? Did you provide codes for unknown,
missing, and N/A values? It is not easy to design such things!
3.2 Numeric Type Conversion

There are a few surprises in converting from one numeric type to
another. The SQL Standard left it up to the implementation to answer a
lot of basic questions, so the programmer has to know his package.
3.2.1 Rounding and Truncating
When an exact or approximate numeric value is assigned to an exact
numeric column, it may not fit. SQL says that the database engine will
use an approximation that preserves leading significant digits of the
original number after rounding or truncating. The choice of whether to
truncate or round is implementation-defined, however. This can lead to
some surprises when you have to shift data among SQL
implementations, or shift storage values from a host language program
into an SQL table. It is probably a good idea to create the columns with
more decimal places than you think you need.
Truncation is defined as truncation toward zero; this means that 1.5
would truncate to 1, and −1.5 would truncate to −1. This is not true for
106 CHAPTER 3: NUMERIC DATA IN SQL
all programming languages; everyone agrees on truncation toward zero
for the positive numbers, but you will find that negative numbers may
truncate away from zero (i.e., −1.5 would truncate to −2).
SQL is also indecisive about rounding, leaving the implementation
free to determine its method. There are two major types of rounding in
programming.
The scientific method looks at the digit to be removed. If this digit is
0, 1, 2, 3, or 4, you drop it and leave the higher-order digit to its left
unchanged. If the digit is 5, 6, 7, 8, or 9, you drop it and increment the
digit to its left. This method works with a small set of numbers and was
popular with FORTRAN programmers because it is what engineers use.
The commercial method looks at the digit to be removed. If this digit
is 0, 1, 2, 3, or 4, you drop it and leave the digit to its left unchanged. If
the digit is 6, 7, 8, or 9, you drop it and increment the digit to its left.

However, when the digit is 5, you want to have a rule that will round up
about half the time.
One rule is to look at the digit to the left: if it is odd, then leave it
unchanged; if it is even, increment it. There are other versions of the
decision rule, but they all try to make the rounding error as small as
possible. This method works with a large set of numbers, and it is popular
with bankers because it reduces the total rounding error in the system.
Another convention is to round to the nearest even number, so that
both 1.5 and 2.5 round to 2, and 3.5 and 4.5 both round to 4. This rule
keeps commercial rounding symmetric. The following expression uses the
MOD() function to determine whether you have an even number or not.
ROUND (CAST (amount - .0005 AS DECIMAL (14,4)) -
(CAST (MOD (CAST (amount*100.0 + .99 AS INTEGER), 2) AS
DECIMAL (14,4))/1000.0), 2);
In commercial transactions, you carry money amounts to four or
more decimal places, but round them to two decimal places for display.
This is a GAAP (Generally Accepted Accounting Practice) in the United
States for U.S. dollars and a law in Europe for working with euros.
Here is your first programming exercise for the notes you are making
on your SQL.
Generate a table of 5,000 random numbers, both positive and
negative, with four decimal places. Round the test data to two decimal
places, and total them using both methods.
Notice the difference, and save those results. Now load those same
numbers into a table in your SQL, like this:
3.2 Numeric Type Conversion 107
CREATE TABLE RoundTest
(original DECIMAL(10,4) NOT NULL,
rounded DECIMAL(10,2) NOT NULL);
insert the test data

INSERT INTO RoundTest (original) VALUES (2134.5678. 0.00);
etc.
UPDATE RoundTest SET rounded = original;
write a program to use both rounding methods
compare those results to this query
SELECT SUM(original), SUM(rounded)
FROM RoundTest;
Compare these results to those from the other two tests. Now you
know what your particular SQL is doing. Or, if you got a third answer,
there might be other things going on, which we will deal with in Chapter
21 on aggregate functions. We will postpone discussion here, but the
order of the rows in a
SUM() function can make a difference in
accumulated floating-point rounding error.
3.2.2 CAST() Function
SQL-92 defines the general CAST(<cast operand> AS <data
type>) function for all data type conversions, but most
implementations use several specific functions of their own for the
conversions they support. The SQL-92
CAST() function is not only
more general, but it also allows the
<cast operand> to be either a
<column name>, a <value expression>, or a NULL.
For numeric-to-numeric conversion, you can do anything you wish,
but you have to watch for the rounding errors. The comparison
predicates can hide automatic type conversions, so be careful. SQL
implementations will also have formatting options in their conversion
functions that are not part of the Standard. These functions either use a
PICTURE string, like COBOL or some versions of BASIC, or return their
results in a format set in an environment variable. This is very

implementation-dependent.
108 CHAPTER 3: NUMERIC DATA IN SQL
3.3 Four-Function Arithmetic
SQL is weaker than a pocket calculator. The dyadic arithmetic operators
+, −, *, and / stand for addition, subtraction, multiplication, and
division, respectively. The multiplication and division operators are of
equal precedence and are performed before the dyadic plus and minus
operators.
In algebra and in some programming languages, the precedence of
arithmetic operators is more restricted. They use the “My Dear Aunt
Sally” rule; that is, multiplication is done before division, which is done
before addition, which is done before subtraction. This practice can lead
to subtle errors.
For example, consider (largenum + largenum − largenum), where
largenum is the maximum value that can be represented in its numeric
data type. If you group the expression from left to right, you get
((largenum + largenum) − largenum) = overflow error! However, if you
group the expression from right to left, you get (largenum + (largenum
− largenum)) = largenum.
Because of these differences, an expression that worked one way in
the host language may get different results in SQL, and vice versa. SQL
could reorder the expressions to optimize them, but in practice, you will
find that many implementations will simply parse the expressions from
left to right. The best way to be safe is always to make extensive use of
parentheses in all expressions, whether they are in the host language or
in your SQL.
The monadic plus and minus signs are allowed and you can string as
many of them in front of a numeric value of variables as you like. The bad
news about this decision is that SQL also uses Ada-style comments,
which put the text of a comment line between a double dash and a new

line-character. This means that the parser has to figure out whether “ ”
is two minus signs or the start of a comment. Most versions of SQL also
support C-style comment brackets (i.e.,
/* comment text */). Such
brackets have been proposed in the SQL3 discussion papers, because
some international data transmission standards do not recognize a new
line in a transmission, and the double-dash convention will not work.
If both operands are exact numeric, the data type of the result is exact
numeric, as you would expect. Likewise, an approximate numeric in a
calculation will cast the results to approximate numeric. The kicker is in
how the results are assigned in precision and scale.
Let S1 and S2 be the scale of the first and second operands,
respectively. The precision of the result of addition and subtraction is
implementation-defined, and the scale is the maximum of S1 and S2.
3.4 Arithmetic and NULLs 109
The precision of the result of multiplication is implementation-defined,
and the scale is (S1 + S2). The precision and scale of the result of division
are implementation-defined, and so are some decisions about rounding
or truncating results.
The ANSI X3H2 committee debated about requiring precision and
scales in the standard and finally gave up. This means I can start losing
high-order digits, especially with a division operation, where it is
perfectly legal to make all results single-digit integers.
Nobody does anything that stupid in practice. In the real world, some
vendors allow you to adjust the number of decimal places as a system
parameter, some default to a few decimal places, and some display as
many decimal places as they can, so that you can round off to what you
want. You will simply have to learn what your implementation does by
experimenting with it.
Most vendors have extended this set of operators with other common

mathematical functions. The most common additional functions are
modulus, absolute value, power, and square root. But it is also possible
to find logarithms to different bases, and to perform exponential,
trigonometric, and other scientific, statistical, and mathematical
functions.
Precision and scale are implementation-defined for these functions,
of course, but they tend to follow the same design decisions as the
arithmetic did. The reason is obvious: they are using the same library
routines under the covers as the math package in the database engine.
3.4 Arithmetic and NULLs
NULLs are probably one of the most formidable database concepts for
the beginner. Chapter 6 is devoted to a detailed study of how
NULLs
work in SQL, but this section is concerned with how they act in
arithmetic expressions.
The
NULL in SQL is only one way of handling missing values. The
usual description of
NULLs is that they represent currently unknown
values that might be replaced later with real values when we know
something. This definition actually covers a lot of territory. The Interim
Report 75-02-08 to the ANSI X3 (SPARC Study Group 1975) showed 14
different kinds of incomplete data that could appear as the results of
operations or as attribute values. They included such things as arithmetic
underflow and overflow, division by zero, string truncation, raising zero
to the zeroth power, and other computational errors, as well as missing
or unknown values.
110 CHAPTER 3: NUMERIC DATA IN SQL
The NULL is a global creature, not belonging to any particular data
type but able to replace any of their values. This makes arithmetic a bit

easier to define. You have to specifically forbid
NULLs in a column by
declaring the column with a
NOT NULL constraint. But in SQL-92, you
can use the
CAST function to declare a specific data type for a NULL,
such as
CAST(NULL AS INTEGER). One reason for this convention is
completeness; another is to let you pass information about how to create
a column to the database engine.
The basic rule for math with
NULLs is that they propagate. An
arithmetic operation with a
NULL will return a NULL. That makes sense; if
a
NULL is a missing value, then you cannot determine the results of a
calculation with it. However, the expression (
NULL / 0) looks strange to
people. The first thought is that a division by zero should return an error;
if
NULL is a true missing value, there is no value to which it can resolve
and make that expression valid. However, SQL propagates the
NULL,
while a non-
NULL value divided by zero will cause a runtime error.
3.5 Converting Values to and from NULL
Since host languages do not support NULLs, the programmer can elect
either to replace them with another value that is expressible in the host
language or to use indicator variables to signal the host program to take
special actions for them.

3.5.1 NULLIF() Function
SQL specifies two functions, NULLIF() and the related COALESCE(),
that can be used to replace expressions with
NULL, and vice versa. They
are part of the
CASE expression family.
The
NULLIF(V1, V2) function has two parameters. It is equivalent
to the following
CASE expression:
NULLIF(V1, V2) := CASE
WHEN (V1 = V2)
THEN NULL
ELSE V1 END;
That is, when the first parameter is equal to the second, the function
returns a
NULL; otherwise, it returns the first parameter’s value. The
properties of this function allow you to use it for many purposes. The
important properties are these:
3.5 Converting Values to and from NULL 111
1. NULLIF(x, x) will return NULL for all values of x. This
includes
NULL, since (NULL = NULL) is UNKNOWN, not TRUE.
2.
NULLIF(0, (x - x)) will convert all non-NULL values of x
into
NULL.
But it will convert x
NULL into x zero, since (NULL - NULL)
is

NULL, and the equality test will fail.
3.
NULLIF(1, (x - x + 1)) will convert all non-NULL values
of x into
NULL.
But it will convert a
NULL into a one. This can be general-
ized for all numeric data types and values.
3.5.2 COALESCE() Function
The COALESCE(<value expression>, , <value
expression>) function scans the list of <value expression>s from
left to right, determines the highest data type in the list, and returns the
first non-
NULL value in the list, casting it to the highest data type. If all
the
<value expression>s are NULL, the result is NULL.
The most common use of this function is in a
SELECT list, where
there are columns that have to be added, but one can be a
NULL. For
example, to create a report of the total pay for each employee, you might
write this query:
SELECT emp_nbr, emp_name, (salary + commission) AS totalpay
FROM Employees;
But salesmen may work on commission only, or on a mix of salary
and commission. The office staff is on salary only. This means an
employee could have
NULLs in his salary or commission column, which
would propagate in the addition and produce a
NULL result. A better

solution would be:
SELECT emp_nbr, emp_name
(COALESCE(salary, 0) + COALESCE(commission, 0)) AS paycheck
FROM Employees;
A more elaborate use for the COALESCE() function is with aggregate
functions. Consider a table of customers’ purchases with a category code
and the amount of each purchase. You are to construct a query that will

×