15
Awk, Awk
By Ann Marshall
Overview
Uses
Features
Brief History
Fundamentals
Entering Awk from the Command Line
Files for Input
The Program File
Specifying Output on the Command Line
Patterns and Actions
Input
Fields
Program Format
A Note on awk Error Messages
Print Selected Fields
Program Components
The Input File and Program
Patterns
BEGIN and END
Expressions
String Matching
Range Patterns
Compound Patterns
Actions
Variables
Naming
Awk in a Shell Script
Built-in Variables
Conditions (No IFs, &&s or buts)
The if Statement
The Conditional Statement
Patterns as Conditions
Loops
Increment and Decrement
The While Statement
The Do Statement
The For Statement
Loop Control
Strings
Built-In String Functions
String Constants
Arrays
Array Specialties
Arithmetic
Operators
Numeric Functions
Input and Output
Input
The Getline Statement
Output
The printf Statement
Closing Files and Pipes
Command Line Arguments
Passing Command Line Arguments
Setting Variables on the Command Line
Functions
Function Definition
Parameters
Variables
Function Calls
The Return Statement
Writing Reports
BEGIN and END Revisited
The Built-in System Function
Advanced Concepts
Multi-Line Records
Multidimensional Arrays
Summary
Further Reading
Obtaining Source Code
15
Awk, Awk
By Ann Marshall
Overview
The UNIX utility awk is a pattern matching and processing language with considerably
more power than you may realize. It searches one or more specified files, checking for
records that match a specified pattern. If awk finds a match, the corresponding action is
performed. A simple concept, but it results in a powerful tool. Often an awk program is
only a few lines long, and because of this, an awk program is often written, used, and
discarded. A traditional programming language, such as Pascal or C, would take more
thought, more lines of code, and hence, more time. Short awk programs arise from two of
its built-in features: the amount of predefined flexibility and the number of details that are
handled by the language automatically. Together, these features allow the manipulation
of large data files in short (often single-line) programs, and make awk stand apart from
other programming languages. Certainly any time you spend learning awk will pay
dividends in improved productivity and efficiency.
Uses
The uses for awk vary from the simple to the complex. Originally awk was intended for
various kinds of data manipulation. Intentionally omitting parts of a file, counting
occurrences in a file, and writing reports are naturals for awk.
Awk uses the syntax of the C programming language, so if you know C, you have an idea
of awk syntax. If you are new to programming or don't know C, learning awk will
familiarize you with many of the C constructs.
Examples of where awk can be helpful abound. Computer-aided manufacturing, for
example, is plagued with nonstandardization: no standard format yet exists for the
programs that run the machines, so the output from Computer A running Machine A
probably is not the input needed for Computer B running Machine B. Although Machine
A is finished with the material, Machine B is not ready to accept it. Production halts while
someone edits the file so it meets Computer B's needed format. Rather than write a
complex C program, this kind of simple data transformation is a perfect awk task.
Due to the amount of built-in automation within awk, it is also useful for rapid
prototyping or trying out an idea that could later be implemented in another language.
Features
Reflecting the UNIX environment, awk's features resemble the structures of both C and
shell scripts. Highlights include flexibility, predefined variables, automation, standard
program constructs, conventional variable types, powerful output formatting borrowed
from C, and ease of use.
The flexibility means that most tasks may be done more than one way in awk. With the
application in mind, the programmer chooses which method to use. The built-in
variables already provide many of the tools to do what is needed. Awk is highly
automated. For instance, awk automatically retrieves each record, separates it into fields,
and does type conversion when needed without programmer request. Furthermore, there
are no variable declarations. Awk includes the "usual" programming constructs for the
control of program flow: an if statement for two way decisions and do, for and while
statements for looping. Awk also includes its own notational shorthand to ease typing.
(This is UNIX after all!) Awk borrows the printf() statement from C to allow "pretty" and
versatile formats for output. These features combine to make awk user friendly.
Brief History
Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977. (The
name is from the creators' last initials.) In 1985, more features were added, creating nawk
(new awk). For quite a while, nawk remained exclusively the property of AT&T Bell
Labs. Although it became part of System V with Release 3.1, some versions of UNIX, like
SunOS, keep both awk and nawk due to a syntax incompatibility; others, like System V,
run nawk under the name awk (although System V has nawk too). The Free Software
Foundation's GNU project introduced its own version of awk, gawk, based on the IEEE
POSIX awk standard (Institute of Electrical and Electronics Engineers, IEEE Standard for
Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities,
Volume 2, ANSI approved 4/5/93), which differs from both awk and nawk. Linux, the
freely available UNIX for PCs, uses gawk rather than awk or nawk. Throughout this
chapter I have used the word awk when any of the three versions will do. The versions are
mostly upwardly compatible: awk is the oldest, then nawk, then POSIX awk, then gawk,
as shown below. I have used the notation version++ to denote a concept that began in that
version and continues through any later versions.
NOTE: Because of the syntax differences, awk code cannot always be run unchanged under nawk.
However, except as noted, all the concepts of awk are implemented in nawk (and gawk).
Where it matters, I have specified the version.
Figure 15.1. The evolution of awk.
Refer to the end of the chapter for more information and further resources on awk and its
derivatives.
Fundamentals
This section introduces the basics of the awk programming language. Although my
discussion first skims the surface of each topic to familiarize you with how awk
functions, later sections of the chapter go into greater detail. One feature of awk that
almost continually holds true is this: you can do most tasks more than one way. The
command line exemplifies this. First, I explain the variety of ways awk may be called
from the command line—using files for input, the program file, and possibly an output
file. Next, I introduce the main construct of awk, which is the pattern action statement.
Then, I explain the fundamental ways awk can read and transform input. I conclude the
section with a look at the format of an awk program.
Entering Awk from the Command Line
In its simplest form, awk takes the material you want to process from standard input and
displays the results to standard output (the monitor). You write the awk program on the
command line. The following paragraphs show the various ways you can enter awk and input
material for processing.
You can either specify explicit awk statements on the command line, or, with the -f flag,
specify an awk program file that contains a series of awk commands. In addition to the
standard UNIX design allowing for standard input and output, you can, of course, use file
redirection in your shell, too, so awk < inputfile is functionally identical to awk inputfile.
To save the output in a file, again use file redirection: awk > outputfile does the trick.
Helpfully, awk can work with multiple input files at once if they are specified on the
command line.
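For instance (a minimal sketch; jan.sales and feb.sales are invented file names), the
following reads both files as if they were one continuous input stream:
$ awk '{print $1}' jan.sales feb.sales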
The most common way to see people use awk is as part of a command pipe, where it's
filtering the output of a command. An example is ls -l | awk '{print $3}' which would print
just the third column of each line of the ls command. Awk scripts can become quite
complex, so if you have a standard set of filter rules that you'd like to apply to a file, with
the output sent directly to the printer, you could use something like awk -f myawkscript
inputfile | lp.
TIP: If you opt to specify your awk script on the command line, you'll find it
best to use single quotes to let you use spaces and to ensure that the command shell
doesn't falsely interpret any portion of the command.
Files for Input
These input and output places can be changed if desired. You can specify an input file by
typing the name of the file after the program with a blank space between the two. If you do
not name an input file, input enters the awk environment from your workstation keyboard
(standard input). To signal the end of the input, press Ctrl+D. The program on the command
line executes on the input you just entered and the results are displayed on the monitor
(the standard output).
Here's a simple little awk command that echoes all lines I type, prefacing each with the
number of words (or fields, in awk parlance, hence the NF variable for number of fields)
in the line. (Note that Ctrl+d means that while holding down the Control key you should
press the d key).
$ awk '{print NF ": " $0}'
I am testing my typing.
A quick brown fox jumps when vexed by lazy ducks.
Ctrl+d
5: I am testing my typing.
10: A quick brown fox jumps when vexed by lazy ducks.
$ _
You can also name more than one input file on the command line, causing the combined
files to act as one input. This is also one way of making multiple passes through a single input file (name the same file more than once).
TIP: Keep in mind that the correct ordering on the command line is crucial for
your program to work correctly: files are read from left to right, so if you want to have
file1 and file2 read in that order, you'll need to specify them as such on the command
line.
The Program File
With awk's automatic type conversion, a file of names and a file of numbers entered in
the reverse order at the command line generate strange-looking output rather than an
error message. That is why for longer programs, it is simpler to put the program in a file
and specify the name of the file on the command line. The -f option does this. Notice that
this is an exception to the usual way UNIX handles options. Usually the options occur at
the end of a command; however, here an input file is the last parameter.
NOTE: Versions of awk that meet the POSIX awk specifications are allowed to
have multiple -f options. You can use this for running multiple programs using the same
input.
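For example (a sketch only; clean.awk and report.awk are hypothetical program files), a
POSIX awk can apply both programs in a single pass over the input:
awk -f clean.awk -f report.awk inputfile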
Specifying Output on the Command Line
Output from awk may be redirected to a file or piped to another program (see Chapter 4).
The command awk '/^5/ {print $0}' | grep 3, for example, will result in just those lines that
start with the digit five (that's what the awk part does) and also contain the digit three (the
grep command). If you wanted to save that output to a file, by contrast, you could use
awk '/^5/ {print $0}' > results and the file results would contain all lines prefaced by the
digit 5. If you opt for neither of these courses, the output of awk will be displayed on
your screen directly, which can be quite useful in many instances, particularly when
you're developing—or fine tuning—your awk script.
Patterns and Actions
Awk programs are divided into three main blocks: the BEGIN block, the per-statement
processing block, and the END block. Unless explicitly stated, all statements to awk
appear in the per-statement block (you'll see later where the other blocks can come in
particularly handy for programming, though).
Statements within awk are divided into two parts: a pattern, telling awk what to match,
and a corresponding action, telling awk what to do when a line matching the pattern is
found. The action part of a pattern action statement is enclosed in curly braces ({}) and
may be multiple statements. Either part of a pattern action statement may be omitted. An
action with no specified pattern matches every record of the input file you want to search
(that's how the earlier example of {print $0} worked). A pattern without an action
indicates that you want input records to be copied to the output file as they are (i.e.,
printed).
The earlier /^5/ {print $0} example is a two-part statement: the pattern here is
all lines that begin with the digit five (the ^ indicates that it should appear at the
beginning of the line: without it the pattern would say any line that includes the digit five)
and the action is print the entire line verbatim. ($0 is shorthand for the entire line.)
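As a quick illustration of each half on its own (a minimal sketch, not tied to any earlier
example; try each line as a one-line program by itself):
/error/              # pattern only: every record containing "error" is printed as is
{ print NF, $0 }     # action only: runs for every record, printing its field count first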
Input
Awk automatically scans, in order, each record of the input file looking for each pattern
action statement in the awk program. Unless otherwise set, awk assumes each record is a
single line. (See the section "Multi-Line Records" under "Advanced Concepts" for how to
change this.) If the input file has blank lines in it, the blank lines count as a record too.
Awk automatically retrieves each record for analysis; there is no read statement in awk.
A programmer may also disrupt the automatic input order in one of two ways: the next and
exit statements. The next statement tells awk to retrieve the next record from the input
file and continue without running the current input record through the remaining portion
of pattern action statements in the program. For example, if you are doing a crossword
puzzle and all the letters of a word are formed by previous words, most likely you
wouldn't even bother to read that clue but simply skip to the clue below; this is how the
next statement would work, if your list of clues were the input. The other method of
disrupting the usual flow of input is through the exit statement. The exit statement
transfers control to the END block—if one is specified—or quits the program, as if all the
input has been read; suppose the arrival of a friend ends your interest in the crossword
puzzle, but you still put the paper away. Within the END block, an exit statement causes
the program to quit.
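Here is a minimal sketch of both statements at work (the marker line END-OF-DATA is an
assumption made up for this example, not something awk looks for on its own):
/^#/            { next }      # comment lines never reach the statements below
/^END-OF-DATA$/ { exit }      # stop reading input when the marker line appears
                { print $1 }  # otherwise print the first field of the record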
An input record refers to the entire line of a file including any characters, spaces, or Tabs.
The spaces and tabs are called whitespace.
TIP: If you think that your input file may include both spaces and tabs, you
can save yourself a lot of confusion by ensuring that all tabs become spaces with the
expand program. It works like this: expand filename | awk '{ stuff }'.
The whitespace in the input file and the whitespace in the output file are not related and
any whitespace you want in the output file, you must explicitly put there.
Fields
A group of characters in the input record or output file is called a field. Fields are
predefined in awk: $1 is the first field, $2 is the second, $3 is the third, and so on. $0
indicates the entire line. Fields are separated by a field separator (any single character
including Tab), held in the variable FS. Unless you change it, FS has a space as its value.
FS may be changed by either starting the program file with the following statement:
BEGIN {FS = "char" }
or by setting the -Fchar command line option where char is the selected field separator
character you want to use.
One file that you might have viewed which demonstrates where changing the field
separator could be helpful is the /etc/passwd file that defines all user accounts. Rather
than having the different fields separated by spaces or tabs, the password file is structured
with lines:
news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
Each field is separated by a colon! You could change each colon to a space (with sed, for
example), but that wouldn't work too well: notice that the fifth field, USENET News,
contains a space already. Better to change the field separator. If you wanted to just have a
list of the fifth fields in each line, therefore, you could use the simple awk command awk
-F: '{print $5}' /etc/passwd.
Likewise, the built-in variable OFS holds the value of the output field separator. OFS also
has a default value of a space. It, too, may be changed by placing the following line at the
start of a program.
BEGIN {OFS = "char" }
If you want to automatically translate the passwd file so that it lists only the first and
fifth fields, separated by a tab, you can therefore use the awk script:
BEGIN { FS=":" ; OFS="\t" }
{ print $1, $5 }
Notice here that the script contains two blocks: the BEGIN block and the main per-input
line block. Also notice that most of the work is done automatically.
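Assuming the two-block script above were saved in a file (firstfifth.awk is an invented
name), you would run it like this; for the sample news line shown earlier, the output
would be:
$ awk -f firstfifth.awk /etc/passwd
news    USENET News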
Program Format
With a few noted exceptions, awk programs are free format. The interpreter ignores any
blank lines in a program file. Add them to improve the readability of your program
whenever you wish. The same is true for Tabs and spaces between operators and the parts
of a program. Therefore, these two lines are treated identically by the awk interpreter.
$4 == 2 {print "Two"}
$4 == 2 { print "Two" }
If more than one pattern action statement appears on a line, you'll need to separate them with a
semicolon, as shown above in the BEGIN block for the passwd file translator. If you stick
with one-command-per-line then you won't need to worry too much about the
semicolons. There are a couple of spots, however, where the semicolon must always be
used: before an else statement or when included in the syntax of a statement. (See the
"Loops" or "The Conditional Statement" sections.) However, you may always put a
semicolon at the end of a statement.
The other format restriction for awk programs is that at least the opening curly bracket of
the action half of a pattern action statement must be on the same line as the
accompanying pattern, if both pattern and action exist. Thus, the following examples all do
the same thing.
The first shows all statements on one line:
$2==0 {print ""; print ""; print "";}
The second with the first statement on the same line as the pattern to match:
$2==0 { print ""
print ""
print ""}
and finally as spread out as possible:
$2==0 {
print ""
print ""
print ""
}
When the second field of the input file is equal to 0, awk prints three blank lines to the
output file.
NOTE: Notice that print "" prints a blank line to the output file, whereas the
statement print alone prints the current input line.
When you look at an awk program file, you may also find commentary within. Anything
typed from a # to the end of the line is considered a comment and is ignored by awk.
They are notes to anyone reading the program to explain what is going on in words, not
computerese.
A Note on awk Error Messages
Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity of
the program, a typo is easily found. Not all errors are as obvious; I have scattered some
examples of errors throughout this chapter.
Print Selected Fields
Awk includes three ways to specify printing. The first is implied. A pattern without an
action assumes that the action is to print. The two ways of actively commanding awk to
print are print and printf(). For now, I am going to stick to using only implied printing
and the print statement. printf is discussed in a later section ("Input and Output") and is used
mainly for precise output. This section demonstrates the first two types of printing
through some step-by-step examples.
Program Components
If I want to be sure the System Administrator spelled my name correctly in the
/etc/passwd file, I enter an awk command to find a match but omit an action. The
following command line puts a list on-screen.
$ awk '/Ann/' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn
O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann
McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn
Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz,
:/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker
:/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann
Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _
I look on the monitor and see the correct spelling.
ERROR NOTE: For the sake of making a point, suppose I had chosen the pattern
/Anne/. A quick glance above shows that there would be no matches. Entering awk
'/Anne/' /etc/passwd will therefore produce nothing but another system prompt to the
monitor. This can be confusing if you expect output. The same goes the other way;
above, I wanted the name Ann, but names like DeAnn, Annie, JoAnn, and DeAnna matched, too.
Sometimes choosing a pattern too long or too short can cause an unneeded headache.
TIP: If a pattern match is not found, look for a typo in the pattern you are
trying to match.
Printing specified fields of an ASCII (plain text) file is a straightforward awk task.
Because this program example is so short, only the input is in a file. The first input file,
"sales", is a file of car sales by month. The file consists of each salesperson's name,
followed by monthly sales figures. The last field is a running total of that person's
sales.
The Input File and Program
$ cat sales
John Anderson,12,23,7,42
Joe Turner,10,25,15,50
Susan Greco,15,13,18,46
Bob Burmeister,8,21,17,46
The following command line prints the salesperson's name and the total sales for the first
quarter.
awk -F, '{print $1,$5}' sales
John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46
A comma (,) between field variables indicates that I want OFS applied between output
fields as shown in a previous example. Remember without the comma, no field separator
will be used, and the displayed output fields (or output file) will all run together.
TIP: Putting two commas in a row inside a print statement creates a
syntax error; however, using the same field twice in a single print
statement is valid syntax. For example:
awk '{print($1,$1)}'
Patterns
A pattern is the first half of an awk program statement. In awk there are six accepted
pattern types. This section discusses each of the six in detail. You have already seen a
couple of them, including BEGIN, and a specified, slash-delimited pattern, in use. Awk
has many string matching capabilities arising from patterns, and the use of regular
expressions in patterns. A range pattern locates a sequence. All patterns except range
patterns may be combined in a compound pattern.
I began the chapter by saying awk was a pattern-match and process language. This
section explores exactly what is meant by a pattern match. As you'll see, what kind of
pattern you can match depends on exactly how you're using the awk pattern specification
notation.
BEGIN and END
The two special patterns BEGIN and END may be used to indicate a match before the
first input record is read or after the last input record is read, respectively. Some
versions of awk require that, if used, BEGIN must be the first pattern of the program and,
if used, END must be the last pattern of the program. While not necessarily a
requirement, it is nonetheless an excellent habit to get into, so I encourage you to do so,
as I do throughout this chapter. Using the BEGIN pattern for initializing variables is
common (although variables can be passed from the command line to the program too;
see "Command Line Arguments") The END pattern is used for things which are input-
dependent such as totals.
If I want to know how many lines are in a given program, I type the following line:
$ awk 'END {print "Total lines: " NR}' myprogram
I see Total lines: 256 on the monitor and therefore know that the file myprogram has 256
lines. At any point while awk is processing the file, the variable NR counts the number of
records read so far. NR at the end of a file has a value equal to the number of lines in the
file.
How might you see a BEGIN block in use? Your first thought might be to initialize
variables, but if it's a numeric value, it's automatically initialized to zero before its first
use. Instead, perhaps you're building a table of data and want to have some columnar
headings. With this in mind, here's a simple awk script that shows you all the accounts
that people named Dave have on your computer:
BEGIN {
FS=":"      # remember that the passwd file uses colons
OFS="\t"    # we're setting the output field separator to a tab
print "Account","Username"
}
/Dav/ {print $1, $5}
Here's what it looks like in action (we've called this file daves.awk, though the
program matches Dave and David, of course):
$ awk -f daves.awk /etc/passwd
Account Username
andrews Dave Andrews
d3 David Douglas Dunlap
daves Dave Smith
taylor Dave Taylor
Note that you could also easily have a summary of the total number of matched accounts
by adding a variable that's incremented for each match and then printed in the END
block. Here's one way to do it:
BEGIN { FS=":" ; OFS="\t"   # input colon separated, output tab separated
print "Account","Username"
}
/Dav/ {print $1, $5 ; matches++ }
END { print "A total of " matches " matches." }
Here you can see how awk allows you to shorten the length of programs by having
multiple items on a single line, particularly useful for initialization. Also notice the C
increment notation: matches++ is functionally identical to matches = matches + 1.
Finally, also notice that we didn't have to initialize the variable matches to zero since it
was done for us automatically by the awk system.
Expressions
Any expression may be used as a pattern in awk. An expression consists of operators and
their corresponding operands, and an expression used as a pattern forms a pattern-match statement.
Type conversion—variables being interpreted as numbers at one point, but strings at
another—is automatic, but never explicit. The type of operand needed is decided by the
operator type. If a numeric operator is given a string operand, it is converted and vice
versa.
TIP: To force a conversion from string to number, add 0 to the variable. If you
wish to explicitly convert a number to a string, concatenate "" (the null string) to
the variable. Two quick examples: num=3; num=num "" creates a new numeric variable
and sets it to the number three, then, by appending a null string to it, translates it to a
string (the string containing the character 3). Adding zero to that string, as in
num=num + 0, forces it back to a numeric value.
Any expression can be a pattern. If the pattern, in this case the expression, evaluates to a
nonzero or nonnull value, then the pattern matches that input record. Patterns often
involve comparison. The following are the valid awk comparison operators:
Table 15.1. Comparison Operators in awk.
Operator
Meaning
==
is equal to
<
less than
>
greater than
<=
less than or equal to
>=
greater than or equal to
!=
not equal to
~
matched by
!~
not matched by
In awk, as in C, the equality operator is == rather than =. The single = is an assignment
operator, whereas == compares values. When the pattern is a comparison, the
pattern matches if the comparison is true (non-null or non-zero). Here's an example: what
if you wanted to only print lines where the first field had a numeric value of less than
twenty? No problem in awk:
$1 < 20 {print $0}
If the expression is arithmetic, it is matched when it evaluates to a nonzero number. For
example, here's a small program that will print the first ten lines that have exactly seven
words:
BEGIN {i=0}
NF==7 { print $0 ; i++ }
i==10 {exit}
There's another way that you could use these comparisons too, since awk understands
collation orders (that is, whether words are greater or lesser than other words in a
standard dictionary ordering). Consider the situation where you have a phone directory—
a sorted list of names—in a file and want to print all the names that would appear in the
corporate phonebook before a certain person, say D. Hughes. You could do this quite
succinctly:
$1 >= "Hughes,D" { exit }
When the pattern is a string, a match occurs if the expression is non-null. In the earlier
example with the pattern /Ann/, it was assumed to be a string since it was enclosed in
slashes. In a comparison expression, if both operands have a numeric value, the
comparison is based on the numeric value. Otherwise, the comparison is made using
string ordering, which is why this simple example works.
TIP: You can write more than two comparisons to a line in awk.
The pattern $2 <= $1 could involve either a numeric comparison or a string comparison.
Which it is may vary from file to file, or even from record to record within the
same file.
TIP: Know your input file well when using such patterns, particularly since
awk will often silently assume a type for the variable and work with it, without error
messages or other warnings.
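A throwaway sketch of the ambiguity, using two invented input records, shows the
difference. Only the first record is printed: 9 and 10 look numeric, so 9 <= 10 is true, while
"9a" and "10b" are compared as strings and "9a" sorts after "10b".
$ printf "10 9\n10b 9a\n" | awk '$2 <= $1'
10 9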
String Matching
There are three forms of string matching. The simplest is to surround a string by slashes
(/). No quotation marks are used. Hence /"Ann"/ matches the string "Ann" complete with
its quotation marks, not the string Ann, so in the passwd example it would return no lines. The entire input record is returned if the
expression within the slashes is anywhere in the record. The other two matching
operators have a more specific scope. The operator ~ means "is matched by," and the
pattern matches when the input field being tested for a match contains the substring on
the right hand side.
$2 ~ /mm/
This example matches every input record containing mm somewhere in the second field.
It could also be written as $2 ~ "mm".
The other operator !~ means "is not matched by."
$2 !~ /mm/
This example matches every input record not containing mm anywhere in the second
field.
Armed with that explanation, you can now see that /Ann/ is really just shorthand for the
more complex statement $0 ~ /Ann/.
Regular expressions are common to UNIX, and they come in two main flavors. You have
probably used them unconsciously on the command line as wildcards, where * matches
zero or more characters and ? matches any single character. For instance entering the first
line below results in the command interpreter matching all files with the suffix abc and
the rm command deleting them.
rm *abc
Awk works with regular expressions that are similar to those used with grep, sed, and
other editors but subtly different than the wildcards used with the command shell. In
particular, . matches a character and * matches zero or more of the previous character in
the pattern (so a pattern of x*y will match anything that has any number of the letter x
followed by a y. To force a single x to appear too, you'd need to use the regular
expression xx*y instead). By default, patterns can appear anywhere on the line, so to
have them tied to an edge, you need to use ^ to indicate the beginning of the word or line,
and $ for the end. If you wanted to match all lines where the first word ends in abc, for
example, you could use $1 ~ /abc$/. The following line matches all records where the
fourth field begins with the letter a:
$4 ~ /^a.*/
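Along the same lines (another small sketch, not tied to any earlier data file), this pattern
matches every record whose last field ends in a digit:
$NF ~ /[0-9]$/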
Range Patterns
The pattern portion of a pattern/action pair may also consist of two patterns separated by
a comma (,); the action is performed for all lines between the first occurrence of the first
pattern and the next occurrence of the second.
At most companies, employees receive different benefits according to their respective
hire dates. It so happens that I have a file listing all employees in my company, including
hire date. If I wanted to write an awk program that just lists the employees hired between
1980 and 1987 I could use the following script, if the first field is the employee's name
and the third field is the year hired. Here's how that data file might look (notice that I use
: to separate fields so that we don't have to worry about the spaces in the employee
names)
$ cat emp.data
John Anderson:sales:1980
Joe Turner:marketing:1982
Susan Greco:sales:1985
Ike Turner:pr:1988
Bob Burmeister:accounting:1991
The program could then be invoked:
$ awk -F: '$3 > 1980,$3 < 1987 {print $1, $3}' emp.data
With the output:
John Anderson 1980
Joe Turner 1982
Susan Greco 1985
TIP: The above example works because the input is already in order according
to hire year. Range patterns often work best with pre-sorted input. This particular data file
would be a bit tricky to sort within UNIX, but you could use the rather complex
command sort -t: +2 -3 -n emp.data > new.emp.data to sort things correctly. (See
Chapter 6 for more details on using the powerful sort command.)
Notice range patterns are inclusive—they include both the first item matched and the end
data indicated in the pattern. The range pattern matches all records from the first
occurrence of the first pattern to the first occurrence of the second. This is a subtle point,
but it has a major effect on how range patterns work. First, if the second pattern is never
found, all remaining records match. So given the input file below:
$ cat sample.data
1
3
5
7
9
11
The following output appears on the monitor, totally disregarding that 9 and 11 are out of
range.
$ awk '$1==3, $1==8' sample.data
3
5
7
9
11
The end pattern of a range is not equivalent to a <= operand, though liberal use of these
patterns can alleviate the problem, as shown in the employee hire date example above.
Secondly, once the second pattern has matched and closed the range, a new range does not
begin until the first pattern matches again. That's why you have to make sure that the data is sorted as you
expect.
CAUTION: Range patterns cannot be parts of a larger pattern.
A more useful example of the range pattern comes from awk's ability to handle multiple
input files. I have a function finder program that finds code segments I know exist and
tells me where they are. The code segments for a particular function X, for example, are
bracketed by the phrase "function X" at the beginning and } /* end of X at the end. It can
be expressed as the awk pattern range:
'/function functionname/,/} \/\* end of functionname/'
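For a function named parse_header (an invented name, purely for illustration), the search
might be run across several source files at once:
awk '/function parse_header/,/} \/\* end of parse_header/' file1.c file2.c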
Compound Patterns
Patterns can be combined using the following logical operators and parentheses as
needed.
Table 15.2. The Logical Operators in awk.
Operator
Meaning
!
not
||
or (you can also use | in regular expressions)
&&
and
The pattern may be simple or quite complicated: (NF<3) || (NF >4). This matches all
input records that do not have three or four fields. As is usual in awk, there are a wide variety
of ways to do the same thing (specify a pattern). Regular expressions are allowed in
string matching, but their use is not forced. To form a pattern that matches strings
beginning with a or b or c or d, there are several pattern options:
/^[a-d].*/
/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/
NOTE: When using range patterns: $1==2, $1==4 and $1>= 2 && $1 <=4 are not
the same ranges at all. First, the range pattern depends on the occurrence of the second
pattern as a stop marker, not on the value indicated in the range. Secondly, as I mentioned
earlier, once the stop marker is found the range closes; a new range begins only when the first pattern matches again.
For instance, consider the following simple input file:
$ cat mydata
1 0
3 1
4 1
5 1
7 0
4 2
5 2
1 0
4 3
The first range I try, '$1==3,$1==5', produces:
$ awk '$1==3,$1==5' mydata
3 1
4 1
5 1
Compare this to the following pattern and output.
$ awk '$1>=3 && $1<=5' mydata
3 1
4 1
5 1
4 2
5 2
4 3
Range patterns cannot be parts of a combined pattern.
Actions
The remainder of this chapter explores the action part of a pattern action statement. As
the name suggests, the action part tells awk what to do when a pattern is found. Patterns
are optional. An awk program built solely of actions looks like other iterative
programming languages. But looks are deceptive—even without a pattern, awk matches
every input record to the first pattern action statement before moving to the second.
Actions must be enclosed in curly braces ({}) whether accompanied by a pattern or alone.
An action part may consist of multiple statements. When the statements have no pattern
and are single statements (no compound loops or conditions), brackets for each individual
action are optional provided the actions begin with a left curly brace and end with a right
curly brace. Consider the following two action pieces:
{name = $1
print name}
and
{name = $1}
{print name}
These two produce identical output.
Variables
An integral part of any programming language are variables, the virtual boxes within
which you can store values, count things, and more. In this section, I talk about variables
in awk. Awk has three types of variables: user-defined variables, field variables, and
predefined variables that are provided by the language automatically. The next section is
devoted to a discussion of built-in variables. Awk doesn't have variable declarations. A
variable comes to life the first time it is mentioned; in a twist on René Descartes'
philosophical conundrum, you use it, therefore it is. The section concludes with an
example of turning an awk program into a shell script.
CAUTION: Since there are no declarations, be doubly careful to initialize all the
variables you use, though you can always be sure that they automatically start with the
value zero.
Naming
The rule for naming user-defined variables is that they can be any combination of letters,
digits, and underscores, as long as the name starts with a letter. It is helpful to give a
variable a name indicative of its purpose in the program. Variables already defined by
awk are written in all uppercase. Since awk is case-sensitive, ofs is not the same variable
as OFS and capitalization (or lack thereof) is a common error. You have already seen
field variables—variables beginning with $, followed by a number, and indicating a
specific input field.
A variable is a number or a string or both. There is no type declaration, and type
conversion is automatic if needed. Recall the employee file emp.data used earlier. For
illustration, suppose I enter the program awk -F: '{ print $1 * 10 }' emp.data, and awk
obligingly provides the rest:
0
0
0
0
0
Of course, this makes no sense! The point is that awk did exactly what it was asked
without complaint: it multiplied the name of the employee times ten, and when it tried to
translate the name into a number for the mathematical operation it failed, resulting in a
zero. Ten times zero, needless to say, is zero.
Awk in a Shell Script
Before examining the next example, review what you know about shell programming
(Chapters 10-14). Remember, every file containing shell commands needs to be changed
to an executable file before you can run it as a shell script. To do this you should enter
chmod +x filename from the command line.
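As a minimal sketch of the idea (the file name weekly.sh and its contents are assumptions,
not a script from this chapter), an awk one-liner wrapped in a shell script might look like
this, reusing the first-quarter report from earlier:
$ cat weekly.sh
#!/bin/sh
# print each salesperson's name and running total from a comma-separated file
awk -F, '{print $1,$5}' "$1"
$ chmod +x weekly.sh
$ ./weekly.sh sales
John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46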
Sometimes awk's automatic type conversion benefits you. Imagine that I'm still trying to
build an office system with awk scripts and this time I want to be able to maintain a
running monthly sales total based on a data file that contains individual monthly sales. It
looks like this:
$ cat monthly.sales
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17
These need to be added together to calculate the running totals for each person's sales. Let
a program do it!
$ cat total.awk
BEGIN {FS=","; OFS=","}   # comma-separated input and output, to keep the file format the same
{print $1, " monthly sales summary: " $2+$3+$4 }
That's the awk script, so let's see how it works:
$ awk -f total.awk monthly.sales
John Anderson, monthly sales summary: 42
Joe Turner, monthly sales summary: 50
Susan Greco, monthly sales summary: 46
Bob Burmeister, monthly sales summary: 46
CAUTION: Always run your program once to be sure it works before you make it
part of a complicated shell script!
Your task has been reduced to entering the monthly sales figures in the monthly.sales file
and editing the program file total.awk to include the correct number of fields (if you use a
for loop, for(i=2;i<=NF;i++), the number of fields is handled automatically, but printing is
a hassle and needs an if statement with 12 else if clauses; see the sketch below).
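Here is a hedged sketch of that loop idea. It is not the chapter's total.awk; it simply totals
whatever numeric fields are present, sidestepping the printing issue just mentioned by
printing only the grand total for each line:
BEGIN {FS=","; OFS=","}
{ total = 0
  for (i = 2; i <= NF; i++)
      total = total + $i
  print $1, " monthly sales summary: " total
}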
In this case, not having to wonder if a digit is part of a string or a number is helpful. Just
keep an eye on the input data, since awk performs whatever actions you specify,
regardless of the actual data type with which you're working.
Built-in Variables
This section discusses the built-in variables found in awk. Because there are many
versions of awk, I included notes for those variables found in nawk, POSIX awk, and
gawk since they all differ. As before, unless otherwise noted, the variables of earlier
releases may be found in the later implementations. Awk was released first and contains
the core set of built-in variables used by all updates. Nawk expands the set. The POSIX
awk specification encompasses all variables defined in nawk plus one additional variable.
Gawk applies the POSIX awk standards and then adds some built-in variables which are
found in gawk alone; the built-in variables noted when discussing gawk are unique to
gawk. This list is a guideline, not a hard-and-fast rule. For instance, the built-in variable
ENVIRON is formally introduced in the POSIX awk specifications; it exists in gawk; it is
also in the System V implementation of nawk, but SunOS nawk doesn't have the
variable ENVIRON. (See the section "'Oh man! I need help.'" in Chapter 5 for more
information on how to use man pages.)
As I stated earlier, awk is case sensitive. In all implementations of awk, built-in variables
are written entirely in upper case.
Built-in Variables for Awk
When awk first became a part of UNIX, the built-in variables were the bare essentials. As
the name indicates, the variable FILENAME holds the name of the current input file.
Recall the function finder code; add the lines below:
/function functionname/,/} \/\* end of functionname/ {print $0}
END {print ""; print "Found in the file " FILENAME}
This adds the finishing touch.
The value of the variable FS determines the input field separator. FS has a space as its
default value. The built-in variable NF contains the number of fields in the current record
(remember, fields are akin to words, and records are input lines). This value may change
for each input record.
What happens if within an awk script I have the following statement?
$3 = "Third field"
It reassigns $3 and all other field variables, also reassigning NF to the new value. The
total number of records read may be found in the variable NR. The variable OFS holds
the value for the output field separator. The default value of OFS is a space. The value for
the output format for numbers resides in the variable OFMT which has a default value of
%.6g. This is the format specifier for the print statement, though its syntax comes from
the C printf format string. ORS is the output record separator. Unless changed, the value
of ORS is a newline (\n).
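To make the field-reassignment behavior described above concrete, here is a minimal
sketch (the input line is invented): assigning $3 on a two-field record raises NF to 3 and
rebuilds $0 with the new field attached.
$ echo "one two" | awk '{ $3 = "three"; print NF, $0 }'
3 one two three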
Built-in Variables for Nawk
NOTE: When awk was expanded in 1985, part of the expansion included adding
more built-in variables.
CAUTION: Some implementations of UNIX simply put the new code in the spot
for the old code and didn't bother keeping both awk and nawk. System V and SunOS
have both available. Linux has neither awk nor nawk but uses gawk. System V has both,
but the awk uses nawk expansions. The book The AWK Programming Language by the
awk authors speaks of awk throughout the book, but the programming language it
describes is called nawk on most systems.
The built-in variable ARGC holds the value for the number of command line arguments.
The variable ARGV is an array containing the command line arguments. Subscripts for
ARGV begin with 0 and continue through ARGC-1. ARGV[0] is always awk. Options
that awk itself processes (such as -f and -F) do not occupy ARGV. The variable FNR represents the number
of the current record within that input file. Like NR, this value changes with each new