awk, Your Programmable Report Generator

As the world of Linux grows ever more sophisticated, I occasionally like to remind myself of the importance of the fundamentals: the early principles and concepts that let humans bend those mighty computing machines to their will. One such early idea was the command line and all the helpful little programs you’d type in, before the days of window managers and GUIs.

One of my favorite command line programs is awk.

awk is a powerful text manipulation program. Henry McGilton and Rachel Morgan, in Introducing The Unix System (McGraw-Hill, 1983), referred to it as “a programmable report-generator”. With it you can search text for patterns, test fields with relational comparisons, or both. Input is either a text file or a text stream, possibly the output of another command such as ls.

Using awk

In its simplest form, awk prints the fields you specify as it works its way through a file. For example, say I save a listing of my current directory to a file with the following command:

rob$  ls -l > lines.txt

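The contents of the file might look like this. The field numbers used below assume this layout, with the date as a single field; if your ls prints dates in a form such as Feb 18, the file name moves to a later field.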

-rw-r--r--  1 rob  rob       24735 2013-02-18 16:37 0001036647.PDF
drwxr-xr-x  6 rob  rob        4096 2013-02-18 16:46 Calibre Library
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
-rw-------  1 rob  rob     8064327 2011-07-02 04:12 capt0000.nef
drwxr-xr-x  2 rob  rob        4096 2013-06-06 19:14 captivate-06062012
drwxr-xr-x  2 rob  rob        4096 2012-06-06 19:14 captivate-06062013
-rw-r--r--  1 rob  rob        5729 2011-06-13 12:12 writing.tjp~
-rw-r--r--  1 rob  rob      151552 2011-12-23 16:49 x264_2pass.log.temp
drwxr-xr-x  2 rob  rob        4096 2012-12-29 15:09 xformerroot
-rw-r--r--  1 rob  rob        8871 2013-01-28 17:01 X-Plane Installer Log.txt
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube

We could start with a pattern that matches every line. The pattern / / matches any line that contains a space, which is every line in this listing:

rob$  awk '/ /' lines.txt

The result is simply all the lines and fields in the file:

-rw-r--r--  1 rob  rob       24735 2013-02-18 16:37 0001036647.PDF
drwxr-xr-x  6 rob  rob        4096 2013-02-18 16:46 Calibre Library
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
-rw-------  1 rob  rob     8064327 2011-07-02 04:12 capt0000.nef
drwxr-xr-x  2 rob  rob        4096 2013-06-06 19:14 captivate-06062012
drwxr-xr-x  2 rob  rob        4096 2012-06-06 19:14 captivate-06062013
-rw-r--r--  1 rob  rob        5729 2011-06-13 12:12 writing.tjp~
-rw-r--r--  1 rob  rob      151552 2011-12-23 16:49 x264_2pass.log.temp
drwxr-xr-x  2 rob  rob        4096 2012-12-29 15:09 xformerroot
-rw-r--r--  1 rob  rob        8871 2013-01-28 17:01 X-Plane Installer Log.txt
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
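
Since / / is itself a pattern (any line containing a space), here is a minimal sketch of printing every line with no pattern at all. The file sample.txt is a hypothetical stand-in for lines.txt, with one extra line that has no space in it so the difference shows up.

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
-rw-r--r--  1 rob  rob       24735 2013-02-18 16:37 0001036647.PDF
nospacehere
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
EOF

# An action with no pattern runs on every line; print with no arguments prints the whole line
awk '{print}' sample.txt

# '/ /' only prints lines that contain a space, so "nospacehere" is skipped
awk '/ /' sample.txt
```

The first command prints all three lines; the second prints only the two that contain a space.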

Searching with awk

Let’s get a little more complex. This time, add a pattern to find a string anywhere in the lines.txt file.

rob$  awk '/2012/' lines.txt

The output looks like this.

-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
drwxr-xr-x  2 rob  rob        4096 2013-06-06 19:14 captivate-06062012
drwxr-xr-x  2 rob  rob        4096 2012-06-06 19:14 captivate-06062013
drwxr-xr-x  2 rob  rob        4096 2012-12-29 15:09 xformerroot
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
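
The text between the slashes is a regular expression, so it can do more than match a fixed string. As a rough sketch (sample.txt is a hypothetical file holding four lines of the listing), a pattern can be anchored to the start of the line or inverted:

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
drwxr-xr-x  2 rob  rob        4096 2012-12-29 15:09 xformerroot
-rw-r--r--  1 rob  rob        8871 2013-01-28 17:01 X-Plane Installer Log.txt
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
EOF

# ^ anchors the match to the start of the line: only directory entries begin with "d"
awk '/^d/' sample.txt

# ! inverts a pattern: print only the lines that do not contain 2012
awk '!/2012/' sample.txt
```

With this sample, the first command prints the xformerroot and youtube lines, and the second prints only the X-Plane line.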

For the moment, let’s switch gears and print out a couple of specific fields. Suppose we just want the dates and their associated file names. In the lines.txt file, those would be the sixth and the eighth fields. Refer to them with awk’s built-in field variables, $6 and $8, and print them with a space in double quotes in between, for clarity.

rob$  awk '{print $6 " " $8}' lines.txt

Here are the corresponding lines.

2013-02-18 0001036647.PDF
2013-02-18 Calibre
2012-06-20 capt0000.jpg
2011-07-02 capt0000.nef
2013-06-06 captivate-06062012
2012-06-06 captivate-06062013
2011-06-13 writing.tjp~
2011-12-23 x264_2pass.log.temp
2012-12-29 xformerroot
2013-01-28 X-Plane
2012-02-19 youtube

Note that you can also use characters or a string in between the double quotes, just the same as a space.
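
Here is a small sketch of a few separator choices. The file sample.txt is a hypothetical two-line stand-in for lines.txt; the comma form and the OFS variable are standard awk features that the commands above do not use.

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
EOF

# Any string can sit between the fields
awk '{print $6 " -> " $8}' sample.txt

# A comma in print inserts the output field separator (OFS), a single space by default
awk '{print $6, $8}' sample.txt

# Setting OFS with -v changes what the comma inserts
awk -v OFS=' | ' '{print $6, $8}' sample.txt
```

The comma form is handy once you print more than two or three fields, because the separator is then defined in one place.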

Remember that I’m using a pretty small lines.txt file. Your lines.txt file could be 10 MB in size. awk would handle that file without a problem. It just starts at the beginning and chugs through until it reaches the end, finding, pattern matching and printing as it goes.
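
One caveat about the date-and-name output: Calibre Library and X-Plane Installer Log.txt were cut short, because awk splits fields on whitespace and $8 holds only the first word of the name. A minimal sketch of one possible workaround is to join everything from field 8 through the last field, using the built-in NF (number of fields). It assumes names contain only single spaces, since any run of spaces collapses to one; sample.txt is again a hypothetical stand-in.

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
drwxr-xr-x  6 rob  rob        4096 2013-02-18 16:46 Calibre Library
-rw-r--r--  1 rob  rob        8871 2013-01-28 17:01 X-Plane Installer Log.txt
drwxr-xr-x  2 rob  rob        4096 2012-02-19 11:29 youtube
EOF

# NF is the number of fields on the current line; join fields 8..NF back together
awk '{
    name = $8
    for (i = 9; i <= NF; i++)
        name = name " " $i
    print $6 " " name
}' sample.txt
```

With this sample, the first line comes out as 2013-02-18 Calibre Library instead of stopping at Calibre.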

Next, combine the pattern search and field selection into a command. This time just select the file name, field 8.

awk '/2012/ {print $8}' lines.txt

capt0000.jpg
captivate-06062012
captivate-06062013
xformerroot
youtube

What the heck? We have a 2013 in the output! Don’t forget that /2012/ matches anywhere in a line, in any field, not just the date. Take a look when we print both fields 6 and 8.

awk '/2012/ {print $6 " " $8}' lines.txt

And the output.

2012-06-20 capt0000.jpg
2013-06-06 captivate-06062012
2012-06-06 captivate-06062013
2012-12-29 xformerroot
2012-02-19 youtube

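There’s the 2012: captivate-06062013 matched on its date, while captivate-06062012 matched on its name even though its date is in 2013. I only mention it because this kind of confusing situation is easy to create but sometimes tough to spot.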
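
One way to sidestep this, sketched here on a hypothetical three-line sample.txt, is to test only the date field instead of the whole line. The ~ operator matches a single field against a pattern.

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
drwxr-xr-x  2 rob  rob        4096 2013-06-06 19:14 captivate-06062012
drwxr-xr-x  2 rob  rob        4096 2012-06-06 19:14 captivate-06062013
EOF

# ~ tests one field against a pattern; ^ anchors the match to the start of that field
awk '$6 ~ /^2012/ {print $6 " " $8}' sample.txt
```

Now captivate-06062012, whose date is in 2013, drops out, and only the two entries actually dated 2012 are printed.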

Searching by Relationship

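A simple example of performing a relationship test might be the following. The dates are compared as strings, which works here because year-month-day dates sort in the same order alphabetically as they do chronologically.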

awk '$6 > "2012-06-06" {print $6 "  " $8}' lines.txt

2013-02-18  0001036647.PDF
2013-02-18  Calibre
2012-06-20  capt0000.jpg
2013-06-06  captivate-06062012
2012-12-29  xformerroot
2013-01-28  X-Plane

Meanwhile, a less-than comparison yields a different result. Note that captivate-06062013, dated exactly 2012-06-06, shows up in neither list.

awk '$6 < "2012-06-06" {print $6 "  " $8}' lines.txt

2011-07-02  capt0000.nef
2011-06-13  writing.tjp~
2011-12-23  x264_2pass.log.temp
2012-02-19  youtube
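
Relationship tests are not limited to dates. As a rough sketch (sample.txt is a hypothetical four-line stand-in for lines.txt), the size in field 5 can be compared as a number, and two tests can be combined with && to pick out a date range:

```shell
# sample.txt is a hypothetical stand-in for lines.txt
cat > sample.txt <<'EOF'
-rw-r--r--  1 rob  rob       24735 2013-02-18 16:37 0001036647.PDF
-rw-r--r--  1 rob  rob     4047331 2012-06-20 21:04 capt0000.jpg
-rw-------  1 rob  rob     8064327 2011-07-02 04:12 capt0000.nef
drwxr-xr-x  2 rob  rob        4096 2012-06-06 19:14 captivate-06062013
EOF

# Field 5 looks like a number, so it is compared numerically against 1000000
awk '$5 > 1000000 {print $5 " " $8}' sample.txt

# >= includes the boundary date; && requires both tests to hold
awk '$6 >= "2012-06-06" && $6 < "2013-01-01" {print $6 " " $8}' sample.txt
```

With this sample, the first command lists the two capt0000 files, and the second lists the two entries dated in 2012 on or after June 6.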

Conclusion

All those crazy database tools, word processors and such are great, but sometimes you just need something simple and fast. Command line tools like awk are the answer.

awk’s a great program for quick reports, including ones built from long text files. It has a bunch of options and many programmable features, a few of which we’ll discuss in future stories. Take a look at awk and I’m sure you’ll see many different opportunities to use this powerful tool on your Linux command line.

Comments

  1. BY awk command says:

    Can we use awk command for files which are not in the format of rows and columns?

    • BY perlfan says:

      The examples shown are using regular expressions, not column positions. In awk, perl, and other tools that use regex for searching or line splitting, it is a parsing issue. The “columns” in the example are not really character positions, but fields and delimiters, where by default the delimiters are spaces (and tabs) and the $1…$n variables are the runs of non-space characters in the data stream.

      So the answer is yes because that is what the examples are doing to a continuous stream of bytes in which certain characters (line-feed on Unix) are used to separate one row from another. The appearance of rows in the printed data is an illusion because any character can be defined as a row separator.

      Older languages like COBOL and C can efficiently use actual columns and rows, defined by discrete character positions, because the records are usually obtained by reading a fixed size data chunk into a structure that preassociates the columns in each record with variables, rather than reading a data stream and splitting the stream into fields one variable at a time. Multiple column definitions can be imposed on each record with overlapping structure definitions, thus allowing for different record types in the same file. This is why COBOL is so wordy, but the chunks do not have to be parsed out of the records. Back then computer time was more expensive than developer time.

      It is the difference between “stream” and “record” I/O if you want to research it further.
