mintCast 144: grepsedAWK

News:

[ 9:20] Developers at the Treasury Board of Canada create popular open source project. (wired.com)
[13:10] Fedora 18 – Spherical Cow will be released on Jan 15th. (linuxuser.co.uk)
[14:45] Lego goes Linux. (internetnews.com)
[19:02] IBM’s Watson undergoes brainwashing to forget all the naughty words in the Urban Dictionary. (techdirt.com)
[22:00] The White House responds to a petition calling for the construction of a Death Star. (techcrunch.com) (petitions.whitehouse.gov)

The Main Topic:

[26:50] Regular Expressions

Online Resources:

Regular Expressions Tutorial (regular-expressions.info)
Regular expressions – An introduction (aivosto.com)
An Introduction to Regular Expressions (codeproject.com)
Introduction to Regular Expressions (codular.com)
Regular Expressions (grymoire.com)

From Wikipedia: In computing, a regular expression is a specific pattern that provides concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.

Basic ideas:

In its simplest form, a regular expression is a string of symbols to match “as is” (e.g., “mint” would match those four characters).

Quantifiers let you match more than one character:

* matches any number of what’s before it, from zero to infinity.
? matches zero or one.
+ matches one or more.
{n} matches exactly “n” occurrences.
{n,m} matches at least “n” and not more than “m” occurrences.
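
As a rough illustration of the quantifiers above, the sketch below feeds a few made-up words to grep. It assumes grep's extended syntax (-E), where the quantifiers need no backslashes, and uses -x so that a pattern has to match the whole line. The sample words and variable names are invented for the example.

```shell
# Sample input, one word per line (invented for this example).
words=$(printf '%s\n' mnt mint miint miiint)

# "i*" allows zero or more "i" characters, so every word matches.
star=$(printf '%s\n' "$words" | grep -Ex 'mi*nt')

# "i?" allows zero or one "i".
optional=$(printf '%s\n' "$words" | grep -Ex 'mi?nt')

# "i+" requires at least one "i".
plus=$(printf '%s\n' "$words" | grep -Ex 'mi+nt')

# "i{2,3}" requires two or three "i" characters.
range=$(printf '%s\n' "$words" | grep -Ex 'mi{2,3}nt')

printf 'star:\n%s\n\noptional:\n%s\n\nplus:\n%s\n\nrange:\n%s\n' \
  "$star" "$optional" "$plus" "$range"
```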

Some special characters are used to match things:

. – The dot matches any single character.
\n – Matches a newline character (or CR+LF combination).
\t – Matches a tab (ASCII 9).
\d – Matches a digit [0-9].
\D – Matches a non-digit.
\w – Matches an alphanumeric character (or an underscore).
\W – Matches a non-alphanumeric character.
\s – Matches a whitespace character.
\S – Matches a non-whitespace character.
\ – “Escape” special characters. For example, \. matches a dot, and \\ matches a backslash.
^ – Match at the beginning of the input string.
$ – Match at the end of the input string.
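
A small sketch of the dot, the anchors and backslash escaping, using plain grep on invented sample lines. Shorthand such as \d comes from Perl-style syntax, which GNU grep only accepts in its -P mode, so this example sticks to the characters that basic grep understands.

```shell
# Sample lines (invented for this example).
lines=$(printf '%s\n' 'mint 14' 'linux mint' 'v1.4' 'v114')

# "^" anchors the match to the beginning of the line.
starts=$(printf '%s\n' "$lines" | grep '^mint')

# "$" anchors the match to the end of the line.
ends=$(printf '%s\n' "$lines" | grep 'mint$')

# An unescaped dot matches any single character, so "v114" matches too.
any=$(printf '%s\n' "$lines" | grep 'v1.4')

# An escaped dot matches only a literal dot.
literal=$(printf '%s\n' "$lines" | grep 'v1\.4')

printf 'starts: %s\nends: %s\nany: %s\nliteral: %s\n' \
  "$starts" "$ends" "$(printf '%s' "$any" | tr '\n' ' ')" "$literal"
```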

Define a character class by putting characters between square brackets. Any one character in the class will match one character in the input.

[abc] Match any of a, b, and c.
[a-z] Match any character between a and z. (ASCII order)
[^abc] A caret ^ at the beginning indicates “not”.
[+*?.] Most special characters have no meaning inside the square brackets.

Group expressions using parentheses “(” and “)”. The vertical bar “|” is a Boolean OR operator.
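
Character classes, grouping and alternation can be tried out in the same way. This is only a sketch with made-up words, again assuming grep -E with -x so the whole line has to match.

```shell
# Sample input (invented for this example).
words=$(printf '%s\n' cat cot cut grey gray griy)

# [ao] matches exactly one character, either "a" or "o".
class=$(printf '%s\n' "$words" | grep -Ex 'c[ao]t')

# [^ao] matches any single character except "a" or "o".
negated=$(printf '%s\n' "$words" | grep -Ex 'c[^ao]t')

# (e|a) groups two alternatives separated by the OR operator.
alternation=$(printf '%s\n' "$words" | grep -Ex 'gr(e|a)y')

printf 'class:\n%s\n\nnegated:\n%s\n\nalternation:\n%s\n' \
  "$class" "$negated" "$alternation"
```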

[35:30] grep

Online Resources:

GNU Grep (gnu.org)
Linux Guide: An Introduction to grep (brad4l.hubpages.com)
How to use grep (sdsu.edu)
Drew’s grep tutorial (uccs.edu)

The grep command searches one or more input files for lines containing a match to a specified pattern. By default, grep prints the matching lines. If no filename is given on the command line, grep searches standard input.

You use grep in the following manner:

$ grep [OPTIONS] PATTERN [FILENAME …]

Common OPTIONS include:

-h – if you search more than one file at a time, the results contain the name of the file from which the string was found. (See the example using ‘quite the’). This option turns off that feature, giving you only the lines without the file name.
-n – precedes each line with the line number where it was found
-i – tells grep to ignore case so that it treats “the” and “The” as the same word
-l – displays a list of files that contain the string
-w – restricts the search to whole words only
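
To see what each of these options does, here is a minimal sketch that runs them against two throwaway files. The file names (a.txt, b.txt) and their contents are invented for the example.

```shell
# Create two small sample files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'The cat sat' 'on the mat' 'then left' > "$dir/a.txt"
printf '%s\n' 'nothing here' 'quite the day' > "$dir/b.txt"

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'the' "$dir/a.txt")

# -i: ignore case, so "The" matches as well.
nocase=$(grep -i 'the' "$dir/a.txt")

# -w: whole words only, so "then" no longer matches.
whole=$(grep -w 'the' "$dir/a.txt")

# -l: list only the names of the files that contain a match.
files=$(cd "$dir" && grep -l 'the' a.txt b.txt)

# -h: search several files but leave the file names out of the output.
nonames=$(grep -h 'quite the' "$dir/a.txt" "$dir/b.txt")

printf '%s\n' "$numbered" "$whole" "$files" "$nonames"
rm -rf "$dir"
```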

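PATTERN is a regular expression. grep understands “basic” (BRE), “extended” (ERE) and “Perl-compatible” (PCRE) expressions. In GNU grep, there is no difference in available functionality between the basic and extended syntaxes; they differ only in which special characters must be preceded by a backslash.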
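
The practical difference between basic and extended patterns shows up with characters such as the braces. The sketch below uses two invented input lines and the {n} quantifier: in a basic pattern the braces must be escaped to act as a quantifier, while with -E they work as written.

```shell
# Two sample lines: one with a doubled "b", one with literal braces.
text=$(printf '%s\n' 'abb' 'ab{2}')

# Basic syntax: \{2\} is the quantifier, so this matches "abb".
bre=$(printf '%s\n' "$text" | grep 'ab\{2\}')

# Extended syntax: bare {2} is the quantifier, so this also matches "abb".
ere=$(printf '%s\n' "$text" | grep -E 'ab{2}')

# Basic syntax with bare braces: they are ordinary characters here,
# so only the line containing a literal "ab{2}" matches.
bre_literal=$(printf '%s\n' "$text" | grep 'ab{2}')

printf 'bre: %s\nere: %s\nbre_literal: %s\n' "$bre" "$ere" "$bre_literal"
```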

The “-f FILE” or “--file=FILE” option allows you to specify the name of a FILE containing regular expressions to be used for PATTERN, one per line.
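
A short sketch of the -f option. The pattern file (patterns.txt) and the message file (messages.txt) are hypothetical names with invented contents; a line is printed if it matches any one of the patterns.

```shell
dir=$(mktemp -d)

# One pattern per line.
printf '%s\n' 'usb' 'tty' > "$dir/patterns.txt"

# Some invented log-style lines to search.
printf '%s\n' 'usb 1-1: new device' 'eth0: link up' 'ttyS0 at I/O 0x3f8' \
  > "$dir/messages.txt"

# Print every line that matches at least one pattern from the file.
found=$(grep -f "$dir/patterns.txt" "$dir/messages.txt")

printf '%s\n' "$found"
rm -rf "$dir"
```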

A common way to use grep is as a filter on the output from another program. This is the most common way people first encounter grep.

See if a process named “firefox” is running:

$ ps -A | grep firefox

Get kernel messages related to USB devices:

$ dmesg | grep -i usb

Show the serial ports on the machine:

$ dmesg | grep -i tty

Show how much RAM is available on the system:

$ dmesg | grep -i memory

[44:55] sed

Online Resources:

sed, a stream editor (gnu.org)
SED – The Stream Editor (grymoire.com)

sed (stream editor) is a Unix utility that parses text and implements a programming language which can apply transformations to such text. It reads input line by line (sequentially), applying the operation which has been specified via the command line (or a sed script), and then outputs the line. It was developed from 1973 to 1974 as a Unix utility by Lee E. McMahon of Bell Labs, and is available today for most operating systems.

Sed and Awk (or Gawk) both have their origins in the line editor ed.

Sed works by specifying a pattern to match, and a procedure (or action) to perform, as does Awk.

There are two ways to invoke sed and awk: either you specify your editing instructions on the command line or you put them in a file and supply the name of the file.

sed is very useful for transforming text in a file or series of files. There are several usages that are very common. Perhaps the most common use is for substitution, accomplished like this:

sed 's/pattern_to_match/replacement/' input_file
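
For instance, with an invented input line, the substitution looks like this. Without a flag, sed replaces only the first match on each line; the g flag replaces every match.

```shell
line='mint is minty'

# Replace only the first occurrence on the line.
first=$(printf '%s\n' "$line" | sed 's/mint/Mint/')

# The g flag replaces every occurrence on the line.
all=$(printf '%s\n' "$line" | sed 's/mint/Mint/g')

printf '%s\n%s\n' "$first" "$all"
```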

Printing is another common usage:

Print the single line at a given line number (here, line 5):

sed -n 5p input_file

Print lines 1 thru 10:

sed -n 1,10p input_file

In these examples, the -n option suppresses the default output of all lines, while the p command prints the selected lines.
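
A quick way to convince yourself of this is to try it on a few invented lines. The last command in this sketch leaves out -n to show that p then prints the line in addition to the default output.

```shell
# Five sample lines (invented for this example).
input=$(printf '%s\n' one two three four five)

# Print only line 3.
third=$(printf '%s\n' "$input" | sed -n '3p')

# Print lines 2 through 4.
range=$(printf '%s\n' "$input" | sed -n '2,4p')

# Without -n the line is printed twice: once by p, once by default.
doubled=$(printf 'one\n' | sed 'p')

printf '%s\n---\n%s\n---\n%s\n' "$third" "$range" "$doubled"
```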

By default, sed directs all output to STDOUT. If you want to capture the output, you need to redirect it to a file with the > or >> symbols and a filename.

The -i switch allows for in-place editing, rather than directing output to STDOUT.

sed acts on each line in a file, reading the line into a buffer, then applying the specified actions to that line before moving on to the next line. You can specify more than one action to be performed on a line using the -e option.
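
The sketch below tries both -e and -i on a throwaway file (greeting.txt, a made-up name). It gives -i a backup suffix (-i.bak), an assumption made here so that the original contents are kept in greeting.txt.bak; plain -i, as used with GNU sed, keeps no backup.

```shell
dir=$(mktemp -d)
printf '%s\n' 'hello world' > "$dir/greeting.txt"

# Two actions, applied in order to each line; the file itself is untouched.
both=$(sed -e 's/hello/goodbye/' -e 's/world/mint/' "$dir/greeting.txt")

# Edit the file in place, keeping the original as greeting.txt.bak.
sed -i.bak 's/hello/goodbye/' "$dir/greeting.txt"
edited=$(cat "$dir/greeting.txt")
backup=$(cat "$dir/greeting.txt.bak")

printf '%s\n%s\n%s\n' "$both" "$edited" "$backup"
rm -rf "$dir"
```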

Other useful sed commands allow you to delete, append, insert, list, print line number and more.
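
A few of these commands, sketched on invented input: d deletes a line, = prints a line number, and a appends text after a line. The a command is written in its two-line form (a backslash followed by the text on the next line), which should work with GNU sed and other implementations alike.

```shell
# Three sample lines (invented for this example).
input=$(printf '%s\n' alpha beta gamma)

# d deletes the addressed line (here, line 2).
deleted=$(printf '%s\n' "$input" | sed '2d')

# "=" prints a line number; "$" addresses the last line,
# so together they report how many lines there are.
count=$(printf '%s\n' "$input" | sed -n '$=')

# a appends a new line of text after the addressed line.
appended=$(printf '%s\n' "$input" | sed '1a\
added after alpha')

printf '%s\n---\n%s\n---\n%s\n' "$deleted" "$count" "$appended"
```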

[51:49] AWK

Online Resources:

The GNU Awk User’s Guide (gnu.org)
AWK – The basic power tool for UNIX (grymoire.com)

The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk keeps processing input lines in this way until it reaches the end of the input files.

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions.

Paul Rubin wrote the GNU implementation, gawk, in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman and Arnold Robbins thoroughly reworked gawk for compatibility with the newer awk.

Current development focuses on bug fixes, performance improvements, standards compliance, and occasionally, new features. In May of 1997, Jürgen Kahrs felt the need for network access from awk, and with help from Robbins, set about adding features to do this for gawk. At that time, he also wrote the bulk of TCP/IP Internetworking with gawk (a separate document, available as part of the gawk distribution). His code finally became part of the main gawk distribution with gawk version 3.1. John Haque rewrote the gawk internals, in the process providing an awk-level debugger. This version became available as gawk version 4.0, in 2011.

On a Mint 14 Mate machine, awk is a symbolic link to gawk, or GNU awk version 4.0.1. For simplicity’s sake, I will be using the term awk to refer to the utility from here on out.

“AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho

An AWK program is a series of pattern action pairs, written as:

condition { action }

where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record.
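
The three forms (condition with action, condition only, action only) can be sketched on a few invented records like this:

```shell
# Sample records: a name and a version number (invented for this example).
input=$(printf '%s\n' 'mint 14' 'fedora 18' 'mint 13')

# Condition and action: for records matching /mint/, print the second field.
versions=$(printf '%s\n' "$input" | awk '/mint/ { print $2 }')

# Condition only: the default action prints the whole record.
newer=$(printf '%s\n' "$input" | awk '$2 > 13')

# Action only: the action runs for every record.
names=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n---\n%s\n---\n%s\n' "$versions" "$newer" "$names"
```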

AWK also allows for the inclusion of a BEGIN and/or END procedure to be performed before or after the condition/action piece. One use of this functionality can be to include headers and footers in the output of your awk command or script.
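
As a small sketch of the header and footer idea, the following prints a heading before any input is read and a total after the last record. The input lines and the column names are made up for the example.

```shell
report=$(printf '%s\n' 'alpha 3' 'beta 4' |
  awk 'BEGIN { print "name count" }             # header, before any input
       { total += $2; print $1, $2 }            # runs once per record
       END { print "total", total }')           # footer, after the last record

printf '%s\n' "$report"
```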

AWK uses whitespace (any run of spaces or tabs) as its default delimiter, but you can set it to anything you want by using the -F option.
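
For example (a sketch with invented input, including a line in the colon-separated style of /etc/passwd):

```shell
# Default splitting: runs of spaces and tabs separate the fields.
second=$(printf 'alpha   beta\tgamma\n' | awk '{ print $2 }')

# -F: makes the colon the field delimiter instead.
login_shell=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' |
  awk -F: '{ print $1, $7 }')

printf '%s\n%s\n' "$second" "$login_shell"
```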

Like sed, awk can print using the following syntax:

Print entire input_file:

$ awk '{print $0}' input_file

Print the first and fourth fields of each line in input_file:

$ awk '{print $1, $4}' input_file

There are many different ways awk can be leveraged. Just a few include the following:

Add line numbers to a file or files
Double or triple-space a file
Print the total number of words in a file
Convert Unix newlines to Dos (and vice versa)
Delete trailing whitespace from the end of each line
Add spaces or tabs to the beginning of each line
Align text to the left, right or in the center of each line
Perform a “find and replace” on each line (similar to sed)
Emulate head, tail, uniq and grep
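
A few of the items on this list, sketched as one-liners against a throwaway file (sample.txt, a made-up name with invented contents). These are just one way to write each task.

```shell
dir=$(mktemp -d)
# The second line deliberately ends in trailing spaces.
printf '%s\n' 'one two' 'three   ' 'four five six' > "$dir/sample.txt"

# Add line numbers: NR is the number of the current record.
numbered=$(awk '{ print NR ": " $0 }' "$dir/sample.txt")

# Total number of words: NF is the number of fields in each record.
words=$(awk '{ total += NF } END { print total }' "$dir/sample.txt")

# Delete trailing whitespace from the end of each line.
trimmed=$(awk '{ sub(/[ \t]+$/, ""); print }' "$dir/sample.txt")

# Emulate "head -n 2": the condition alone prints the first two records.
first_two=$(awk 'NR <= 2' "$dir/sample.txt")

printf '%s\n---\n%s\n---\n%s\n' "$numbered" "$words" "$trimmed"
rm -rf "$dir"
```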

awk is actually a full-fledged programming language.

[58:20] Case Study – Using grep, sed and AWK

I need to examine the history.log file located in /var/log/apt in order to determine what updates came with LMDE UP 6. The log file is very detailed, and it is hard to discern the needed information. Chopping the data up and presenting it in a different manner could be very helpful.

First, we determine how many lines there are in the file:
$ awk 'END { print NR }' history.log

Now we give each line a number. Sed and AWK don’t need this, but it will help us to identify which lines we need. Remember that AWK does not change the original file and sends its output to STDOUT, so we will need to redirect the output to a new file.

$ awk '{ print NR, $0 }' history.log >> num_history.log

We can also determine needed line numbers by using grep. Knowing the structure of the history.log file, we can search for instances of “Start-Date” using this command:

$ grep -n "Start-Date" history.log

Once we determine the date of the upgrade, we can grep for that, allowing us to determine what lines we need to get to identify the installed and upgraded packages.
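
To make that concrete, here is a sketch against a tiny stand-in for history.log. The dates, package names and versions in it are invented; only the Start-Date / Commandline / Install / Upgrade / End-Date layout follows the log described here.

```shell
dir=$(mktemp -d)

# A made-up, much shortened history.log with two entries.
cat > "$dir/history.log" <<'EOF'
Start-Date: 2012-12-01  10:00:00
Commandline: apt-get install foo
Install: foo:amd64 (1.0)
End-Date: 2012-12-01  10:00:30

Start-Date: 2013-01-05  09:15:00
Commandline: apt-get dist-upgrade
Upgrade: bar:amd64 (1.0, 1.1)
End-Date: 2013-01-05  09:20:00
EOF

# List every entry with the line number it starts on.
starts=$(grep -n 'Start-Date' "$dir/history.log")

# Narrow it down to the (hypothetical) date of the upgrade
# and keep just the line number.
upgrade_line=$(grep -n 'Start-Date: 2013-01-05' "$dir/history.log" | cut -d: -f1)

printf '%s\n\nupgrade entry starts on line %s\n' "$starts" "$upgrade_line"
rm -rf "$dir"
```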

Now, we copy out the chunk of the log that details UP 6 (you will need to look at the file to determine on what lines the update begins and ends). We can then copy those lines to a temp file with this command:

$ sed -n '122,125p' num_history.log >> temp_history.log

Now we have the UP 6 log information in a file. Take a look at it by running:

$ cat temp_history.log

I can see how many columns, or fields, are in each line by running this command:

$ awk '{print NF}' temp_history.log

The output shows that I have four lines that have 4, 23, 539, and 1910 fields respectively. This is based on using a space as the field delimiter. I need to more closely examine the structure of the data in order to effectively leverage awk.

If I run this command,

$ awk '{print $1, $2}' temp_history.log

the output is as follows:

122 Start-Date:
123 Commandline:
124 Install:
125 Upgrade:

We are interested in the third and fourth lines, so we will need to look closer at the structure of those lines.

Scripting

Now we will write a script. We will assume that the user provides the two line numbers we are interested in when he or she invokes the script.

To make things easier, we will create two separate files to deal with the Install and Upgrade sections. This is due to the fact that the two sections structure the data in different ways.

sed -n "${1}p" history.log >> install.log

sed -n "${2}p" history.log >> upgrade.log

Both files present an interesting issue because the fields we need are delimited by a comma, but some of the needed fields also contain a comma. I solved this issue by adding an additional closing parenthesis, then using a parenthesis-comma as the delimiter. This allowed me to split the fields the way we needed to. I then added a newline character at the end of each field, cleaned up the output of the first and last line, and sorted the list. Finally, I directed the output to a new file.
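
The exact commands are in the script linked further down; what follows is only a rough sketch of the same idea, not the script itself. It assumes an Install line in the usual apt format, with invented package names, and it splits on a closing parenthesis followed by a comma and a space, so the commas inside the parentheses are left alone.

```shell
# A made-up Install line; note the comma inside the first set of parentheses.
line='Install: libfoo:amd64 (1.0-1, automatic), bar:amd64 (2.3)'

packages=$(printf '%s\n' "$line" |
  sed 's/^Install: //' |
  awk -F'[)], ' '{
    # Each field is one package; splitting removed its closing parenthesis
    # (except on the last field), so normalise and put it back.
    for (i = 1; i <= NF; i++) {
      field = $i
      sub(/[)]$/, "", field)
      print field ")"
    }
  }' |
  sort)

printf '%s\n' "$packages"
```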

Now I can manipulate this file to my heart’s content. One option is to create a .csv file and open it in LibreOffice Calc. In order to do this, I need to transform the data a little bit.

First we will deal with the upgraded data, saving the output to a temp file.

awk '/Upgraded/' install_list.log | sed s/":"/","/g | sed s/", "/","/g | sed s/" ("/","/ | sed s/") "/","/ | awk '/Upgraded/ {print $0}' > up6.tmp

Then we will do the same for the installed data. Here we need to move some data around to have it line up in the proper column, so we use a slightly different awk command at the end. We will append the output to our temp file.

awk '/Installed/' install_list.log | sed s/":"/","/ | sed s/", "/","/g | sed s/" ("/","/ | sed s/") "/","/ | sed s/"automatic,"/""/g | sed s/"Installed"/" ,Installed"/ | awk -F, '/Installed/ {print $1","$2","$4","$3","$5}' >> up6.tmp

Now we will use awk to print the temp file, piping the output thru the sort utility, to a new .csv file.

awk '{print}' up6.tmp | sort > up6.csv

Lastly, we will open the .csv file in LibreOffice Calc. On some machines this may result in a font error, and the spreadsheet will be populated with strange characters.

localc up6.csv

The whole script is here: grepsedAWK_script.txt. As always, use at your own risk!

Featured Website & Tip:

[1:27:05]

GNU Utilities for Win 32: (unxutils.sourceforge.net) If you wish you had access to one of those nifty UNIX command-line tools we talked about in this episode, but you are stuck running Microsoft Windows, these programs/packages can help.

GnuWin: (gnuwin32.sourceforge.net) Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools. If you don’t care about having a full UNIX shell environment, this is a great download for you.

More Information:

Hosts: James, Rob, Scott

Live Stream (Mondays at 8:00 p.m. Eastern): mintcast.org

Contact Us:

Forum: forums.linuxmint.com
Email: [email protected]
Twitter: @mintCast @Linux_Mint @3dbeef @jamescoyner @txhawkins
IRC: irc.spotchat.org – #mintcast
Google+: mintCast

More Linux Mint info: website, blog, forums, community

Credits: Podcast Entry and exit music provided by Mark Blasco (podcastthemes.com). The podcast’s bumpers were provided by Oscar.
