mintCast 144: grepsedAWK

mintcast144.mp3
mintcast144.ogg

News:

  • [ 9:20] Developers at the Treasury Board of Canada create popular open source project. (wired.com)
  • [13:10] Fedora 18 – Spherical Cow will be released on Jan 15th. (linuxuser.co.uk)
  • [14:45] Lego goes Linux. (internetnews.com)
  • [19:02] IBM’s Watson undergoes brainwashing to forget all the naughty words in the Urban Dictionary. (techdirt.com)
  • [22:00] The White House responds to a petition calling for the construction of a Death Star. (techcrunch.com) (petitions.whitehouse.gov)

The Main Topic:

[26:50] Regular Expressions

Online Resources:

From Wikipedia: In computing, a regular expression is a specific pattern that provides concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.

Basic ideas:

In its simplest form, a regular expression is a string of symbols to match “as is” (e.g., “mint” would match those four characters).

Quantifiers let you match more than one character:

  • * matches any number of what’s before it, from zero to infinity.
  • ? matches zero or one.
  • + matches one or more.
  • {n} matches exactly “n” occurrences.
  • {n,m} matches at least “n” and not more than “m” occurrences.
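A small sketch of the quantifiers above, run through grep -E (extended syntax, so ?, + and {n,m} need no backslashes). The sample words are made up for illustration, and the ^ and $ anchors described further down pin each pattern to the whole line.

```shell
# Made-up sample input: one word per line.
words=$(printf '%s\n' mnt mint miint miiint)

# i*     : zero or more "i"    -> all four words
star=$(printf '%s\n' "$words" | grep -E '^mi*nt$')

# i?     : zero or one "i"     -> mnt, mint
opt=$(printf '%s\n' "$words" | grep -E '^mi?nt$')

# i+     : one or more "i"     -> mint, miint, miiint
plus=$(printf '%s\n' "$words" | grep -E '^mi+nt$')

# i{2,3} : two or three "i"    -> miint, miiint
range=$(printf '%s\n' "$words" | grep -E '^mi{2,3}nt$')

printf 'star:\n%s\nopt:\n%s\nplus:\n%s\nrange:\n%s\n' \
  "$star" "$opt" "$plus" "$range"
```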

Some special characters are used to match things:

  • . – The dot matches any single character.
  • \n – Matches a newline character (or CR+LF combination).
  • \t – Matches a tab (ASCII 9).
  • \d – Matches a digit [0-9].
  • \D – Matches a non-digit.
  • \w – Matches an alphanumeric character (letter, digit or underscore).
  • \W – Matches a non-alphanumeric character.
  • \s – Matches a whitespace character.
  • \S – Matches a non-whitespace character.
  • \ – “Escape” special characters. For example, \. matches a dot, and \\ matches a backslash.
  • ^ – Match at the beginning of the input string.
  • $ – Match at the end of the input string.

Group characters by putting them between square brackets. This way, any character in the class will match one character in the input.

  • [abc] Match any of a, b, and c.
  • [a-z] Match any character between a and z. (ASCII order)
  • [^abc] A caret ^ at the beginning indicates “not”.
  • [+*?.] Most special characters have no meaning inside the square brackets.

Group expressions using parentheses “(“ and “)”. The vertical bar “|” is a Boolean OR operator
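Here is a rough sketch of character classes, escaping, grouping and alternation, again using grep -E. The input lines are invented for the example, and [0-9] is used instead of \d because grep's basic and extended syntaxes do not support the \d shorthand.

```shell
# Made-up sample input.
lines=$(printf '%s\n' 'cat' 'cot' 'cut' 'c.t' 'mint 14' 'mate 14' 'kde')

# [ao] matches exactly one character out of the class -> cat, cot
class=$(printf '%s\n' "$lines" | grep -E '^c[ao]t$')

# [^a-z] matches one character that is NOT a lowercase letter -> c.t
negated=$(printf '%s\n' "$lines" | grep -E '^c[^a-z]t$')

# \. matches a literal dot instead of "any character" -> c.t
dot=$(printf '%s\n' "$lines" | grep -E '^c\.t$')

# (mint|mate) groups two alternatives; [0-9]+ then requires digits
alt=$(printf '%s\n' "$lines" | grep -E '^(mint|mate) [0-9]+$')

printf '%s\n' "$class" "$negated" "$dot" "$alt"
```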

[35:30] grep

The grep command searches one or more input files for lines containing a match to a specified pattern. By default, grep prints the matching lines. If no filename is given on the command line, grep searches standard input.

You use grep in the following manner:

$ grep [OPTIONS] PATTERN [FILENAME …]

Common OPTIONS include:

  • -h – if you search more than one file at a time, the results contain the name of the file from which the string was found. (See the example using ‘quite the’). This option turns off that feature, giving you only the lines without the file name.
  • -n – precedes each line with the line number where it was found
  • -i – tells grep to ignore case so that it treats “the” and “The” as the same word
  • -l – displays a list of files that contain the string
  • -w – restricts the search to whole words only
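To make the options above concrete, here is a minimal sketch that runs each one against three throwaway files. The file names and their contents are made up for the example.

```shell
# Work in a throwaway directory with made-up files.
dir=$(mktemp -d)
printf 'The quick fox\nnothing here\nbathe daily\nwalk the dog\n' > "$dir/a.txt"
printf 'other text\n' > "$dir/b.txt"
printf 'plain words\n' > "$dir/c.txt"

# -n: prefix each matching line with its line number
numbered=$(grep -n 'the' "$dir/a.txt")

# -i: ignore case, so "The" matches as well
nocase=$(grep -i 'the' "$dir/a.txt")

# -w: whole words only, so "bathe" no longer counts
whole=$(grep -w 'the' "$dir/a.txt")

# -l: list only the names of the files that contain a match
files=$(cd "$dir" && grep -l 'the' a.txt b.txt c.txt)

# -h: search several files, but leave the file name off each result
nonames=$(grep -h 'the' "$dir/a.txt" "$dir/b.txt")

printf '%s\n' "$numbered" "$nocase" "$whole" "$files" "$nonames"
rm -r "$dir"
```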

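PATTERN is a regular expression. grep understands “basic” (BRE), “extended” (ERE) and “Perl” (PCRE) expressions. In GNU grep, the basic and extended syntaxes offer the same functionality; they differ only in whether the characters ?, +, {, |, ( and ) must be backslash-escaped to act as operators (in basic syntax they must, in extended syntax they work as-is).

The “-f FILE” or “--file=FILE” option allows you to specify the name of a FILE containing regular expressions to be used for PATTERN, one per line.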
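As a quick illustration of -f, the sketch below keeps two patterns in a file and searches a small log with both at once. The file names patterns.txt and sample.log, and their contents, are made up.

```shell
dir=$(mktemp -d)

# patterns.txt: one regular expression per line (made-up patterns)
printf '^usb\ntty[0-9]\n' > "$dir/patterns.txt"

# sample.log: made-up input
printf 'usb 1-1: new device\nserial tty0 enabled\nmemory ok\n' > "$dir/sample.log"

# A line is printed if it matches ANY pattern in the file.
matched=$(grep -f "$dir/patterns.txt" "$dir/sample.log")
printf '%s\n' "$matched"

rm -r "$dir"
```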

A common way to use grep is as a filter on the output from another program. This is the most common way people first encounter grep.

See if a process named “firefox” is running:

$ ps -A | grep firefox

Get kernel messages related to USB devices:

$ dmesg | grep -i usb

Show the serial ports on the machine:

$ dmesg | grep -i tty

Show how much RAM is available on the system:

$ dmesg | grep -i memory

[44:55] sed

sed (stream editor) is a Unix utility that parses text and implements a programming language which can apply transformations to such text. It reads input line by line (sequentially), applying the operation which has been specified via the command line (or a sed script), and then outputs the line. It was developed from 1973 to 1974 as a Unix utility by Lee E. McMahon of Bell Labs, and is available today for most operating systems.

Sed and Awk (or Gawk) both have their origins in the line editor ed.

Sed works by specifying a pattern to match, and a procedure (or action) to perform, as does Awk.

There are two ways to invoke sed and awk: either you specify your editing instructions on the command line or you put them in a file and supply the name of the file.

sed is very useful for transforming text in a file or series of files. There are several usages that are very common. Perhaps the most common use is for substitution, accomplished like this:

sed 's/pattern to match/replacement text/' input_file
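A minimal sketch of substitution on a made-up line of text. Note that a plain s/// only replaces the first match on each line; adding the g flag replaces all of them.

```shell
text='mint mint mate'

# No flag: only the first match on the line is replaced.
first=$(printf '%s\n' "$text" | sed 's/mint/Mint/')

# g flag: every match on the line is replaced.
all=$(printf '%s\n' "$text" | sed 's/mint/Mint/g')

printf '%s\n%s\n' "$first" "$all"
```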

Printing is another common usage:

Print the single line whose line number is N (replace N with the number you want):

sed -n 'Np' input_file

Print lines 1 through 10:

sed -n '1,10p' input_file

In these examples, the -n option suppresses the default output of all lines, while the p command prints the matching lines.
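To try the printing commands without a real file, the sketch below uses seq to stand in for a numbered input (an assumption for the demo; any file works the same way). It also shows why -n matters: without it, the selected line comes out twice.

```shell
# seq 1 20 stands in for a 20-line input file.
one=$(seq 1 20 | sed -n '5p')        # line 5 only
range=$(seq 1 20 | sed -n '1,10p')   # lines 1 through 10

# Without -n, sed prints every line AND p prints line 2 again.
doubled=$(seq 1 3 | sed '2p')

printf '%s\n' "$one" "$range" "$doubled"
```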

By default, sed directs all output to STDOUT. If you want to capture the output, you need to redirect it to a file with the > or >> symbols and a filename.

The -i switch allows for in-place editing, rather than directing output to STDOUT.

sed acts on each line in a file, reading the line into a buffer, then applying the specified actions to that line before moving on to the next line. You can specify more than one action to be performed on a line by giving the -e option once per action.
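A short sketch of -e and -i on a made-up file called greeting.txt. The .bak suffix attached to -i is my own addition here: it asks sed to keep a backup of the original, which seems prudent when editing in place.

```shell
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/greeting.txt"

# Two -e actions, applied in order to each line; output goes to STDOUT.
both=$(sed -e 's/hello/goodbye/' -e 's/world/moon/' "$dir/greeting.txt")

# -i changes the file itself; the original is kept as greeting.txt.bak.
sed -i.bak 's/world/mint/' "$dir/greeting.txt"
edited=$(cat "$dir/greeting.txt")
backup=$(cat "$dir/greeting.txt.bak")

printf '%s\n' "$both" "$edited" "$backup"
rm -r "$dir"
```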

Other useful sed commands allow you to delete, append, insert, list, print line number and more.
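A few of those other commands, sketched on made-up input. The one-line forms of a (append) and i (insert) shown here are an assumption that you are running GNU sed, as on Linux Mint; other sed implementations may want the text on a separate line after a backslash.

```shell
input=$(printf '%s\n' '# comment' 'alpha' 'beta')

# d: delete every line that matches the pattern
deleted=$(printf '%s\n' "$input" | sed '/^#/d')

# =: print the line number of each matching line
linenum=$(printf '%s\n' "$input" | sed -n '/beta/=')

# a: append a line after line 2 (GNU sed one-line form)
appended=$(printf '%s\n' "$input" | sed '2a gamma')

# i: insert a line before line 1 (GNU sed one-line form)
inserted=$(printf '%s\n' "$input" | sed '1i header')

printf '%s\n' "$deleted" "$linenum" "$appended" "$inserted"
```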

[51:49] AWK

The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk keeps processing input lines in this way until it reaches the end of the input files.

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions.

Paul Rubin wrote the GNU implementation, gawk, in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman and Arnold Robbins thoroughly reworked gawk for compatibility with the newer awk.

Current development focuses on bug fixes, performance improvements, standards compliance, and occasionally, new features. In May of 1997, Jürgen Kahrs felt the need for network access from awk, and with help from Robbins, set about adding features to do this for gawk. At that time, he also wrote the bulk of TCP/IP Internetworking with gawk (a separate document, available as part of the gawk distribution). His code finally became part of the main gawk distribution with gawk version 3.1. John Haque rewrote the gawk internals, in the process providing an awk-level debugger. This version became available as gawk version 4.0, in 2011.

On a Mint 14 Mate machine, awk is a symbolic link to gawk, or GNU awk version 4.0.1. For simplicity’s sake, I will be using the term awk to refer to the utility from here on out.

“AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho

An AWK program is a series of pattern action pairs, written as:

condition { action }

where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record.

AWK also allows for the inclusion of a BEGIN and/or END procedure to be performed before or after the condition/action piece. One use of this functionality is to include headers and footers in the output of your awk command or script.
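A minimal sketch of the header and footer idea, using a made-up list of three items as input. BEGIN runs once before any input is read, the middle action runs for every record, and END runs once after the last record.

```shell
report=$(printf '%s\n' apple banana cherry |
  awk 'BEGIN { print "Fruit list" }       # header, before any input
       { print NR ". " $0 }               # body, once per record
       END { print NR " items total" }')  # footer, after the last record

printf '%s\n' "$report"
```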

AWK uses a space or tab as its default delimiter, but you can set it to anything you want by using the -F option.
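To illustrate the delimiter, here is a sketch with a made-up line in the style of /etc/passwd, where fields are separated by colons. The second command shows the default behavior: runs of spaces and tabs count as a single separator.

```shell
# Made-up colon-separated line, in the style of /etc/passwd.
line='james:x:1000:1000:James:/home/james:/bin/bash'

# -F: sets the field separator to a colon; print fields 1 and 7.
fields=$(printf '%s\n' "$line" | awk -F: '{ print $1, $7 }')

# Default separator: any run of spaces and/or tabs.
second=$(printf 'one  two\tthree\n' | awk '{ print $2 }')

printf '%s\n%s\n' "$fields" "$second"
```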

Like sed, awk can print using the following syntax:

Print entire input_file:

$ awk '{print $0}' input_file

Print the first and fourth fields of each line in input_file:

$ awk '{print $1, $4}' input_file

There are many different ways awk can be leveraged. Just a few include the following:

  • Add line numbers to a file or files
  • Double or triple-space a file
  • Print the total number of words in a file
  • Convert Unix newlines to Dos (and vice versa)
  • Delete trailing whitespace from the end of each line
  • Add spaces or tabs to the beginning of each line
  • Align text to the left, right or in the center of each line
  • Perform a ”find and replace” on each line (similar to sed)
  • Emulate head, tail, uniq and grep
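A handful of the uses from the list above, sketched as one-liners against a made-up three-line file. These are just one way to write each of them.

```shell
dir=$(mktemp -d)
# Made-up sample file; the second line has trailing spaces.
printf 'one two\nthree   \nfour five six\n' > "$dir/sample.txt"

# Add line numbers.
numbered=$(awk '{ print NR ": " $0 }' "$dir/sample.txt")

# Count the total number of words (fields) in the file.
words=$(awk '{ total += NF } END { print total }' "$dir/sample.txt")

# Delete trailing whitespace from the end of each line.
trimmed=$(awk '{ sub(/[ \t]+$/, ""); print }' "$dir/sample.txt")

# Emulate "head -n 2": a condition with no action prints the record.
head2=$(awk 'NR <= 2' "$dir/sample.txt")

# Emulate "grep four".
greplike=$(awk '/four/' "$dir/sample.txt")

printf '%s\n' "$numbered" "$words" "$trimmed" "$head2" "$greplike"
rm -r "$dir"
```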

awk is actually a full-fledged programming language.

[58:20] Case Study – Using grep, sed and AWK

I need to examine the history.log file located in /var/log/apt in order to determine what updates came with LMDE UP 6. The log file is very detailed and it is hard to discern the needed information. Chopping the data up and presenting it in a different manner could be very helpful.

First, we determine how many lines there are in the file:
$ awk 'END { print NR }' history.log

Now we give each line a number. Sed and AWK don’t need this, but it will help us to identify which lines we need. Remember that AWK does not change the original file and sends its output to STDOUT, so we will need to redirect the output to a new file.

$ awk '{ print NR, $0 }' history.log >> num_history.log

We can also determine needed line numbers by using grep. Knowing the structure of the history.log file, we can search for instances of “Start-Date” using this command:

$ grep -n "Start-Date" history.log

Once we determine the date of the upgrade, we can grep for that, allowing us to determine what lines we need to get to identify the installed and upgraded packages.
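To show what that search looks like, here is a sketch against a tiny stand-in for history.log. The dates, commands and package names in it are invented; a real log will be much longer.

```shell
dir=$(mktemp -d)
# Made-up miniature of an apt history.log.
cat > "$dir/history.log" <<'EOF'
Start-Date: 2013-01-10  09:15:02
Commandline: apt-get install foo
Install: foo:amd64 (1.0)
End-Date: 2013-01-10  09:15:40

Start-Date: 2013-01-12  18:30:11
Commandline: apt-get dist-upgrade
Upgrade: bar:amd64 (1.0, 1.1)
End-Date: 2013-01-12  18:42:07
EOF

# Where does each transaction start?
starts=$(grep -n 'Start-Date' "$dir/history.log")

# Once we know the date, grep for it to find the first and last line.
dated=$(grep -n '2013-01-12' "$dir/history.log")

printf '%s\n' "$starts" "$dated"
rm -r "$dir"
```

The numbers in front of each result are the line numbers to hand to sed -n in the next step.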

Now, we copy out the chunk of the log that details UP 6. You will need to look at the file to determine on what lines the update begins and ends. We can then copy those lines to a temp file with this command:

$ sed -n '122,125p' num_history.log >> temp_history.log

Now we have the UP 6 log information in a file. Take a look at it by running

$ cat temp_history.log

I can see how many columns, or fields, are in each line by running this command:

$ awk '{print NF}' temp_history.log

The output shows that I have four lines that have 4, 23, 539, and 1910 fields respectively. This is based on using a space as the field delimiter. I need to more closely examine the structure of the data in order to effectively leverage awk.

If I run this command,

$ awk '{print $1, $2}' temp_history.log

the output is as follows:

122 Start-Date:
123 Commandline:
124 Install:
125 Upgrade:

We are interested in the third and fourth lines, so we will need to look closer at the structure of those lines.

Scripting

Now we will write a script. We will assume that the user provides the two line numbers we are interested in when he or she invokes the script.

To make things easier, we will create two separate files to deal with the Install and Upgrade sections. This is due to the fact that the two sections structure the data in different ways.

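# $1 and $2 are the two line numbers given to the script. Reading from history.log is an assumption; use whichever file the line numbers refer to.

sed -n "${1}p" history.log >> install.log

sed -n "${2}p" history.log >> upgrade.log

sed s/")"/"))"/g install.log | sed s/"), "/"\n"/g | sed s/"))"/")"/g | sed s/")"/") Installed"/g | sed s/"Install: "/""/ | sort > install_list.log

sed s/")"/"))"/g upgrade.log | sed s/"), "/"\n"/g | sed s/"))"/")"/g | sed s/")"/") Upgraded"/g | sed s/"Upgrade: "/""/ | sort >> install_list.log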

Both files present an interesting issue because the fields we need are delimited by a comma, but some of the needed fields also contain a comma. I solved this issue by doubling each closing parenthesis, then using a parenthesis followed by a comma as the delimiter. This allowed me to split the fields the way we needed to. I replaced each delimiter with a newline character so that every field sits on its own line, cleaned up the output of the first and last line, and sorted the list. Finally, I directed the output to a new file.
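Here is the same trick on a small scale, as a sketch rather than the script itself. The Install line and its package names are made up, and writing \n in the replacement to get a newline assumes GNU sed.

```shell
# Made-up Install line; note the comma INSIDE the first parentheses.
line='Install: libfoo:amd64 (1.0, automatic), bar:amd64 (2.1)'

pkgs=$(printf '%s\n' "$line" |
  sed 's/)/))/g' |              # double every closing parenthesis
  sed 's/), /\n/g' |            # "), " can now only be a real field boundary
  sed 's/))/)/g' |              # undo the doubling on the last field
  sed 's/)/) Installed/g' |     # tag each package
  sed 's/Install: //' |         # drop the label from the first line
  sort)

printf '%s\n' "$pkgs"
```

The comma inside "(1.0, automatic)" survives because it is not preceded by a closing parenthesis, so only the commas between packages are turned into line breaks.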

Now I can manipulate this file to my heart’s content. One option is to create a .csv file and open it in LibreOffice Calc. In order to do this, I need to transform the data a little bit.

First we will deal with the upgraded data, saving the output to a temp file.

awk '/Upgraded/' install_list.log | sed s/":"/","/g | sed s/", "/","/g | sed s/" ("/","/ | sed s/") "/","/ | awk '/Upgraded/ {print $0}' > up6.tmp

Then we will do the same for the installed data. Here we need to move some data around to have it line up in the proper column, so we use a slightly different awk command at the end. We will append the output to our temp file.

awk '/Installed/' install_list.log | sed s/":"/","/ | sed s/", "/","/g | sed s/" ("/","/ | sed s/") "/","/ | sed s/"automatic,"/""/g | sed s/"Installed"/" ,Installed"/ | awk -F, '/Installed/ {print $1","$2","$4","$3","$5}' >> up6.tmp

Now we will use awk to print the temp file, piping the output thru the sort utility, to a new .csv file.

awk '{print}' up6.tmp | sort > up6.csv

Lastly, we will open the .csv file in LibreOffice Calc. On some machines this may result in a font error, and the spreadsheet will be populated with strange characters.

localc up6.csv

The whole script is here: grepsedAWK_script.txt. As always, use at your own risk!

Featured Website & Tip:

[1:27:05]

  • GNU Utilities for Win 32: (unxutils.sourceforge.net) If you wish you had access to one of those nifty UNIX command-line tools we talked about in this episode, but you are stuck running Microsoft Windows, these programs/packages can help.
  • GnuWin: (gnuwin32.sourceforge.net) Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools. If you don’t care about having a full UNIX shell environment, this is a great download for you.

More Information:

Hosts: James, Rob, Scott

Live Stream (Mondays at 8:00 p.m. Eastern): mintcast.org

Contact Us:

More Linux Mint info: website, blog, forums, community

Credits: Podcast Entry and exit music provided by Mark Blasco (podcastthemes.com). The podcast’s bumpers were provided by Oscar.

10 thoughts on “mintCast 144: grepsedAWK”

  1. I would like you to review Fedora 18 and compare differences to Mint and see what you think about it.

  2. I really like this episode. Please keep doing such a professional podcast (I’ve already new sed, grep and awk commands but nevertheless you’ve succeed to learn me more about it!)

  3. Feedback on command line utilities: I don’t think that radio is the correct medium for explaining very detailed explanations of how commands are used. Unless one is very conversant with the commands, it is impossible to assimilate all the minutiae of brackets and arguments etc. Discussing the relative merits of ark versus sed is useful. Emphasis, however,should be placed on the ‘benefits’ of using these commands and not how they are used in any great detail as this is more suited to a text medium, in this case ‘show notes’.
    For example it is sufficient to say something like ‘…and the output of the sed command is fed to grep via a pipe to give us the desired output….’ no mention of brackets and arguments etc.
    I like the show, it is still the one I look forward to most …. but James, where are you? 🙂

  4. Like everybody else, I have a comfort level with the command-line.
    My daily use of the terminal is due to speed and ease compared to a GUI. Throw in the factor that the command-line works on any distro, running any desktop or windows-manager, iOS, and even some Windows PowerShell, and you find yourself with an indispensable skill-set.
    It’s painful to learn, though. Reading most man-pages usually sends me googling for real-world examples. I pick up most of what use from someone on a forum telling me to “copy and paste this into the terminal.” Once I see the carrot, I’ll find a way to get there – sticks occur with or without the carrot, so you might as well figure it out.
    I’ve learned over time that I can usually pick up something good listening to a podcast on a subject I have zero interest in, like GREP, SED, and AWK. Forty minutes into the podcast, I found myself thinking, “oh – how much you wanna bet Google Docs uses a GREP command on the back-end when I use the search feature?”
    So; thanks for the podcast, guys. Thanks for the epic show-notes. All I gotta do now is incorporate the new tools into the rest of the toolbox.

    As to the Death Star petition, you know for every ten White House Staffer’s who could have written a responce to this, nine wanted nothing to do with it, but the tenth was a Geek, screaming “MemeMEmmeme! I was BORN to write this!”
    Toby and Sam from the West Wing looked at one another, and said, “don’t embarrass us too badly, and keep the President’s name off of it.”

  5. I really enjoyed this episode and would like to hear more of this kind of info on your podcast. Yes – some things are learned easier with a text medium – but that’s where the show notes come in.

    I’ve used linux for the desktop for several years – but need to learn a lot more about command-line entries.

  6. Hi,

    I had troubles with audio content from the various podcasts I subscribe to.. I now use mp3gain to normalise the audio gain to a non clipping level. This is a good program because is analises the whole file and adjusts to suit.

    There is a similar one for ogg called vorbisgain, but it doesn’t do the same job.

    Perhaps a nice way to get consistency to the levels would be to export to mp3, run mp3gain, then export that to vorbis. Just an idea.

    I use mp3gain in a script, works well.

  7. Hi, I also found the segment on SED AWK and GREP very informative and interesting. Starting to learn commands always has a hurdle and for you to do the leg work and showing how it can be applied and used in real life is also tremendously helpful. So thank you very much and I would like to see more of shows like these coming.

  8. Thanks for all your good work. I really liked your approach to grep, sed and awk. The level of detail presented was just about right, and the practical application to a real world project was great! Being brave enough to show some false starts and the opportunities for improvement works as a good example and a bit of role-modeling.

    Some say command line content does not make good radio. Ok. But as long as one doesn’t get too carried away (Klaatu on gnuWorldOrder reads too many detailed commands) you can generate some real interest and point it in the right direction. I realize a lot of hard work went into the prep for this feature, so I wouldn’t expect to see them as more than an occasional feature, but please continue! You really hit the bullseye this time.
