Awk

What it is: A language designed for searching files for records or columns that contain certain patterns and then performing actions on those records.

How it's built: 'True' awk was developed in the 1970s and '80s and is implemented in C. The source can be found on GitHub.

How to use it: Awk is often used as a command in the format awk '/regex/ { do something; }' myfile, but because it has a rich set of built-in variables and flow-control constructs, longer scripts with more complex effects can be achieved.
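For instance, a minimal one-liner might look like the following. (The log lines here are made up for illustration; in practice you would pass a filename instead of piping input in.)

```shell
# Print the second field of every line matching ERROR.
printf 'INFO start\nERROR disk full\nINFO done\n' |
  awk '/ERROR/ { print $2 }'
# prints: disk
```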

THE STORY

Awk! Awk! I love the sound of this instrument. It makes me think of a large bird making funny vocalizations. It doesn't hurt that it sounds like an abbreviation of the word awkward. Awk!

I think it has a bad reputation. I told a friend I was learning awk and he was like yuck. Ask around.

It's an interesting tool to me because I was introduced to it as a simple search tool for column-based data, to be used like awk '/regex/' myfile. It wasn't until I did my shell scripting deep-dive this year that I learned awk is actually a whole language. That blew my mind. It's not just for passing a flag or two. Awk has real power and flexibility for dealing with columnar data.
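To give a taste of that "whole language" claim: awk has variables, arithmetic, and control flow, so you can do things like accumulate a running total without any pipeline gymnastics. (The input numbers here are just an illustration.)

```shell
# Sum the first column of the input using an awk variable.
printf '3\n4\n5\n' | awk '{ total += $1 } END { print total }'
# prints: 12
```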

For example, let's say we have a CSV file with 50K records in it, where the first field is an account number and the third field is an account name. And let's say I need to find every record that has an account number starting with '40-401-' followed by a long string of numerals. I need a count of how many records fit this pattern, I need the names on each of these accounts, and I need it all quickly.

One way I could do this would be opening the file in VSCode and then doing a search for '40-401-'. There are some problems with this. First, if the file is, say, 500k records long, VSCode might trip and fall trying to open it. Second, when you do Ctrl-F for '40-401' it might lie to you by including records like '41-129-40-401-5'. So you still have to go through and verify each record. Third, remember that you need the names on each account as well. So what is your next step? Copy and paste the name of each record that showed up in your search? Yikes. But I've seen people do exactly this. With giant files.

What if you could do something like

awk 'BEGIN { FS="," ; print "Account Number, Account Name" }
 /^40-401-/ { print $1 "," $3 ; count++ }
 END { print "Total Accounts = " count }' myfile.csv

and voila! Immediate results. Write it to a file and send it off.

Awk has a reputation for being convoluted and complex, but if you are buying that reputation then I think you should take a closer look at the code snippet above. It's actually pretty friendly. There is a core code block { print $1...} preceded by a pattern (in this case the regex /^40-401-/, which says 'only look at lines starting with 40-401-'). To get the number of matching records we simply use a variable count, which we declare and increment all at once. Gosh, how easy. This core code block is flanked by two other code blocks (delineated by {}, again how easy) which use the special words BEGIN and END. BEGIN executes once before any input is read and END executes once after the last line, meaning they are perfect tools for adding a header and a final count to our output. Soooo easy.

And maybe the best part is that you can stash the little awklet into a file, say account_names.awk, and call it whenever you need it, like awk -f account_names.awk myfile.csv. If you really thought you would be using this snippet frequently, you could even wrap the whole command in a shell script and allow it to accept arguments for the regex and the particular columns you want in the output. What. Could. Be. Easier.
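A sketch of what that shell wrapper could look like, using awk's -v flag to pass the pattern and column number in. The function name find_accounts and the demo CSV are made up for illustration; note that awk can address a field through a variable, so $col with col=3 means "the third field".

```shell
# find_accounts: a hypothetical wrapper around the snippet above.
# Args: 1) regex to match against the first field
#       2) number of the column to print as the name
#       3) CSV file to search
find_accounts() {
  awk -v pat="$1" -v col="$2" '
    BEGIN { FS = ","; print "Account Number, Account Name" }
    $1 ~ pat { print $1 "," $col; count++ }
    END { print "Total Accounts = " count }
  ' "$3"
}

# Demo with a throwaway CSV (account number, type, name):
printf '40-401-123,retail,Alice\n99-000-1,retail,Bob\n' > demo.csv
find_accounts '^40-401-' 3 demo.csv
rm -f demo.csv
```

One design note: matching with $1 ~ pat instead of a bare /regex/ restricts the test to the first field, so a stray '40-401-' later in the line can't sneak in.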

So, take another look at awk. You can accomplish a lot with it. Especially if you find yourself working with text data and CSV files frequently. You will likely discover a lot of quick and efficient ways to automate your work. Thanks for reading!