| kyle_burton ( @ 2008-04-28 11:51:00 |
I often prefer the shell and Unix utilities to having to wait to load data into a relational database or MS Access (unless a lot of what I'm doing requires complex joins). It's often possible to not even have to transform the encoding of the files before analyzing them. There are a couple of recipes for doing SQL equivalents with the shell utilities. There are a bunch of them, these are a few that I just used this morning. Most of the time all it takes is a bit of imagination about how to create a simple data-flow by composing a small handful of ubiquitous Unix utilities.
All these examples will also work within the Cygwin environment for Windows, or at a Terminal in OS X (especially when combined with the additional software available via Fink or Mac Ports projects..
"SELECT COUNT(*) FROM TABLE"Just selecting the count of records from an input file is one of the easiest things to accomplish (if your file is already line-oriented). The wc, or word count utility can do this easily. By default it counts characters, words and lines. With '-l' it will emit only the count of lines.
user@host:~/data$ wc -l table.tab 10
If you want to ignore the header, start with the second line (see the next example for a more thorough explanation):
user@host:~/data$ tail -n +2 | table.tab 9"SELECT COUNT(DISTINCT(FIELD1)) FROM TABLE"
For getting a distinct count of values in a column :
user@host:~/data$ cut -f1 table.tab | tail -n +2 | sort | uniq -c
The first part cut is a utility that allows you to take particular columns from a tab-delimited file, or character ranges from a fixed-width file. cut also allows you to specify the delimiter - but be warned that the commonly encountered CSV format requires more complexity in parsing than just using a comma as a delimiter. So this takes the first column out of the input file.
The next part of that is the tail command. tail is a command that outputs the end or 'tail' of a file. The '-n' option says what line number to start at (counted from the end of the file) - in this case the '+' tells tail to start at the second line from the beginning. This effectively tosses out the header line.
Next the values themselves are sorted. This is necessary for the uniq command, which will only collapse or count duplicate lines when they are adjacent.
Finally we reduce duplicate lines with uniq. The '-c' tells it to emit the count of duplicates when collapsing them.
Dealing with varoius file archive types
I often work with files in zip archives and tar (unix tape archive) archives, sometimes with additional compression applied to them (.Z, unix compress; .gz, gzip; and .bz2 bzip). It is possible to work with these files without having to unarchive or decompress them permanently if all you need is a simple count of lines or to only process them once.
Pulling a file from a Zip Archive
To pull one or more files from within a zip archive, and send them to another command (as part of a pipeline):
user@host:~/data$ unzip -l archive.zip
Archive: archive.zip
Length Date Time Name
-------- ---- ---- ----
34 04-28-08 11:01 table1.tab
56 04-28-08 11:01 table1.tab
user@host:~/data$ unzip -c archive.zip table1.tab table2.tab | wc -l
36
That example uses the unzip command to pull 2 files out and send them to standard output - to either the screen or the next command in the pipeline. In this case that is wc to get the combined record count for the two files. We don't have to worry about cleaning up the two files when we're done either.
unzip -l lists the files in a zip archive. If we left off the '-c' (and the '| wc -l') unzip would have extracted just those two files from the archive (in case there were more and you only wanted a small handful).
Dealing with various file encodings
The first barrier to using most of these readily available utilities is often the file formats themselves. The utilities are line-oriented for the records and tab-oriented for the fields. So the first step is often figuring out how to even get your data into a tab-delimited format.
There is more to that than I have time to write right now. I'll work at following up with examples for another posting.