|
I often prefer the shell and Unix utilities to having to wait to
load data into a relational database or MS Access (unless a lot of
what I'm doing requires complex joins). It's often possible to not
even have to transform the encoding of the files before analyzing
them. There are a couple of recipes for doing SQL equivalents with
the shell utilities. There are a bunch of them, these are a few that
I just used this morning. Most of the time all it takes is a bit of
imagination about how to create a simple data-flow by composing a
small handful of ubiquitous Unix utilities.
All these examples will also work within
the Cygwin environment for
Windows, or at a Terminal in OS X (especially when combined with the
additional software available
via Fink
or Mac Ports projects..
"SELECT COUNT(*) FROM TABLE"
Just selecting the count of records from an input file is one of the
easiest things to accomplish (if your file is already line-oriented).
The wc, or word count utility can do this easily. By
default it counts characters, words and lines. With '-l' it will emit
only the count of lines.
user@host:~/data$ wc -l table.tab
10
If you want to ignore the header, start with the second line (see
the next example for a more thorough explanation):
user@host:~/data$ tail -n +2 | table.tab
9
"SELECT COUNT(DISTINCT(FIELD1)) FROM TABLE"
For getting a distinct count of values in a column :
user@host:~/data$ cut -f1 table.tab | tail -n +2 | sort | uniq -c
The first part cut is a utility that allows you to take
particular columns from a tab-delimited file, or character ranges from
a fixed-width file. cut also allows you to specify the
delimiter - but be warned that the commonly encountered CSV format
requires more complexity in parsing than just using a comma as a
delimiter. So this takes the first column out of the input file.
The next part of that is the tail command. tail
is a command that outputs the end or 'tail' of a file. The '-n'
option says what line number to start at (counted from the end of the
file) - in this case the '+' tells tail to start at the
second line from the beginning. This effectively tosses out the
header line.
Next the values themselves are sorted. This is necessary for the
uniq command, which will only collapse or count duplicate
lines when they are adjacent.
Finally we reduce duplicate lines with uniq. The '-c'
tells it to emit the count of duplicates when collapsing them.
Dealing with varoius file archive types
I often work with files in zip archives and tar (unix tape archive)
archives, sometimes with additional compression applied to them (.Z,
unix compress; .gz, gzip; and .bz2 bzip). It is possible to work with
these files without having to unarchive or decompress them permanently
if all you need is a simple count of lines or to only process them
once.
Pulling a file from a Zip Archive
To pull one or more files from within a zip archive, and send them
to another command (as part of a pipeline):
user@host:~/data$ unzip -l archive.zip
Archive: archive.zip
Length Date Time Name
-------- ---- ---- ----
34 04-28-08 11:01 table1.tab
56 04-28-08 11:01 table1.tab
user@host:~/data$ unzip -c archive.zip table1.tab table2.tab | wc -l
36
That example uses the unzip command to pull 2 files out
and send them to standard output - to either the screen or the next
command in the pipeline. In this case that is wc to get the
combined record count for the two files. We don't have to worry about
cleaning up the two files when we're done either.
unzip -l lists the files in a zip archive. If we left off
the '-c' (and the '| wc -l') unzip would have extracted just those two
files from the archive (in case there were more and you only wanted a
small handful).
Dealing with various file encodings
The first barrier to using most of these readily available utilities
is often the file formats themselves. The utilities are line-oriented
for the records and tab-oriented for the fields. So the first step is
often figuring out how to even get your data into a tab-delimited
format.
There is more to that than I have time to write right now. I'll
work at following up with examples for another posting. |