I wrote the Sheriff from scratch, based primarily on the ScanMaster list and later the Qx list. After it was done I started looking at other lists that I could find. This page describes my thoughts and feelings regarding the suitability of certain formats for the Sheriff alone. One always gets the question: why don't you read list such and so? Well, this page describes why. As related to the Sheriff.
A secondary reason for the extent of this "diatribe" is to provide an overview of the variety of formats and to attempt to put them into historic perspective. By doing this I hope to contribute a little to the process of coming to standardized lists in the general sense.
Maybe there already is a FAQ for textual file formats. In which case I would like to know the URL so I can link to it. It is better to come to one standard FAQ than to duplicate one another's work. In case you're wondering: no, I did not, at the time, know about the existence of Checker and others for Win32. Had I known I would have done a Java version instead.
Of course it eats lists of files. But not just any file. Its favorite meal consists of lists it itself has generated. Preferably with all the options on top. That way it knows all the necessary ingredients are present and no necessary vitamins have been left out.
One of the formats that seems to be popular is often termed the CSV list. Probably this refers to Comma Separated Value list. CSV lists originated, as far as I am aware, with at least WordStar for CP/M. That word processor used those lists to facilitate mail-merging form letters. WordStar CSV lists look like this:
In short they are in US-ASCII. Each line forms one entry for the mail merger to mail a form letter to. Each line is separated from the next by a CR/LF (CarriageReturn, LineFeed) pair. Each item in a line is separated from the next, if present, with a comma. An item may be numeric or it may be textual. Textual items are surrounded with quotes.
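The rules just listed happen to match a mode of Python's standard csv module, so they can be sketched in a few lines. This is only an illustration: the two sample records are invented for the occasion and are not the WordStar sample this page refers to, and the function name is hypothetical.

```python
import csv
import io

# Invented sample: quoted items are text, unquoted items are numbers,
# lines end in a CR/LF pair.
SAMPLE = '"Smith","John",42\r\n"Doe","Jane",37\r\n'

def read_quoted_csv(text):
    """Read a list in which quotes mark text and everything else is a number."""
    # QUOTE_NONNUMERIC makes the reader convert every unquoted item to float.
    reader = csv.reader(io.StringIO(text, newline=""),
                        quoting=csv.QUOTE_NONNUMERIC)
    return [row for row in reader]

rows = read_quoted_csv(SAMPLE)
print(rows)
```

Note that an unquoted item that is not a number makes this reader raise a ValueError, which is precisely the trouble with the deteriorated variants of the format.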
Alas, over time this format has deteriorated somewhat. An example of benign deterioration can be found in the checker_wysiwyg.txt file I stumbled across. Symptomatic of this light form of deterioration is the lack of quotes surrounding textual items that form one whole word. In other words:
Though this still looks quite reasonable, the venom is in its tail. What, exactly, makes a word? Normally a period ends a sentence and, by extension, the word it is placed after. Yet in this case it is clear that does not hold true for the item calanp01.jpg. Even so, if that is all then it is easily overcome.
Unfortunately, life is never simple. As may become evident from looking at our next example. This is taken from the SWA_Phantom.csv file that was found hidden in a posted zip file of the same name. Albeit with a '.zip' extension. The CSV file as presented by Scanners With Attitudes looks thus:
From looking at this snippet it becomes clear that the period still is treated like a normal alphabetic character. Furthermore, the space has now also been demoted from its traditional special status as a separating character used to delimit words to just another letter. At least, if we assume that Kim Page is to be treated as a single item and not as the item Kim followed by the item Page. Quite natural if you happen to be a warm blooded human being with the heart more or less in the right place. But alas, the Sheriff happens to be a program and full Artificial Intelligence was a bit much for a first release. The heart, by the way, is an option available at a slightly inflated price.
But wait, our trained eye has spotted quotes. Even double quotes as prescribed by the good old WordStar format. Unfortunately they have been put to wrong use. Granted it looks quite fine, but then, I have not shown you the ace up my sleeve yet. The quotes in this case surround the item 206,351. Quite reasonable considering that it has a comma in it. By surrounding the item with double quotes the author is telling us to treat it as a single item nonetheless. Methinks it is about time to show you that ace:
This happens to be the first line of that list. I think it is safe to assume that the author is trying to tell us something with it. Presumably that would be that the first item is the filename of the JPEG image. The second item its (file)size and the last item will contain the models portrayed by the image. Presumably we are also to deduce that only three items will ever be present. Probably meaning that if we happen to find a comma in the third item we must treat it as yet another regular letter. Again, quite reasonable if you happen to be a human. With the optional heart.
But another thing it tells us is that the second item is the filesize. Yet of all the items found in the file only one was surrounded by quotes. Namely the second one. The one supposed to stand for the size. Again, no problem if you happen to be a human with a lot of understanding. Unfortunately one of the things sorely lacking in the current release of the Sheriff. Like compassion. A somewhat less understanding program will look at the double quotes and think: text! And treat it thus. A good thing too since there are commas in them letters.
But let us cut to the chase and take a look at the venom in its tail. Not only is the filesize a number that will be treated as text, even were we to treat it as a number we would run into trouble. For there is a comma in that number. Depending on the international settings of the installed Windows that may or may not be correct. Fortunately the Sheriff will discard all non-numerals from items it treats as numbers. Even so another problem with the CSV format has been highlighted. To wit, sometimes the specifications will be applied in reverse.
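Discarding the non-numerals is simple enough to sketch in Python. This is an illustration of the idea only, not the Sheriff's own code, and the function name is made up.

```python
import re

def to_number(item):
    """Drop everything that is not a digit and read what is left as a number."""
    digits = re.sub(r"[^0-9]", "", item)
    # An item without any digit at all cannot be a size.
    return int(digits) if digits else None

print(to_number("206,351"))   # comma as thousands separator
print(to_number('"206,351"')) # the quotes are thrown out just the same
print(to_number("55.463"))    # period as thousands separator
```

The price of this approach is that a genuine decimal fraction would be read as a much larger number. For file sizes in bytes that does not matter, since they have no fraction.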
Another problem that alas was not present in this list was the case of the comma in the last item. Judging from that first line there are only supposed to be three items. The question remains whether it is possible to build sufficient artificial intelligence into the next release so that the Sheriff will be able to read and act accordingly. One can try, but one should not hold one's breath waiting. An example of the vaunted comma in a single text item can be found in the checkfile known as checker_simulator.csv. I think this particular selection is the most appropriate:
A real beauty, indeed.
Even though there is undoubtedly a lot more to tell about the Comma Separated Value file format, I am inclined to let it rest for a while. I think we have seen enough examples to agree with the acronym insofar as it has indeed commas between the values. At the very least. In short, it used to be a nice and clean format. Back in 1978. But I think today we can do without it. If you come across a CSV file you need to use, for now just change (most of) the commas in it to spaces. Double quotes are also to be considered harmful.
There is also a slight disadvantage to the usage of tab delimited lists. Which is that the tabs are all there is. The format, again as far as I am aware, has no provision to indicate whether a column is textual or numeric. The CSV format uses the double quotes to indicate text. All else are numbers. Since the tab delimited list does not have this, you can only make an educated guess. Fortunately the Sheriff is well educated. Indeed, as you might have guessed, the Sheriff has no problem with a diet consisting of tab delimited lists. An example of their use would be the list as posted by RonScan. But there is hardly a point in including it since HTML does not do tabs very well. And using tables would be self-defeating.
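What such an educated guess might look like can be sketched in Python. The heuristic below is an assumption made for illustration and not a description of what the Sheriff actually does. The sample is likewise hypothetical: two RonScan entries written out with tabs between the five fields.

```python
from itertools import zip_longest

NUMERIC_CHARS = set("0123456789.,")

def looks_numeric(values):
    """Guess 'numeric' when every non-empty value starts with a digit and
    holds nothing but digits and separators."""
    seen_value = False
    for value in values:
        value = value.strip()
        if not value:
            continue  # an empty cell tells us nothing
        seen_value = True
        if not value[0].isdigit() or not set(value) <= NUMERIC_CHARS:
            return False
    return seen_value

def guess_column_types(text):
    """Split on tabs and guess a type for each column."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    # Pad short rows so that every column gets looked at.
    columns = zip_longest(*rows, fillvalue="")
    return ["number" if looks_numeric(col) else "text" for col in columns]

SAMPLE = (
    "a-mlms02.jpg\t99094\t768 x 529\t524A21F0\tAnna Maria\n"
    "adalms01.jpg\t107409\t768 x 570\t9789260F\tAlexah Adams\n"
)

print(guess_column_types(SAMPLE))
```

It remains a guess. A CRC that happens to consist of decimal digits only looks exactly like a number, which is why the whole column is examined rather than a single line.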
This is not to say that it is perfect. For what is? It took me a while to find an example and realize that I had found it. It was only after the addition of a special tab delimited format recognizer that it became apparent. This problem holds for any delimited format so I shall provide an example using CSV, tabs being difficult to visualize.
This list, the RonScan 212 list, had a full complement of
information. To wit, filename, filesize, dimension, CRC and
description, the latter being the name of the model depicted. The
list in comma separated format looks like this:
a-mlms02.jpg, 99094, 768, x, 529, 524A21F0, , Anna Maria
adalms01.jpg, 107409, 768, x, 570, 9789260F, , Alexah Adams
adrlin01.jpg, 118276, 768, x, 562, 3878323E, , Rhonda Adams
In other words, the parts of the dimension have been separated and an
extra delimiter was inserted before the description. Now the Sheriff
has no problem with this when read as a text file, for then a tab
character is seen as so much whitespace. Since one or more spaces are
used to delimit fields it does not see anything unusual. But not so
with true tab delimited files. There it will read several fields, each
of which has been separated from the next by a tab character. The
fields it would expect in the above case are filename, filesize,
dimension, CRC and description. Unfortunately that is not what it
will get. The fields as implied by the format of the file would be
filename, filesize, width, a lone x, height, CRC, an empty field and
description.
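To make the mismatch concrete, here is a minimal Python sketch that splits the first sample line above on its commas and counts what comes out. The regroup function is a hypothetical repair added for illustration. It is not something the Sheriff is known to do.

```python
EXPECTED = ["filename", "filesize", "dimension", "crc", "description"]

LINE = "a-mlms02.jpg, 99094, 768, x, 529, 524A21F0, , Anna Maria"

def split_delimited(line, delimiter=","):
    """Split strictly on the delimiter, the way a delimited reader would."""
    return [field.strip() for field in line.split(delimiter)]

def regroup(fields):
    """Hypothetical repair: glue the dimension back together and drop the
    empty field, but only when the line has exactly this 8-field shape."""
    if len(fields) == 8 and fields[3] == "x" and fields[6] == "":
        name, size, width, _, height, crc, _, description = fields
        return [name, size, f"{width} x {height}", crc, description]
    return fields

fields = split_delimited(LINE)
print(len(EXPECTED), "fields expected,", len(fields), "fields found")
print(fields)
print(regroup(fields))
```

Splitting on runs of whitespace instead never notices the empty field, which is why the same list reads without complaint as a plain text file.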
A list generated by yours truly. Though LexHaring did post a Sheriff
list to Usenet, I seem to have missed it. This of course also
means that there is no guarantee that the shown CRCs are the true
ones.
Even though this diet is made by the Sheriff for the Sheriff it does
not mean that it is perfect. Its worst drawback is that spaces in the
filename will put it off. Now this can easily be solved by building in
a recognizer specialized in tab delimited lists. I might in fact even
do just that. One of these days. On the bright side, however, these
lists are easy to read. They can also be mailed or posted without
having to UUencode or MIMEify (AKA Base64 encoding) them first.
It will also read such a file as is. As well as anything vaguely
resembling it. Including CSV lists with the commas and quotes turned
into spaces or tabs. However, the more columns are present the better
it will be able to determine if a line is valid or not. To demonstrate
we can feed it the above list and tell it to expect nothing but the
mandatory filenames and filesizes. In this pathological case it would
generate the following wanted list:
the real meal
So show us a full meal, already! Ok, here it is:
Generated by: The JPEG Sheriff (beta release 15 April 1997 16:20)
Generated at: Do, 17 Apr 1997 07:26:37
Find it at: http://www.worldonline.nl/~iboa
Filename       Filesize       W x H CRC-32   Description
------------ ---------- ----------- -------- -----------------
lexrip01.jpg     55.463   450 x 400 D2A5E9B7 Anna Nicole Smith
lexrip02.jpg     60.514   450 x 400 162BC753
lexrip03.jpg     64.541   450 x 400 A9E3C6DC
lexrip04.jpg     58.010   450 x 400 B0069E1A
lexrip05.jpg     74.106   450 x 400 D714E567
lexrip07.jpg     39.550   450 x 400 53E3377C
lexrip08.jpg     43.281   450 x 400 9F66795B
lexrip09.jpg     52.126   450 x 400 30E68CE5
lexrip10.jpg     63.781   450 x 400 BEFB5E72
lexrip11.jpg     60.671   450 x 400 6B1471CA
lexrip12.jpg     68.684   450 x 400 97B06EF6
------------ ---------- ----------- -------- -----------------
Count of files: 11
Total of sizes: 640.727
Generated by: The JPEG Sheriff (beta release 18 April 1997 03:34)
Generated at: Vr, 18 Apr 1997 03:36:23
Generated as: wanted files list
Find it at: http://www.worldonline.nl/~iboa
Filename       Filesize Description
------------ ---------- ---------------------------------------
lexrip01.jpg     55.463 450 x 400 D2A5E9B7 Anna Nicole Smith
lexrip02.jpg     60.514 450 x 400 162BC753
lexrip03.jpg     64.541 450 x 400 A9E3C6DC
lexrip04.jpg     58.010 450 x 400 B0069E1A
lexrip05.jpg     74.106 450 x 400 D714E567
lexrip07.jpg     39.550 450 x 400 53E3377C
lexrip08.jpg     43.281 450 x 400 9F66795B
lexrip09.jpg     52.126 450 x 400 30E68CE5
lexrip10.jpg     63.781 450 x 400 BEFB5E72
lexrip11.jpg     60.671 450 x 400 6B1471CA
lexrip12.jpg     68.684 450 x 400 97B06EF6
------------ ---------- ---------------------------------------
Count of files: 11
Total of sizes: 640.727
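How a wanted list like the one above comes about can be sketched in Python. The parser below is an assumption about the general idea, not the Sheriff's real code: it takes the first two whitespace delimited words of a line as filename and filesize and hands everything after them to the description. Only the first three entries are included to keep it short.

```python
import re

LINES = """\
Filename Filesize Description
------------ ---------- ---------------------------------------
lexrip01.jpg 55.463 450 x 400 D2A5E9B7 Anna Nicole Smith
lexrip02.jpg 60.514 450 x 400 162BC753
lexrip03.jpg 64.541 450 x 400 A9E3C6DC
Count of files: 11
""".splitlines()

def parse_wanted(line):
    """Return (filename, filesize, description), or None for a line that
    does not carry a size in its second column."""
    parts = line.split(None, 2)  # at most three pieces, split on whitespace
    if len(parts) < 2:
        return None
    digits = re.sub(r"[^0-9]", "", parts[1])  # discard all non-numerals
    if not digits:
        return None
    description = parts[2].strip() if len(parts) > 2 else ""
    return parts[0], int(digits), description

entries = [entry for entry in map(parse_wanted, LINES) if entry]
total = sum(size for _, size, _ in entries)

for entry in entries:
    print(entry)
print("Count of files:", len(entries))
print("Total of sizes:", total)
```

With only two mandatory columns the one check available is whether the second word contains a digit. That is enough to skip the header, the ruler and the count line here, but it shows why more columns make for better validation.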