Date: Fri, 28 Dec 2007 01:23:39 -0600
From: brian@pongonova.net
To: gopher@complete.org
Subject: [gopher] Improved binary file detection in Bucktooth 0.2.2
Message-ID: <20071228072339.GA25327@pongonova.net>

I'm using buckd to serve up binary files, and noticed that several of them (mostly older PDFs with a lot of text in the file header) were being identified as item type "0" rather than "9".

It turns out that buckd uses the Perl -B operator to detect binary files. Perl examines the first block or so of the file for certain characteristics (NUL bytes, high-order bits set, etc.), and if more than about 30% of those bytes look odd, it identifies the file as binary.
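To make the failure mode concrete, here is a rough sketch of a -B style check. It is an approximation for illustration only, not Perl's actual implementation: the name looks_binary(), the 512-byte sample size, and the exact definition of an "odd" byte are all assumptions, and the 30% threshold is simply the figure quoted above.

```perl
use strict;
use warnings;

# Rough approximation of a -B style check; NOT Perl's real implementation.
# looks_binary(), the 512-byte sample and the "odd byte" class are
# assumptions made for this sketch.
sub looks_binary {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return 0;
    my $got = read($fh, my $buf, 512);    # sample only the start of the file
    close($fh);
    return 0 unless $got;                 # empty or unreadable: treat as text
    return 1 if $buf =~ /\0/;             # any NUL byte in the sample: binary
    # Count bytes that are neither printable ASCII nor common whitespace
    # (control codes, DEL, and anything with the high bit set).
    my $odd = () = $buf =~ /[^\t\n\r\f\x20-\x7e]/g;
    return $odd / length($buf) > 0.30 ? 1 : 0;
}
```

With a check of this shape, a file whose first 512 bytes are plain ASCII is reported as text no matter what follows, which is consistent with text-heavy PDF headers ending up as item type "0".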
That heuristic wasn't accurate enough for my purposes, so I modified buckd.in so that it calls the UNIX "file" command and greps its output for the string "text" (guaranteed to be returned if a file is identified as a text file).

I just want to emphasize that this is *not* a problem with Bucktooth, but rather an issue with Perl.

Here's the patchfile with the change. I opted to modify buckd.in and simply regenerate buckd.

--- buckd.in	2007-12-28 01:21:30.000000000 -0600
+++ buckd.in.new	2007-12-28 01:20:58.000000000 -0600
@@ -289,7 +289,7 @@
 		($xentr =~ /\.jpe?g$/i) ? "I" :
 		($xentr =~ /\.html?$/i) ? "h" :
 		($xentr =~ /\.hqx$/i) ? "4" :
-		(-B $xentr) ? "9" :
+		(grep(!/text/, `file $xentr`)) ? "9" :
 		"0";
 	$xentr =~ s/^$DIR//;
 	return ($itype, ($pentr eq $xentr) ? '' : $xentr);

--Brian
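One possible variation on the patch above, offered as a sketch rather than as part of Bucktooth: the backtick form passes $xentr through the shell, and file normally echoes the file name in its output, so a path containing spaces, shell metacharacters, or the word "text" could give a wrong answer. The helper below, is_text_by_file(), is a hypothetical name; it assumes a file(1) that accepts -b (brief output, no file name) and "--" to end option parsing.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bucktooth. Assumes file(1) supports
# -b (omit the file name from the output) and "--" (end of options).
sub is_text_by_file {
    my ($path) = @_;
    return 0 unless -f $path;    # skip missing paths and non-regular files
    # List-form pipe open runs file(1) directly, without a shell, so
    # spaces or metacharacters in $path are passed through untouched.
    open(my $ph, '-|', 'file', '-b', '--', $path) or return 0;
    my $desc = do { local $/; <$ph> };    # slurp the whole description
    close($ph);
    return (defined $desc && $desc =~ /text/) ? 1 : 0;
}
```

Under those assumptions, the changed line in the patch could read (!is_text_by_file($xentr)) ? "9" : instead of using grep on the backtick output.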