From: Cameron Kaiser
Subject: [gopher] Re: Improved binary file detection in Bucktooth 0.2.2
To: gopher@complete.org
Date: Fri, 28 Dec 2007 05:49:59 -0800 (PST)

> I'm using buckd to serve up binary files, and noticed that several
> binary files (mostly older PDFs with a lot of text in the file header)
> were being identified as item type "0" rather than "9". It turns out
> that buckd uses the Perl -B operator to determine binary files. To do
> this, it examines some number of bytes in the file header for certain
> characteristics (nul bytes, high-order bits set, etc.) and if that
> number of bytes exceeds 30%, Perl identifies it as a binary file.
>
> This wasn't accurate enough for my purposes, so I modified buckd.in so
> that it calls the UNIX "file" command and greps for the string "text"
> (guaranteed to be returned if a file is identified as a text file).

The other thing I might do is just expand the number of file extensions
Bucktooth recognizes and generates item types for, since -B is the
fall-through case and there will always be datasets falling in the tails
of the bell curve. The 'file' command approach is ingenious but does of
course have performance implications (though since I'm running out of inetd
I'm obviously not that concerned with performance ;-).

I appreciate the heads-up.

-- 
------------------------------------ personal: http://www.cameronkaiser.com/ --
  Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckaiser@floodgap.com
-- Life isn't fair. But having the root password helps. -----------------------
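
For readers following the thread, the extension-first idea with a content heuristic as the fall-through case might look roughly like the sketch below. It is written in Python purely for illustration and is not Bucktooth's code: the extension table, the 512-byte sample size, and the names `EXTENSION_ITEM_TYPES`, `looks_binary`, and `item_type` are all assumptions, and the heuristic only loosely approximates what Perl's -B does (a nul byte, or too large a share of control or high-bit bytes, marks the sample as binary).

```python
import os

# Hypothetical extension table; Bucktooth's actual list is not reproduced here.
# Values are gopher item types: "0" text, "9" binary, "g" GIF, "I" image, "h" HTML.
EXTENSION_ITEM_TYPES = {
    ".txt": "0",
    ".pdf": "9",
    ".zip": "9",
    ".gz": "9",
    ".gif": "g",
    ".jpg": "I",
    ".png": "I",
    ".html": "h",
}

SAMPLE_SIZE = 512      # assumed number of leading bytes to inspect
ODD_THRESHOLD = 0.30   # the "30%" figure mentioned in the quoted message

# Control bytes that commonly appear in ordinary text files.
TEXT_CONTROL_BYTES = b"\t\n\r\x08\x0c\x1b"


def looks_binary(sample: bytes) -> bool:
    """Rough stand-in for a -B style check on the leading bytes of a file."""
    if not sample:
        return False  # assumption: treat an empty file as text
    if b"\x00" in sample:
        return True   # any nul byte marks the sample as binary
    odd = sum(
        1
        for byte in sample
        if byte >= 0x80 or (byte < 0x20 and byte not in TEXT_CONTROL_BYTES)
    )
    return odd / len(sample) > ODD_THRESHOLD


def item_type(path: str) -> str:
    """Pick an item type by extension first; sniff content only as a fallback."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSION_ITEM_TYPES:
        return EXTENSION_ITEM_TYPES[ext]
    with open(path, "rb") as handle:
        sample = handle.read(SAMPLE_SIZE)
    return "9" if looks_binary(sample) else "0"
```

With a table like this, a text-heavy PDF is typed "9" from its name alone and the heuristic never sees it; only files with unrecognized extensions reach the content check, and no external process is spawned.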