From: Cameron Kaiser
Subject: [gopher] Veronica-2 again, and one last robots.txt argument
To: gopher@complete.org
Date: Tue, 24 Jun 2003 22:39:42 -0700 (PDT)

The new crawler just took its first step tonight by taking a harnessed walk
around gopher.floodgap.com. It reads and understands robots.txt files (the
User-agent is veronica, or *), correctly traverses trees, and generates sane
indexes. Loop protection and auto-pruning will get tested later.

Revisiting robots.txt for a bit, the current logic has the following
consequences.

* If you Disallow: / in your robots.txt file, not only will your site not be
  indexed, but its very existence will not even be registered in the
  statistics table (and consequently it will not appear on the master list
  of servers).
* Disallow: intentionally says nothing about the itemtype, both because this
  is selector-oriented, and because at least one person here (John) wanted as
  much overlap as possible between the Web and gopher robots.txt files, so
  that one filesystem can be presented both ways and the robots.txt
  understood by both V-2 and any web robots. The consequence is this: any
  gopher server that requires an "internal" itemtype to be transmitted back
  to it (URLs like x.yz.com:70/11/something where the actual selector is
  1/something) MUST include this in the Disallow: block (e.g., for this
  example, Disallow: 1/something).
* Disallow: /path/ works for both /path and /path/ (not substrings of same).

If this will cause trouble for people, advise ASAP. I'm planning to unleash
the crawler sometime in the next week or two.

-- 
---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- If you want divine justice, die. -- Nick Seldon ----------------------------
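
The Disallow: matching described in the bullets above can be sketched roughly
as follows. This is only an illustration in Python, not the Veronica-2 code.
The function name is_disallowed is hypothetical, and two points are
assumptions: that "not substrings of same" means a rule must match on a whole
path component, and that everything beneath a disallowed path is also blocked
(borrowed from the usual Web robots.txt convention, which the message does not
spell out).

```python
from typing import Iterable


def is_disallowed(selector: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if a gopher selector is blocked by any Disallow: value.

    Hypothetical sketch: rules are compared against the selector exactly as
    the server expects to receive it, so a server that needs an internal
    itemtype in the selector must list it that way (e.g. "1/something").
    """
    for rule in disallow_rules:
        if not rule:
            # Assumption: an empty Disallow: value blocks nothing.
            continue
        base = rule.rstrip("/")
        if base == "":
            # "Disallow: /" blocks the whole server.
            return True
        # "/path/" and "/path" both reduce to "/path"; match the path itself
        # or anything below it, but not a longer name such as "/pathology".
        if selector == base or selector.startswith(base + "/"):
            return True
    return False


if __name__ == "__main__":
    rules = ["/path/", "1/something"]
    for sel in ["/path", "/path/", "/pathology", "1/something", "/something"]:
        print(sel, is_disallowed(sel, rules))
```

Note that with these rules the selector /something is not blocked, since only
the itemtype-prefixed form 1/something was listed.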