From: Cameron Kaiser
Message-Id: <200405310826.BAA09944@floodgap.com>
Subject: [gopher] Re: Cicada Incomplete Gopher Census
In-Reply-To: <20040530230758.GA27407@nerds.cs.umd.edu> from Tim Fraser at "May 30, 4 07:07:59 pm"
To: gopher@complete.org
Date: Mon, 31 May 2004 01:26:46 -0700 (PDT)

> > After the V-2 cleanup this weekend, it has pared itself down to
> > 255 unique hosts and a database of about 1.8 million selectors.
>
> OK, I found only 154, so I clearly have a bug. My selector counts
> seem very low, too.
> I'm not sure it's worth debugging given that the
> floodgap index is updating again, but just in case I get bored: my
> spider is supposed to follow only selectors with type 1 or 11. Are
> there other directory types that I should follow?

Besides the fact that '11' per se isn't an itemtype (it's just a '1'), no,
that's all this robot follows. I have the advantage of having had an old
partially filled hosts database to iterate through, so my host list fans
out faster than if it were left to discover hosts entirely unaided.

> How does floodgap's Veronica-2 spider limit the load it places on
> sites? Does it check for a robots.txt file, or some similar mechanism?

Yes (see gopher://gopher.floodgap.com/0/v2/help/indexer ), and it also has
a methodology for rotating through a list of hosts it's working through,
trying not to bang on any one host much more than a couple times per minute
at the very most.

I'm almost complete with tuning changes and the indexer probably will be
released again sometime tomorrow afternoon.

-- 
---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- "Another day, another dangling modifier" -----------------------------------
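
For anyone writing a similar spider, here is a rough sketch of the two ideas discussed above: following only itemtype 1 entries, and rotating through hosts so that no single server is hit too often. This is not the Veronica-2 code. The names parse_menu_line, directories and HostRotation are made up for illustration, and the 30-second default is only one reading of "a couple times per minute". The itemtype is taken to be the single first character of a menu line, which is why '11' is not a type of its own. One place a '11' can plausibly show up is a URL such as gopher://host/11/docs, where the first '1' is the itemtype and the rest is the selector. robots.txt handling is left out.

```python
from collections import deque


def parse_menu_line(line):
    """Split one Gopher menu line into (itemtype, display, selector, host, port).

    The itemtype is the single first character; the remaining fields are
    tab-separated. Returns None for the terminating "." or malformed lines.
    """
    line = line.rstrip("\r\n")
    if not line or line == ".":
        return None
    itemtype, rest = line[0], line[1:]
    fields = rest.split("\t")
    if len(fields) < 4:
        return None
    display, selector, host, port = fields[:4]
    try:
        port = int(port)
    except ValueError:
        return None
    return itemtype, display, selector, host.lower(), port


def directories(menu_text):
    """Yield (host, port, selector) for every itemtype-1 entry in a menu."""
    for line in menu_text.splitlines():
        item = parse_menu_line(line)
        if item and item[0] == "1":
            _, _, selector, host, port = item
            yield host, port, selector


class HostRotation:
    """Round-robin over hosts with a minimum gap between hits on one host."""

    def __init__(self, min_interval=30.0):
        self.min_interval = min_interval  # seconds; assumed value
        self.queues = {}                  # host -> deque of pending selectors
        self.order = deque()              # hosts that still have pending work
        self.last_hit = {}                # host -> time of the last request

    def add(self, host, selector):
        if host not in self.queues:
            self.queues[host] = deque()
            self.order.append(host)
        self.queues[host].append(selector)

    def next_request(self, now):
        """Return (host, selector) for the next host that is due, else None."""
        for _ in range(len(self.order)):
            host = self.order[0]
            self.order.rotate(-1)  # move this host to the back of the rotation
            if now - self.last_hit.get(host, float("-inf")) < self.min_interval:
                continue           # hit too recently; try the next host
            queue = self.queues[host]
            selector = queue.popleft()
            self.last_hit[host] = now
            if not queue:          # nothing left for this host
                del self.queues[host]
                self.order.remove(host)
            return host, selector
        return None
```

A crawler loop would feed each fetched menu through directories(), call add() for every result, and sleep briefly whenever next_request(time.monotonic()) returns None.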