<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="index" title="Text File" newcontext="true">
<p>
From: Cameron Kaiser &lt;spectre@floodgap.com&gt;
Message-Id: &lt;200510130523.WAA09612@floodgap.com&gt;
Subject: [gopher] Re: New Gopher Wayback Machine Bot
In-Reply-To: &lt;20051013025233.GA26984@katherina.lan.complete.org&gt; from John
 Goerzen at &quot;Oct 12, 5 09:52:33 pm&quot;
To: gopher@complete.org
Date: Wed, 12 Oct 2005 22:23:09 -0700 (PDT)
</p>
<p>&gt; &gt; &gt; Cameron, floodgap.com seems to have some sort of rate limiting and keeps
&gt; &gt; &gt; giving me a Connection refused error after a certain number of documents
&gt; &gt; &gt; have been spidered.
&gt; &gt;
&gt; &gt; I&#x27;m a little concerned about your project since I do host a number of large
&gt; &gt; subparts which are actually proxied services, and I think even a gentle bot
&gt; &gt; going methodically through them would not be pleasant for the other side
&gt; &gt; (especially if you mean to regularly update your snapshot).
&gt;
&gt; Valid concern.  I had actually already marked your site off-limits
&gt; because I noticed that.  Incidentally, your robots.txt doesn&#x27;t seem to
&gt; disallow anything -- might want to take a look at that ;-)
</p>
<p>I know ;) it&#x27;s because Veronica-2 won&#x27;t harm the proxied services due to
the way it operates. However, I should be able to accommodate other bots that
may be around or come on board, so I&#x27;ll rectify this.
</p>
<p>&gt; &gt; I do support robots.txt, see
&gt; &gt;
&gt; &gt; 	gopher.floodgap.com/0/v2/help/indexer
&gt;
&gt; Do you happen to have the source code for that available?  I&#x27;ve got
&gt; some questions for you that it could explain (or you could), such as:
&gt;
&gt;  1. Which would you use?  (Do you expect URLs to be HTTP-escaped?)
&gt;
&gt;     Disallow: /Applications and Games
&gt;     Disallow: /Applications%20and%20Games
&gt;
&gt; 2. Do you assume that all Disallow patterns begin with a slash as they
&gt;    do in HTML, even if the Gopher selector doesn&#x27;t?
&gt;
&gt; 3. Do you have any special code to handle the UMN case where
&gt;    1/foo, /foo, and foo all refer to the same document?
&gt;
&gt; I will be adding robots.txt support to my bot and restarting it shortly.
</p>
<p>It does not understand URL escaping; it matches literal selectors only, so the
unescaped form is the one to use. In the case of #2/#3, well, maybe it would be
better just to post the relevant code.
It should be relatively easy to understand (in Perl, from the V-2 iteration
library). $psr is the persistent state hash reference, and key &quot;xcnd&quot; contains
a list of selectors generated from Disallow: lines with User-agent: veronica
or *.
</p>
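<p>To illustrate how a per-host list like the one under "xcnd" might be
assembled, here is a minimal sketch. It is not the V-2 code: the helper name
parse_exclusions, the sample robots.txt text, and the record handling (a blank
line ends a record, an empty Disallow is ignored) are all assumptions made for
illustration. Selectors are stored literally, with no URL unescaping.
</p>
<p>
```perl
use strict;
use warnings;

# Hypothetical helper (not from the V-2 library): collect Disallow selectors
# from robots.txt text for records addressed to "veronica" or "*".
# Selectors are kept literally; no URL unescaping is attempted.
sub parse_exclusions {
    my ($text) = @_;
    my %excludes;
    my $applies   = 0;    # true while inside a record that names us
    my $in_agents = 0;    # true while reading consecutive User-agent lines

    foreach my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*#/;            # skip comment lines
        if ($line =~ /^\s*$/) {              # a blank line ends the record
            $applies   = 0;
            $in_agents = 0;
            next;
        }
        if ($line =~ /^User-agent:\s*(.*?)\s*$/i) {
            my $agent = lc $1;
            $applies   = 0 unless $in_agents;    # first agent of a new record
            $in_agents = 1;
            $applies   = 1 if $agent eq '*' or $agent eq 'veronica';
        }
        elsif ($line =~ /^Disallow:\s*(.*?)\s*$/i) {
            my $sel = $1;
            $in_agents = 0;
            $excludes{$sel} = 1 if $applies and length $sel;
        }
    }
    return \%excludes;
}

# Made-up robots.txt content, for illustration only.
my $robots = join "\n",
    'User-agent: veronica',
    'Disallow: /Applications and Games',
    'Disallow: /proxy/',
    '',
    'User-agent: otherbot',
    'Disallow: /private/',
    '';

my $excludes = parse_exclusions($robots);
print "$_\n" for sort keys %$excludes;
```
</p>
<p>Run as-is, this prints the two selectors from the veronica record and skips
the one addressed to otherbot.
</p>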
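<p>        # filter on exclusions: %excludes holds the Disallow selectors for this host
        my %excludes = %{ $psr-&gt;{&quot;$host:$port&quot;}-&gt;{&quot;xcnd&quot;} };
        my $key;
        # check the shortest selectors first
        foreach $key (sort { length($a) &lt;=&gt; length($b) } keys %excludes) {
                # excluded on an exact match, on a match once a trailing slash
                # is added to $sel, or when a key ending in &quot;/&quot; is a prefix of $sel
                return (undef, undef, undef, undef, undef,
                                &#x27;excluded by robots.txt&#x27;, 1)
                        if ($key eq $sel || $key eq &quot;$sel/&quot; ||
                                ($key =~ m#/$# &amp;&amp;
                                substr($sel, 0, length($key)) eq $key));
        }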
</p>
<p>As you can see from this, 1/foo, /foo, and foo would each need to be specified
separately, since the match is on the literal selector and other servers might
not treat them the same.
</p>
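<p>To make that concrete, here is a small standalone sketch of the same test.
The function name is_excluded and the sample selectors are made up for
illustration; only the comparison logic is taken from the excerpt above, and
this is not how the V-2 library packages it.
</p>
<p>
```perl
use strict;
use warnings;

# Hypothetical standalone version of the test in the excerpt above:
# returns 1 when $sel is excluded by any key of %$excludes, else 0.
sub is_excluded {
    my ($sel, $excludes) = @_;
    foreach my $key (sort { length($a) - length($b) } keys %$excludes) {
        # exact match, or match once a trailing slash is added to $sel
        return 1 if $key eq $sel or $key eq "$sel/";
        # a key ending in "/" also excludes everything beneath it
        return 1 if $key =~ m#/$# and substr($sel, 0, length($key)) eq $key;
    }
    return 0;
}

# Only the "/foo" spelling is listed here (made-up sample data).
my %excludes = ('/foo' => 1, '/archive/' => 1);

foreach my $sel ('/foo', 'foo', '1/foo', '/archive/2005/index', '/archive') {
    printf "%-22s %s\n", $sel,
        is_excluded($sel, \%excludes) ? 'excluded' : 'allowed';
}
```
</p>
<p>With only /foo listed, the selectors foo and 1/foo come out as allowed, so a
UMN-style server would need all three spellings on their own Disallow lines.
</p>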
<p>--
---------------------------------- personal: http://www.armory.com/~spectre/ --
 Cameron Kaiser, Floodgap Systems Ltd * So. Calif., USA * ckaiser@floodgap.com
-- An apple every eight hours will keep three doctors away. -------------------
</p>
</card>
</wml>
