Bots section
fra_00xx 98xxxx deep 1000 BO PC
Perl is a good language to learn - fairly straightforward, quick, very powerful and ideal for bots, CGI and the net generally! I hope that +fravia will publish this as part of the botstart section and that the bot section will start botting.
You will need:
- Perl (standard on Linux and freely available) and various Perl modules (small, free downloads)
- net access
- a text editor
- Linux (not absolutely necessary, but a far superior, free and real operating system)
What can I say about Perl? It's a good language to learn. Virtually all CGI is done in Perl, but it's good for virtually anything you'd care to do, and it lets you develop applications very quickly. I'm not yet that experienced at Perl - this is my first 'real' app and I'm certain that this bot is not written at all well, but it is written. Perhaps that's the best thing about Perl: it enables you to do things that would not otherwise be possible. The CPAN repository on the net holds vast quantities of free Perl code to do almost anything you could ever wish - but you have to be able to use Perl. You will need to download at least the LWP module (it stands for libwww-perl) from CPAN for this or almost any Perl bot to work.
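If LWP isn't already on your system, the CPAN module that ships with Perl can fetch and build it for you - a sketch, assuming net access (the first run will ask some configuration questions):

perl -MCPAN -e 'install Bundle::LWP'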
There are many Perl bots available on the net, but I'm fairly certain you will not find one that does exactly what you want. There's also a convention among bot writers not to give bots to people who do not understand them - it's considered irresponsible. Of course, once you've learned how to build bots, you can be as irresponsible as you like. What all this means is that you have to learn to appreciate bots and Perl, or you don't deserve them. Don't worry, it's easy enough - just a little effort.
Please note that this is not good Perl code and I am not a programmer. This bot shows that to start using Perl you only need to understand a little. The accepted approach for newbies writing Perl5 applications is to get them working first, then improve on them if necessary.
Here's a very simple web retrieval bot I've written that retrieves many web pages from a single site. This bot is fairly limited in what it can achieve (and bots can do far more than download web pages), but you are free to add any functionality you like - just write the code.
hcuBOT/0.2 is written as a Linux application - it will need work to run on windoze (I recommend installing Linux;). To use Perl under windoze you'll need to download ActivePerl (about 1.5 meg).
Perl helps development all the way with excellent error messages. You can write it cryptically or you can write it simply; I'm going to write it simply until I learn more, and I hope this code is fairly clear. Use 'use diagnostics' and the -w switch only while developing - they can cause strange messages to be sent to servers. If something doesn't work, try it a slightly different way. I tend to use print statements to identify where the code fails, which works well, but there's also a very good debugger built in.
There are notes after the source to explain what's happening.
#!/usr/bin/perl
# -w
# use diagnostics;

use LWP::RobotUA;
use HTML::Parser;
use URI::URL;
use POSIX;
use DB_File;

my $url;
my $arg = (shift @ARGV);
my $domain_name = "http://".$arg."/";
my @get_list = $domain_name;
local (%main, %localise);
local $counter = 0;                 # counts local files
my $maxcount = 100;
my $dirname = $arg;

# subclass package ParseLink based on Randal L. Schwartz's ~ see
# http://www.stonehenge.com/merlyn/WebTechniques/col07.html
{
    package ParseLink;
    @ISA = qw(HTML::Parser);

    sub start {                     # called by parse
        my $this = shift;
        my ($tag, $attr) = @_;
        if ($tag eq "a") {
            $this->{links}{$attr->{href}}++;
        }
    }

    sub get_links {
        my $this = shift;
        sort keys %{$this->{links}};
    }
}

change_dir($arg);

tie(%main, DB_File, 'main-sdbm', O_RDWR | O_CREAT, 0666)
    || die "$0: tie() failed : $!\n";
tie(%localise, DB_File, 'local-sdbm', O_RDWR | O_CREAT, 0666)
    || die "$0: tie() failed : $!\n";

$ua = new LWP::RobotUA 'hcuBOT/0.2', 'jclinton@whitehouse.gov';
$ua->delay(0.01);

while (($url = shift @get_list) && ($counter < $maxcount)) {
    # $req and $res are globals so that extract_hyperlinks() can see $res
    $req = new HTTP::Request 'GET', $url;
    $res = $ua->request($req);
    # uncomment for request headers
    # print "\$req->as_string is\n"; print $req->as_string;
    # uncomment for ALL response
    # print "\$res->as_string is\n"; print $res->as_string;
    if ($res->is_error()) {
        printf "%s\n", $res->status_line;
        next;
    } else {
        save_html($url, $res->content);
        extract_hyperlinks();
    }
}

edit_main_hash();
localise_hyperlinks();

sub change_dir {
    local ($domain) = @_;
    chdir();                        # to user's home dir
    if (! (-d $domain)) {           # make dir beneath user's home dir
        mkdir($domain, 0777)
            or die "$0: Unable to create directory $domain: $!\n";
    }
    chdir($domain) or die "$0: Unable to chdir to $domain : $!\n";
    return 0;
}

sub save_html {
    my ($url, $data) = @_;
    $counter++;
    open(FILE, ">$counter.bot")
        or die "$0: Unable to save ", $url, " as ", $counter, ".bot $!\n";
    print FILE $data;
    close FILE;
    $main{$url} = "$counter\.bot";  # %main hash entry for $url to local filename
    return 0;
}

sub extract_hyperlinks {
    my $base = $res->base;
    my $p = ParseLink->new;
    $p->parse($res->content);
    $p->parse(undef);
    for $link ($p->get_links) {
        my $abs = url($link, $base)->abs;
        if (exists $main{$abs}) {next;}       # already queued or retrieved
        if ($abs !~ /$domain_name/o) {next;}  # outside domain
        if ($abs !~ /.*htm.?$/ois) {next;}    # not terminating with string *htm*
        # if ($abs =~ /#/o) {next;}           # containing any anchor
        push(@get_list, $abs);
        print "Selected $abs for retrieval\n";
        $main{$abs} = "";                     # only queue doc once
        $localise{$link} = $abs;              # for localising links
    }
}

sub localise_hyperlinks {
    # not really sure about this subroutine
    my @files = glob("*.bot");                # grep directory
    foreach $file (@files) {
        open(READFILE, "<$file")
            or die "$0 : Unable to open $file for reading: $!\n";
        @document = <READFILE>;
        close READFILE;
        foreach $line (@document) {
            if (($match) = ($line =~ /<A[^>]+?HREF\s*=\s*["']?([^'" >]+?)['"]?>/gio)) {
                if (defined $main{$match}) {
                    $line =~ s/$match/$main{$match}/;
                } elsif (($localise{$match}) && ($main{$localise{$match}} ne "")) {
                    $line =~ s/$match/$main{$localise{$match}}/;
                } elsif ($localise{$match}) {
                    $line =~ s/$match/$localise{$match}/;
                }
            }
        }
        open(WRITEFILE, ">$file")
            or die "$0 : Unable to open $file for writing: $!\n";
        print WRITEFILE @document;
        close WRITEFILE;
    }
}

sub edit_main_hash {
    # this sub's purpose is to edit the main hash so that it only contains
    # key-value pairs for the mirror function when program is next run.
    # Mirror function not yet implemented.
    my @keys = keys %main;
    foreach $key (@keys) {
        my $value = $main{$key};
        if ($value eq "") {
            delete $main{$key};
        }
    }
}

__END__
As the mirroring function has yet to be implemented, sub edit_main_hash is for now somewhat redundant. The sdbm storage of data to disk while the program is executing does, however, reduce the program's memory use.
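To try the bot, save the source as, say, hcubot.pl (a filename I've made up for this example) and pass it a bare host name - the script builds the http:// url itself:

perl hcubot.pl www.example.com

It changes to a directory named after the site beneath your home directory and saves each page there as 1.bot, 2.bot and so on, stopping after $maxcount pages.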
The bot replaces a browser, sending requests for web pages and receiving responses. It can even pretend to be a browser - any browser you like. This line
$ua = new LWP::RobotUA 'hcuBOT/0.2','jclinton@whitehouse.gov';
identifies the bot as hcuBOT/0.2, while the jclinton... is the email address the server administrator should contact if your bot screws up her server - she'll send you an awfully polite email. So to pretend to be a particular browser, you would replace hcuBOT/0.2 with something like "Mozilla/3.1". You'll have to check the actual string that the browser sends.
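In code, that swap is one extra line - a sketch, assuming the $ua created above ($ua->agent() overwrites the string LWP sends as User-Agent):

$ua->agent('Mozilla/3.1');    # server now sees Mozilla/3.1, not hcuBOT/0.2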
hcuBOT/0.2 sends a GET command to the server. It says that it wants particular web pages by saying GET this url, with the url of the document that you're after. There are other methods - HEAD, POST and a few others - and on top of them LWP provides a mirror function. Mirror compares the document on the server with your local document; if the server's document is newer, that document is retrieved. Mirror works by sending a request with an If-Modified-Since (date/time of your document) header.
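Here's a minimal sketch of both, assuming LWP is installed (the url and the local filename oracle.html are just examples):

use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

# GET always transfers the whole document
my $res = $ua->request(HTTP::Request->new(GET => 'http://www.oracle.com/'));

# mirror() sends If-Modified-Since based on oracle.html's timestamp and
# only saves the document if the server's copy is newer
my $mres = $ua->mirror('http://www.oracle.com/', 'oracle.html');
print $mres->status_line, "\n";    # "304 Not Modified" when nothing changed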
Let's take a look at some headers that hcuBOT/0.2 works with.
GET http://www.oracle.com/            # Here's the request header
From: jclinton@whitehouse.gov
User-Agent: hcuBOT/0.2

HTTP/1.1 200 OK                       # Here's the response header, that we
Cache-Control: public                 # get back from the server
Date: Thu, 20 Jul 1999 20:18:19 GMT
Accept-Ranges: bytes
Server: Oracle_Web_Listener/4.0.7.1.0EnterpriseEdition
Allow: GET, HEAD
Content-Length: 12723
Content-Type: text/html
ETag: "8ef7c2d83beac682e5b0bb90ecc3791a"
Last-Modified: Thu, 20 Jul 1999 16:31:27 GMT
Client-Date: Thu, 20 Jul 1999 23:28:07 GMT
Client-Peer: 205.207.44.16:80
Title: Oracle Corporation - Home
X-Meta-Description: Oracle Corp. (Nasdaq: ORCL) is the world's leading supplier of software for enterprise information management.
X-Meta-Keywords: database,software,Oracle,Oracle8i,relational server, server,application,tools,decision support tools,internet,internet computing, CRM,customer relationship management,e-business,PL/SQL,XML,Year 2000,Euro, Java, technology

<html>                                # and the html document requested with a GET starts here.
Quite a whopper, that response header - they're not normally that big. The request on this one is simple: it's jclinton@whitehouse.gov saying GET http://www.oracle.com/ using User-Agent: hcuBOT/0.2.
The important part of the response is the first line "HTTP/1.1 200 OK".
Hypertext Transfer Protocol (HTTP) will be either 1.1 or 1.0. Version 0.9 only supports the GET method and, as far as I'm aware, is not used now. 1.0 supports GET, HEAD, POST, PUT, DELETE, LINK and UNLINK; 1.1 supports a few extra methods. The Allow line in this header says that the server will accept HEAD and GET requests.
An important part is the response code. We want response code 200, as shown here, which is the server replying "OK, here's the document you asked for". Response codes 100 to 199 are informational and rarely seen. 200-299 mean the request was successful, though that doesn't always mean you'll get the document. 300-399 are redirections, which can cause a bit of trouble. 400 is bad request (a syntax error in the request header) and 404 is document not found - just like when you click on a stale link; 400-499 you don't want. Server errors are the 500 range, which you don't want either. 500 is internal server error - one that you don't want but will get often.
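LWP's response object has predicates for these ranges, so a bot doesn't need to match the numbers itself - a minimal sketch:

if    ($res->is_success)  { print "2xx - here's your document\n"; }
elsif ($res->is_redirect) { print "3xx - gone to ", ($res->header('Location') || '?'), "\n"; }
elsif ($res->is_error)    { print "4xx or 5xx - ", $res->status_line, "\n"; }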
Here's a request header with a referer. It's saying "I want http://www.oracle.com/html/custcom.html, I got this url from http://www.oracle.com/".
$request->as_string is
GET http://www.oracle.com/html/custcom.html
From: jclinton@whitehouse.gov
Referer: http://www.oracle.com/
User-Agent: hcuBOT/0.2
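Setting the referer on a request is one line ($req->referer() is the header accessor LWP provides; the urls are from the example above):

use HTTP::Request;

my $req = HTTP::Request->new(GET => 'http://www.oracle.com/html/custcom.html');
$req->referer('http://www.oracle.com/');   # adds the Referer header
print $req->as_string;                     # prints the header block above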
hcuBOT/0.2 uses the LWP (libwww-perl) module, a predefined library of code written by Gisle Aas that deals with net protocols. To write a bot in C++, for example, you would likewise pull in a networking library with an include. The program calls on functions in these stored libraries, and LWP relieves the programmer (that's me or you) of sockets programming. A socket is how you program the net - you read from and write to a socket much as you would a file, except that it's more complex.
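To get a feel for what LWP hides, here's a rough sketch of the same GET done directly on a socket (IO::Socket comes with Perl; the host is just an example):

use IO::Socket::INET;

# open a TCP connection to the web server's port 80
my $sock = IO::Socket::INET->new(
    PeerAddr => 'www.oracle.com',
    PeerPort => 80,
    Proto    => 'tcp',
) or die "connect failed: $!\n";

# write the request to the socket...
print $sock "GET / HTTP/1.0\r\nHost: www.oracle.com\r\n\r\n";

# ...and read the response back, headers and all
print while <$sock>;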
hcuBOT/0.2 uses LWP::RobotUA. Robot User Agent is an appropriate module for web robots and is often called 'polite' because it's careful not to annoy servers. It is 'polite' by identifying itself to the server with a contact email address, following the robots exclusion standard and by delaying requests to the server. The delay, however, defaults to one minute, which is far too long for today's servers.
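Note that delay() is set in minutes, so a one-second pause between requests looks like this:

$ua->delay(1/60);    # one second; RobotUA's delay() takes minutes, not seconds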
Other LWP modules that can be used instead of RobotUA are LWP::Simple for 'simple' applications, LWP::UserAgent ~ the parent class of RobotUA, which lacks the polite features ~ and LWPng, 'the next generation', which will replace LWP. See the lwpcookbook included with LWP for examples and usage.
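For quick one-off jobs, LWP::Simple reduces a page fetch to a single call (a sketch; the url is just an example):

use LWP::Simple;

# get() returns the document as one string, or undef on failure
my $content = get('http://www.oracle.com/');
defined $content or die "fetch failed\n";

# getprint() fetches a url and prints it to STDOUT in one go
getprint('http://www.oracle.com/');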
This is how hcuBOT/0.2 works.
Perl is not the only language in which to write bots.
You can install Linux on your Windoze machine - you know you want to.
You could try something like '+Perl +tutorial' or '+Perl +robot +tutorial' at altavista.
See lwp-rget - an example web download bot that comes with LWP.