Working with Structured Data

  • Because Perl is so good at working with text and patterns, it's ideal for working with structured textual data. In fact, "Perl" is commonly expanded as "Practical Extraction and Report Language."
  • Sample: Apache httpd log. Here are a few lines:
    205.188.116.212 - - [03/Oct/2004:18:16:53 -0700] "GET /viewimage.php3?ListingID=258 HTTP/1.1" 200 4131
    205.188.117.13 - - [03/Oct/2004:18:16:54 -0700] "GET /endorlr.jpg HTTP/1.1" 200 12215
    205.188.116.5 - - [03/Oct/2004:18:16:54 -0700] "GET /endorbath.jpg HTTP/1.1" 200 11052
    205.188.116.6 - - [03/Oct/2004:18:16:54 -0700] "GET /endordine.jpg HTTP/1.1" 200 15290
    205.188.117.20 - - [03/Oct/2004:18:18:06 -0700] "GET /helpful.html HTTP/1.1" 200 2735
    172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET / HTTP/1.1" 200 4074
    172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET /kingsley.css HTTP/1.1" 200 1854
    172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET /kingsley.gif HTTP/1.1" 200 9728
    172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET /mls.gif HTTP/1.1" 200 4481
    172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET /equal.gif HTTP/1.1" 200 935
    
  • Interesting fields are:
    1. Requestor IP address
    2. Date (which has sub-fields)
    3. Request, notably file requested (another sub-field)
    4. Return status
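As a rough illustration of the four fields above, here is a minimal sketch that pulls them out of one log line with a single regular expression. It assumes every line follows the same layout as the sample lines; the sub name parse_line and the hash keys are hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of the interesting fields,
# or an empty list if the line does not match the assumed layout.
sub parse_line {
    my ($line) = @_;
    # IP, two ignored fields, [date], "METHOD file PROTOCOL", status
    if ($line =~ m/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})/) {
        return (ip => $1, date => $2, method => $3, file => $4, status => $5);
    }
    return;
}

# One of the sample lines from the log excerpt
my $sample = '205.188.117.13 - - [03/Oct/2004:18:16:54 -0700] '
           . '"GET /endorlr.jpg HTTP/1.1" 200 12215';

my %f = parse_line($sample);
print "$f{ip} requested $f{file} on $f{date} (status $f{status})\n";
```

The date comes back as one string here; its sub-fields (day, month, year, time, zone) would need a further split or a longer pattern.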
  • Possible flow of control for splitting a line into components:
    • Get a line of input
    • Extract the request by splitting on the double-quote character (")
    • Extract the date by splitting on the square brackets ([ and ])
    • Extract the IP and status code
  • Some code:
    #!/usr/bin/perl -w
    use strict;
    my ($tmpdate, $tmpip, $tmpreq, $tmpstat);
    my (@tmp, @sfall, @sfdate, @sfreq);
    
    my $infile = "access_log-sample.txt";
    # Three-argument open with a lexical filehandle
    open (my $in, '<', $infile) or die "Cannot open $infile: $!";
    
    while (<$in>) {
     chomp;                           # Ditch trailing newline/LF
     @sfreq = split( /"/ );           # sub-field for request is between "
     next unless defined $sfreq[1];   # skip lines with no quoted request
     @tmp = split ( / /, $sfreq[1]);  # grab the 2nd field [1] & split on spaces
     $tmpreq = $tmp[1];               # the 2nd sub-field is the file requested
    
     @sfall = split ( / / );          # split the whole line on spaces
     $tmpip = $sfall[0];              # 1st sub-field is the IP address
     # ... 
     # Get the idea?
    
    }
    close ($in);
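One way the remaining fields might be filled in, following the same split-based steps, is sketched below. This is not the original course code: the sub name split_line and the order of its return values are assumptions made for the example, and it processes two of the sample lines inline rather than reading a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one log line into IP, date, time, file, status.
sub split_line {
    my ($line) = @_;

    my @sfreq = split /"/, $line;               # [1] is the request, [2] is " status bytes"
    my (undef, $file) = split / /, $sfreq[1];   # request is "METHOD file PROTOCOL"
    my ($status)      = split ' ', $sfreq[2];   # ' ' skips the leading space

    my @sfdate = split /[\[\]]/, $line;         # [1] is the text between [ and ]
    my ($day, $mon, $year, $hh, $mm, $ss, $tz)  # date sub-fields; $tz is not returned
        = split m{[/: ]}, $sfdate[1];

    my ($ip) = split / /, $line;                # first space-separated field

    return ($ip, "$day $mon $year", "$hh:$mm:$ss", $file, $status);
}

# Two of the sample lines from the log excerpt
my @sample = (
    '205.188.116.212 - - [03/Oct/2004:18:16:53 -0700] "GET /viewimage.php3?ListingID=258 HTTP/1.1" 200 4131',
    '172.152.195.42 - - [03/Oct/2004:19:01:15 -0700] "GET /mls.gif HTTP/1.1" 200 4481',
);

for my $line (@sample) {
    print join("\t", split_line($line)), "\n";
}
```

Splitting is easy to read step by step, but it assumes every line is well formed; a line without a quoted request or a bracketed date would leave some fields undefined.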
    
    
    

UAF Computer Science
Prof. Greg Newby