Analysing access logs using R

Recently I was digging through a bunch of access logs, hunting for a tricky bug, and found that R can be quite handy in such a situation. So let’s take a quick look at analysing access logs written in Common Log Format.

Each line of an access log file looks like this:
[Figure: common_log_format]
So the first thing we need to do is read the data into our R session. This is a bit tricky because each timestamp contains a space between the time and the time zone, so read.table splits it into two columns:

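ReadLogFile <- function(file = log.file) {
    # http://en.wikipedia.org/wiki/Common_Log_Format
    # The bracketed timestamp contains a space, so read.table splits it
    # into two columns: ts and time_zone
    access_log <- read.table(file, col.names = c("ip", "client", "user", "ts", 
        "time_zone", "request", "status", "response.size"))

    # strptime returns POSIXlt; convert to POSIXct, which is easier to keep
    # in a data frame and to plot
    access_log$ts <- as.POSIXct(strptime(access_log$ts, format = "[%d/%b/%Y:%H:%M:%S"))
    # drop the closing bracket left over from the timestamp
    access_log$time_zone <- as.factor(sub("\\]", "", access_log$time_zone))
    access_log$status <- as.factor(access_log$status)

    access_log
}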

log.file <- "access_log"
access_log <- ReadLogFile(log.file)
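
To see why the timestamp ends up in two columns, here is a minimal sketch that parses a single line the same way. The sample line is made up for illustration and does not come from the logs analysed here. Gluing the two halves back together and parsing the offset with %z is my assumption about how you might keep the time zone information; it is not part of the function above.

```r
# Illustrative line in Common Log Format (made up for this sketch,
# not taken from the logs analysed in this post)
sample_line <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

parsed <- read.table(text = sample_line, stringsAsFactors = FALSE,
    col.names = c("ip", "client", "user", "ts", "time_zone", "request",
        "status", "response.size"))

# The bracketed timestamp is split at the space
print(parsed$ts)         # [1] "[10/Oct/2000:13:55:36"
print(parsed$time_zone)  # [1] "-0700]"

# Assumption: pasting the halves back together lets %z consume the UTC offset.
# Note that %b only matches English month abbreviations in an English or C
# locale; in other locales the result will be NA.
utc_time <- as.POSIXct(strptime(paste(parsed$ts, parsed$time_zone),
    format = "[%d/%b/%Y:%H:%M:%S %z]", tz = "UTC"))
print(utc_time)
```

If the month names fail to parse, checking Sys.getlocale("LC_TIME") is a good first step.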

Now let’s plot the server load:

library(ggplot2)
library(colorspace)
ggplot(access_log, aes(x = ts)) + geom_density(stat = "bin", binwidth = 3600, 
    colour = "black", fill = "darkgreen") + ylab("Requests/hour") + xlab("Time")

which looks like this:
[Figure: server_load]

A clear user activity pattern emerges: far more requests during working hours and very few at night.
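
If you want the numbers behind the plot, the hourly binning can also be done by hand. This is only a sketch: the timestamps below are hypothetical stand-ins for access_log$ts, and truncating to the hour with format() is one of several ways to do the binning.

```r
# Hypothetical timestamps standing in for access_log$ts
ts <- as.POSIXct("2013-04-12 09:00:00", tz = "UTC") +
    c(0, 600, 1800, 3600, 4000, 7300)

# Truncate each timestamp to its hour and count the requests per hour
hourly <- as.data.frame(table(hour = format(ts, "%Y-%m-%d %H:00", tz = "UTC")))
print(hourly)

# The busiest hour in this toy data
busiest <- as.character(hourly$hour[which.max(hourly$Freq)])
print(busiest)
```

Hours with no requests at all do not show up in the table, which is worth remembering when comparing it with the plot.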

A quick analysis of the response codes reveals that there were some 5xx errors:

df <- as.data.frame(table(access_log$status))
colnames(df) <- c('status','freq')

ggplot(df, aes(x = "", y = freq, fill = status)) +
  geom_bar(position = "fill", alpha = 0.8, width = 0.5, stat = "identity") +
  # one colour per status code actually present in the log
  scale_fill_manual(values = rainbow_hcl(nrow(df), start = 200, end = 20)) +
  coord_flip() + xlab("") + ylab("Frequency")

[Figure: responses]
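
The same breakdown can be read off as plain percentages, which is handy when a thin slice of the bar is hard to judge by eye. A minimal sketch, assuming a hypothetical vector of status codes in place of access_log$status:

```r
# Hypothetical status codes standing in for access_log$status
status <- factor(c("200", "200", "200", "304", "404", "500", "200", "503"))

# Share of each status code, in percent
shares <- round(100 * prop.table(table(status)), 1)
print(shares)

# Group the codes by class (2xx, 3xx, ...) using their first digit
status_class <- paste0(substr(as.character(status), 1, 1), "xx")
class_counts <- table(status_class)
print(class_counts)
```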

Let’s take a closer look:

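# anchor the pattern so that only three-digit codes starting with 5 match
server.errors <- grep('^5[0-9]{2}$', access_log$status)
# status is a factor, so count it with geom_bar rather than geom_histogram
ggplot(access_log[server.errors,], aes(x=status)) + geom_bar(colour="black", fill="red")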

[Figure: http_errors]
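
To get exact counts per error code rather than bar heights, the filtered rows can be tabulated directly. This is a sketch on a hypothetical data frame standing in for access_log; dropping the unused factor levels is my own choice so that codes such as 200 do not appear with a count of zero.

```r
# Hypothetical stand-in for access_log
log <- data.frame(
    request = c("GET / HTTP/1.1", "GET /a HTTP/1.1",
                "POST /b HTTP/1.1", "GET /a HTTP/1.1"),
    status = factor(c("200", "500", "503", "500"))
)

# TRUE for rows whose status is a three-digit code starting with 5
is_error <- grepl("^5[0-9]{2}$", as.character(log$status))

# Count each server error code, dropping levels that no longer occur
error_counts <- table(droplevels(log$status[is_error]))
print(error_counts)
```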

And finally: how were the errors distributed over time?

ggplot(access_log[server.errors,], aes(x=ts)) + 
  geom_density(stat='bin',binwidth=3600, colour="black",fill="red") +  
  ylab('Errors/hour') + xlab('Time')

[Figure: error_load]

Discussion

2 thoughts on “Analysing access logs using R”

  1. Very interesting, however I think it would have been better to use a more acceptable programming language. I would recommend Microsoft Visual Basic Version 6. It can read text files and output to the dialog boxes better than R soft.

    Posted by Weng Fu | April 12, 2013, 11:28 am
    • Actually R is getting more and more popular and it provides a variety of tools which you can use to perform much more sophisticated analyses than what
      I’ve posted here. One day I’m planning to explore some of the possibilities in a new post.

      Thanks for your comment.

      Posted by dratewka | April 12, 2013, 1:02 pm
