Plotting Spilled… Ink?

Michael Stone, February 25, 2012

Ever since I bought Hadley Wickham’s lovely book “ggplot2: Elegant Graphics for Data Analysis (Use R!)” a few weeks back, I’ve been meaning to write up a simple end-to-end example of data collection and plotting using ggplot2.

So, without further delay, let’s try to make a pretty picture of the rate at which I’ve been writing here (and hence of the rate at which the dataset behind my rather naive site search implementation is growing).

Here’s what we’ll do:

  1. Install R and ggplot2:

    sudo aptitude install r-base r-cran-ggplot2
  2. Collect the data:

    echo date bytes post > data.txt
    (for f in $(find posts -name 'index.txt'); do
      # the post date is on the third line of each file, after a leading "% "
      DATE=$(head -n 3 "$f" | tail -n 1 | sed -e 's/^% //');
      echo $(date -d "$DATE" +%s) \
           $(stat -c '%s' "$f") \
           $(echo "$f" | sed -e 's,^posts/,,' -e 's,/index\.txt$,,');
    done) | sort -n -k1 >> data.txt
  3. Sanity-check the resulting data:

    $ head data.txt
    date bytes post
    1232600400 2038 joy_of_tex
    1234674000 3947 irrefutability
    1275796800 2076 openkey
    1300248000 5958 afd_discussions
    1300593600 1358 safe_phones
    1301371200 1126 convergence
    1302235200 1404 secrets
    1302408000 2916 comment_systems
    1307160000 833 scheduling
  4. Make the plot:

    $ R
    library("ggplot2")                                      # load ggplot2
    
    df <- read.table("data.txt", header = TRUE)             # load the data
    
    ndf <- df[order(df$date),]                              # sort the data
    ndf$date2 <- as.POSIXct(ndf$date, origin="1970-01-01")  # convert timestamps to dates
    ndf$total_bytes <- cumsum(ndf$bytes)                    # count total bytes over time
    
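    svg(filename="data.svg", width=6, height=4)             # open an SVG device (size in inches)
    # note: qplot() is deprecated as of ggplot2 3.4.0; it still works, with a warning
    qplot(x = date2, y = total_bytes, data=ndf, xlab="date", ylab="total bytes")
    dev.off()                                               # close the device, writing data.svg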
  5. Enjoy:

(P.S. - Care to guess when I joined Iron Blogger? :-)
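
Back on step 3: eyeballing `head` only covers the first few rows. For something a little more thorough, here is a minimal sketch of a checker for the whole file. The function name `check_data` is made up for illustration and is not part of the original workflow. It assumes the three-column, space-separated layout produced in step 2. It checks the header, checks that `date` and `bytes` are plain integers, and checks that the rows come out sorted by date.

```shell
# check_data FILE
# Hypothetical helper (not from the original post): validates the
# "date bytes post" file built in step 2. Exits non-zero on any problem.
check_data() {
  awk '
    BEGIN { prev = 0; bad = 0 }
    NR == 1 {
      # the first line must be the header written by the echo in step 2
      if ($0 != "date bytes post") { print "bad header: " $0; bad = 1 }
      next
    }
    # every data row needs exactly three fields, the first two being integers
    NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {
      print "line " NR ": malformed row: " $0; bad = 1; next
    }
    # timestamps should never decrease, since the rows were sorted with sort -n
    $1 + 0 < prev { print "line " NR ": out of order"; bad = 1 }
    { prev = $1 + 0 }
    END { exit bad }
  ' "$1"
}
```

Running `check_data data.txt && echo ok` should print `ok` for a file like the one shown in step 3. A post directory with a space in its name would show up as a malformed row, which is worth knowing because the `for f in $(find ...)` loop splits on whitespace.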