Profiling a multi-tiered distributed web application
One picture is worth a 1000 words

Ariel Faigon, Dec 2012
Download the stacked-density latency distribution by API script

1. The Problem

We have a multi-tiered and pretty complex, web application. It is running on many servers: database, business layer, various services, and we want to make it faster so our users are happier and more engaged.

We want to understand our application better and figure out the performance bottlenecks.

We need a profiler for multi-tiered, distributed, web applications.

BTW: there are companies who sell very expensive, dedicated software to other big corporations to do just this. I prefer simplicity, generality, and free software.

2. The Challenge

If our application was a single program, written in one programming language, we would use a profiler to give us a summary of all function calls, the average duration and number of times each of the functions was called, etc.

But we cannot directly profile an application which is:

Thinking pause...

Let's break the problem into two parts:

Collecting the data

Assume we can break our web application into its individual constituent parts (APIs, micro-services, or basic functions) and measure the time it takes each of these parts to complete. Each part is called thousands of times a day, so we could collect this data over a full day or so on multiple hosts. When the day is over we would have a pretty big & relevant data-set.

Epiphany: substitute function with URL, and our web-logs are "it". The data we need is already there:

So part one of our problem (collecting the data) is solved.

Presenting the data

So now we have a big data-set, but still no real insight. How can we gain insight?

Epiphany #2: visualize our of (multiple) web-server logs directly.

A picture is worth a 1000 words.

We need to condense 100s of thousands or millions lines of multiple web-server logs into a short summary which would tell most of the story about our performance. In other words: the goal is to generate a profiler-like report, or even better: a graphical profiler report, directly from our web logs.

We could map:

One such picture: a stacked-density chart of latencies by API appears below.

Note: the data is real, but functions names were made-up for illustrative purposes.

3. The Chart (click to enlarge):

stacked latency distribution of an application by API
The 1000 word story in a single (stacked density by API) chart

4. How to do it?

The first step is to pre-process and extract the subset of needed-data from the logs. A trivial utility like Unix cut on multiple logs may be sufficient here. You would also need to use perl, sed or awk to collapse multiple similar URLs into shorter strings describing the functionality/API they represent: say 'registration', or 'purchase' for readability. Ideally you should have no more than about 50 main APIs to summarize.

If you have a multi-host unified logging, generated by some network equipment like a central load-balancer, this task should be easier.

We extract only the API names (shortened URLs) and their latencies, so we end up with one big file with a pair of columns, separated by TABs: (API   Latency_in_microSecs):

URL_01	32734
URL_01	32851
URL_06	598452
...
URL_11	232734

Next we run the R script below on the data pairs to produce our chart (using Hadley Wickham's ggplot2 library).

Voilla!

5. Observations

The stacked-layers chart tells many stories:

What about client-side performance?

Everything above focuses on server-side performance. The end user experience is highly dependent of web-page design: number of components, their sizes, use of CDN, DOM structure, etc. This is outside the scope of this work.

Analyzing client side factors is already well supported in modern browsers (e.g. in Firefox: Web developer, network, reload page: waterfall chart).

Alternatively, you may want to look at www.webpagetest.org or similar tools for client-perspective analysis.

6. Charting Code (R/ggplot source)

#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
             '/usr/lib/R/library',
             '/usr/lib/R/site-library'
))

suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))

#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)

title.theme   = element_text(family="FreeSans", face="bold.italic",
                            size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
                            size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
                           size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme  = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)
y.axis.theme  = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)


# eprintf: handy function for prints/debugging
eprintf <- function(...) cat(sprintf(...), sep='', file=stderr())

#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
    from <- 0
    to <- from.to[2]
    # round to 10 ceiling
    to <- ceiling(to / 10.0) * 10
    # Count major breaks on 10^N boundaries, include the 0
    n.maj = 1 + ceiling(log(to) / log(10))
    # if major breaks are too few, add minor-breaks half-way between them
    n.breaks <- ifelse(n.maj < 5, n.maj*2+1, n.maj)
    breaks <- as.integer(seq(from, to, length.out=n.breaks))
    # eprintf("from=%g to=%g n.maj=%g breaks %g\n", from, to, n.maj, breaks)
    breaks
}

#
# -- main
#

# -- process the command line args:  [tsv_file [png_file]]
#    (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
    argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])

# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', head=F)

# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')

#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)

#
# -- Trim the latency axis:
#       Drop the few 0.001% extreme-slowest outliers on the right
#       to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)


#
# -- API factor pruning
#       Drop rows where the APIs is less than small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
    d.N <- nrow(d)
    API.counts <- table(d$API)
    d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
    d <- d[d$CallPct > Rare.APIs.pct, ]
    d.N.new <- nrow(d)

    # eprintf("Before %d line   after %d lines (diff: %d)\n",
    #   d.N, d.N.new, d.N - d.N.new)
}

#
# -- Adjust legend item-height &font-size
#    to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme  = element_text(family="FreeSans", face="bold.italic",
                            size=Legend.FontSize,
                            colour=gray(0.3))

# eprintf("API.count=%d  Legend.LineSize.Factor=%g Legend.FontSize=%g\n",
#       API.count, Legend.LineSize, Legend.FontSize)


# -- set latency (X-axis) breaks and labels (s.b made more generic)
lat.breaks <- c(0, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)

#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=..count../1000.0, group=API, fill=API)) +
       geom_bar(binwidth=0.01) +
          scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
          scale_y_continuous(breaks=yscale_breaks) +
          ggtitle('APIs Calls & Latency Distribution') +
          xlab('Latency in seconds - log(10) scale') +
          ylab('Call count (in 1000s)') +
          theme(
                plot.title=title.theme,
                axis.title.y=y.label.theme,
                axis.title.x=x.label.theme,
                axis.text.x=x.axis.theme,
                axis.text.y=y.axis.theme,
                legend.text=legend.theme,
                legend.key.height=unit(Legend.LineSize, "line")
          )

#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)

R code to generate the above chart