Profiling a multi-tiered web-application

We have a multi-tiered and pretty complex, web application. It is running on many servers: database, business layer, various services, and we want to make it faster so our users are happier and more engaged.

We want to understand our application better and figure out the performance bottlenecks.

BTW: there are companies who sell very expensive, dedicated software to other big corporations to do just this. I prefer simplicity, generality, and free software.

2. The Challenge

If our application was a single program, written in one programming language, we would use a profiler to give us a summary of all function calls, the average duration and number of times each of the functions was called, etc.

Collecting the data

Presenting the data

So now we have a big data-set, but still no real insight. How can we gain insight?

Epiphany #2: visualize our of (multiple) web-server logs directly.

A picture is worth a 1000 words.

We need to condense 100s of thousands or millions lines of multiple web-server logs into a short summary which would tell most of the story about our performance. In other words: the goal is to generate a profiler-like report, or even better: a graphical profiler report, directly from our web logs.

We could map:

Call-latencies to one dimension (say X-axis)
Number of calls to another dimension (say Y-axis) and
Function identities to a color

One such picture: a stacked-density chart of latencies by API appears below.

Note: the data is real, but functions names were made-up for illustrative purposes.

The first step is to pre-process and extract the subset of needed-data from the logs. A trivial utility like Unix cut on multiple logs may be sufficient here. You would also need to use perl, sed or awk to collapse multiple similar URLs into shorter strings describing the functionality/API they represent: say 'registration', or 'purchase' for readability. Ideally you should have no more than about 50 main APIs to summarize.

If you have a multi-host unified logging, generated by some network equipment like a central load-balancer, this task should be easier.

We extract only the API names (shortened URLs) and their latencies, so we end up with one big file with a pair of columns, separated by TABs: (API Latency_in_microSecs):

URL_01	32734
URL_01	32851
URL_06	598452
...
URL_11	232734

What about client-side performance?

Analyzing client side factors is already well supported in modern browsers (e.g. in Firefox: Web developer, network, reload page: waterfall chart).

Alternatively, you may want to look at www.webpagetest.org or similar tools for client-perspective analysis.

6. Charting Code (R/ggplot source)

#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
             '/usr/lib/R/library',
             '/usr/lib/R/site-library'
))

suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))

#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)

title.theme   = element_text(family="FreeSans", face="bold.italic",
                            size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
                            size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
                           size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme  = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)
y.axis.theme  = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)


# eprintf: handy function for prints/debugging
eprintf <- function(...) cat(sprintf(...), sep='', file=stderr())

#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
    from <- 0
    to <- from.to[2]
    # round to 10 ceiling
    to <- ceiling(to / 10.0) * 10
    # Count major breaks on 10^N boundaries, include the 0
    n.maj = 1 + ceiling(log(to) / log(10))
    # if major breaks are too few, add minor-breaks half-way between them
    n.breaks <- ifelse(n.maj < 5, n.maj*2+1, n.maj)
    breaks <- as.integer(seq(from, to, length.out=n.breaks))
    # eprintf("from=%g to=%g n.maj=%g breaks %g\n", from, to, n.maj, breaks)
    breaks
}

#
# -- main
#

# -- process the command line args:  [tsv_file [png_file]]
#    (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
    argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])

# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', head=F)

# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')

#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)

#
# -- Trim the latency axis:
#       Drop the few 0.001% extreme-slowest outliers on the right
#       to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)


#
# -- API factor pruning
#       Drop rows where the APIs is less than small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
    d.N <- nrow(d)
    API.counts <- table(d$API)
    d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
    d <- d[d$CallPct > Rare.APIs.pct, ]
    d.N.new <- nrow(d)

    # eprintf("Before %d line   after %d lines (diff: %d)\n",
    #   d.N, d.N.new, d.N - d.N.new)
}

#
# -- Adjust legend item-height &font-size
#    to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme  = element_text(family="FreeSans", face="bold.italic",
                            size=Legend.FontSize,
                            colour=gray(0.3))

# eprintf("API.count=%d  Legend.LineSize.Factor=%g Legend.FontSize=%g\n",
#       API.count, Legend.LineSize, Legend.FontSize)


# -- set latency (X-axis) breaks and labels (s.b made more generic)
lat.breaks <- c(0, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)

#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=..count../1000.0, group=API, fill=API)) +
       geom_bar(binwidth=0.01) +
          scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
          scale_y_continuous(breaks=yscale_breaks) +
          ggtitle('APIs Calls & Latency Distribution') +
          xlab('Latency in seconds - log(10) scale') +
          ylab('Call count (in 1000s)') +
          theme(
                plot.title=title.theme,
                axis.title.y=y.label.theme,
                axis.title.x=x.label.theme,
                axis.text.x=x.axis.theme,
                axis.text.y=y.axis.theme,
                legend.text=legend.theme,
                legend.key.height=unit(Legend.LineSize, "line")
          )

#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)

Profiling a multi-tiered distributed web application
One picture is worth a 1000 words

1. The Problem

2. The Challenge

Collecting the data

Presenting the data

3. The Chart (click to enlarge):

4. How to do it?

5. Observations

What about client-side performance?

6. Charting Code (R/ggplot source)

Profiling a multi-tiered distributed web application One picture is worth a 1000 words

1. The Problem

2. The Challenge

Collecting the data

Presenting the data

3. The Chart (click to enlarge):

4. How to do it?

5. Observations

What about client-side performance?

6. Charting Code (R/ggplot source)

Profiling a multi-tiered distributed web application
One picture is worth a 1000 words