A picture is worth 1000 words (and more)
Here's a numeric data-set in 3 dimensions. I challenge anyone to try and make sense of it.
- What is this data about?
- Can you detect any patterns?
- Does it make any sense?
- Do you see errors or unlikely outliers in the data?
Looking at the data directly won't get you far. You can spend a few seconds, minutes, hours, maybe even days looking at these 3 numeric columns without gaining any insight.
What's the point?
The human brain is not good at coping with a large number of tiny details. Humans can typically hold only about 7 ± 2 items in mind at a time.
This particular data-set has over 5000 lines and over 15,000 numbers in it.
But there's hope!
What if we could turn this data into a picture?
Since this data-set has 3 columns, it would be natural to map each column into one dimension in 3D space:
(x, y, z)
Here's that picture [scroll down to see it]:
[Data credit: Ross Ihaka]
Maunga Whau (Mt Eden) is one of about 50 volcanoes in the Auckland, New Zealand, volcanic field. This data set gives topographic information for Maunga Whau on a 10 m by 10 m grid.
With all the data in one picture, it becomes immediately obvious what the data is about. It is a topographical view (from above) of an oval-shaped volcano with uneven crater walls, the highest point being on the left side of the crater. Three ridges slope down on the lower side, the crater sits on the left, and a small local peak (in yellow) can be seen to the right of the crater, and so on. We can start noticing many details and describing them. We can even memorize the picture and reproduce it fairly closely from memory later.
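To make the column-to-axis mapping concrete, here is a minimal Python sketch. It does not use the Maunga Whau data: the surface is an invented ring-shaped ridge, and `synthetic_crater`, `to_grid`, and `render` are hypothetical helper names chosen for illustration. The first two columns (x, y) select a grid cell and the third (z) selects a character, which yields a crude text "picture" of the same kind of data.

```python
import math

def synthetic_crater(n_rows=20, n_cols=40):
    """Return (x, y, z) rows for a made-up crater-like surface (not real data)."""
    rows = []
    for i in range(n_rows):
        for j in range(n_cols):
            # Offsets from the grid centre, scaled so both axes span about -1..1.
            dy = (i - (n_rows - 1) / 2) / (n_rows / 2)
            dx = (j - (n_cols - 1) / 2) / (n_cols / 2)
            r = math.hypot(dx, dy)
            # A ring-shaped ridge at r = 0.4: high rim, lower centre and flanks.
            z = 100 + 90 * math.exp(-((r - 0.4) ** 2) / 0.1)
            rows.append((j, i, z))
    return rows

def to_grid(rows):
    """Turn (x, y, z) rows into grid[y][x] = z.

    Assumes x and y are integers that cover a full rectangular grid.
    """
    width = max(x for x, _, _ in rows) + 1
    height = max(y for _, y, _ in rows) + 1
    grid = [[None] * width for _ in range(height)]
    for x, y, z in rows:
        grid[y][x] = z
    return grid

def render(grid, shades=" .:-=+*#%@"):
    """Map each height to a character: low values light, high values dense."""
    flat = [z for row in grid for z in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    top = len(shades) - 1

    def shade(z):
        return shades[min(int((z - lo) / span * len(shades)), top)]

    return "\n".join("".join(shade(z) for z in row) for row in grid)

rows = synthetic_crater()          # 800 rows of three numbers each
picture = render(to_grid(rows))    # the same numbers as one picture
print(picture)
```

Printed as three columns, the 800 rows are as opaque as any other table of numbers. Printed as a grid of characters, the rim and the lower centre are visible at a glance. A real plotting library would do the same job with color and perspective.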
Visualization meets anomaly detection
Now, what if someone messed with our data?
Let's continue with a related exercise and change the z-value (3rd column) of just 5 (out of over 5,000) points to some arbitrary value taken from somewhere else in the data-set. It would be extremely hard to notice this small change by looking at the numbers directly. But in the picture, our small tampering stands out immediately.
[scroll down to continue]
The bad data shows up as a vertical blue line where it isn't supposed to be, in the middle of the smooth green area on the lower right-hand slope of the volcano:
The picture allows us not just to grasp the whole data set in the blink of an eye, but also to notice a tiny case of tampering. The importance of visualization in making sense of data cannot be overstated.
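The reason the tampered points jump out of the picture is that each one disagrees sharply with its immediate surroundings. As a rough numerical analogue of what the eye does (not the method used to produce the pictures here), the sketch below tampers with 5 cells of a made-up smooth slope and then flags any cell that differs from the median of its neighbours by more than a threshold. The surface, the target cells, the threshold of 10, and the helper names are all assumptions made for illustration.

```python
from statistics import median

def make_surface(n_rows=30, n_cols=40):
    """A smooth made-up slope standing in for real elevation data."""
    return [[100.0 + 2.0 * i + 1.5 * j for j in range(n_cols)]
            for i in range(n_rows)]

def tamper(grid, targets):
    """Return a copy where each target cell takes the value of the cell
    diagonally opposite it, i.e. a value from elsewhere in the data."""
    n_rows, n_cols = len(grid), len(grid[0])
    bad = [row[:] for row in grid]
    for i, j in targets:
        bad[i][j] = grid[n_rows - 1 - i][n_cols - 1 - j]
    return bad

def find_outliers(grid, threshold=10.0):
    """Flag cells that differ from the median of their neighbours
    by more than `threshold` (an arbitrary cutoff for this sketch)."""
    n_rows, n_cols = len(grid), len(grid[0])
    flagged = []
    for i in range(n_rows):
        for j in range(n_cols):
            # Up to 8 surrounding cells, fewer along the edges.
            neighbours = [
                grid[a][b]
                for a in range(max(i - 1, 0), min(i + 2, n_rows))
                for b in range(max(j - 1, 0), min(j + 2, n_cols))
                if (a, b) != (i, j)
            ]
            if abs(grid[i][j] - median(neighbours)) > threshold:
                flagged.append((i, j))
    return flagged

targets = [(3, 5), (6, 12), (10, 3), (22, 30), (26, 35)]
surface = make_surface()
flagged = find_outliers(tamper(surface, targets))
print(flagged)
```

On this synthetic slope the five tampered cells are the only ones flagged. Real terrain is rougher, so a fixed threshold would need tuning, and a copied value that happens to be close to the original would slip through. The picture still has the advantage of requiring no threshold at all.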
Complex models have their place, but 'debugging' them and fully understanding them is another matter. Simple, direct rendering and visualization of data is often the best tool for data insight and debugging that I know of. I hope this example has convinced you of that too.
Any feedback is welcome.
-- ariel