May 17, 2009: The importance of computing resources
Contents:
- A shiny new computer
- Parallelization effect
- Data size
- Thanks for the free software
A shiny new computer
Every day my computer spends a lot of time doing the following:
- Collecting new data and verifying its validity
- Computing and extracting meaningful features from the data
- Selecting the best features from the many available, based on their predictive power
- Building new models from the data across different time frames
- Evaluating the models by estimating their mean errors on out-of-sample data, again across different time frames
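The select-and-evaluate steps above can be sketched in a few lines. This is only an illustrative toy (the function names, the trailing-mean "model", and the window choices are my own assumptions, not the author's actual pipeline): for each look-back window, predict each day's change as the trailing mean change, score it by mean absolute error on out-of-sample data, and keep the window that scores best.

```python
# Toy sketch of "select the best feature by out-of-sample error".
# Hypothetical names and model; not the author's actual pipeline.
import random

def mean_abs_error(prices, window):
    """Predict each day's change as the trailing `window`-day mean change,
    and return the mean absolute error on the out-of-sample second half."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    start = max(window, len(changes) // 2)      # evaluate out of sample only
    errors = []
    for t in range(start, len(changes)):
        prediction = sum(changes[t - window:t]) / window
        errors.append(abs(prediction - changes[t]))
    return sum(errors) / len(errors)

# Synthetic random-walk "price" series, just to make the sketch runnable.
random.seed(1)
prices = [100.0]
for _ in range(400):
    prices.append(prices[-1] + random.gauss(0, 1))

# Re-run daily: pick the look-back window with the lowest
# out-of-sample error as the data grows.
best = min((5, 20, 60), key=lambda w: mean_abs_error(prices, w))
print("best window:", best)
```

Each window is an independent computation, which is what makes this kind of daily job easy to parallelize.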
Trends that have developed for a while tend to eventually break; correlations that have persisted between asset classes or sectors tend to shift; volatility, which stems from short-term sentiment, tends to grow for a while and then, just as it came, to decline and fade away.
A model that has been recently very accurate in the short term, may be totally out of line when applied to longer-term periods. What remains relatively constant is the cyclical nature of the above: trends vs. mean-reversions on every parameter and time-frame.
And the more short-term features the model looks at, the more computation it requires.
So I took a little bit of PIE's recent gains and bought a shiny new computer, so I spend less time waiting for the daily results.
I went to two well-known brand-name computer makers and tried to configure my desired computer on their web pages. I wanted:
- As many CPU cores as possible, so I can parallelize my heavy processing
- Large memory, so all these processes fit into RAM together
- A large disk to store many data sets, simulations, and models
- Good video for visualization, but not the noisy high end with active (fan) cooling
No matter how I sliced it, I always ended up with a price above $5,000, including stuff I didn't want or need, like the "Windows Vista Home Edition Operating System", and huge markups on every single part upgraded from the minimal, teaser-price initial configuration.
Can you imagine paying $5,000 for a computer?
I'm cheap, so I decided to try the alternative approach:
- Buy generic parts and build the box myself.
- Pick mid- to high-end (but never the most expensive) parts from highly reputable manufacturers, by looking for the best price/performance points and reading end-user recommendations.
To my great surprise, my dream computer, broken into parts and self-assembled (not including a monitor), came to just $1,205 including shipping and California tax. I know this 1-to-5 price ratio is hard to believe, but it is a fact: the brand-name markups on premium configurations seem to be in the 5x range. As a specific example: when I tried to "upgrade" from the standard small hard-disk configuration on the brand-name web site to a 1 TB option, the drop-down menu showed "Add $499". Bought separately, the 1 TB drive turned out to cost less than $100 after an instant manufacturer rebate.
Here's my detailed parts list so you can check it out yourself, or build a similar box.
When my computer sits idle, it uses power-saving techniques and all the processors slow down to almost half frequency to save power. Here's the idle vs all-busy picture in kpowersave:
Parallelization effect
By breaking the task into independent data problems, running 10 processes in parallel on 8 computing units (a quad-core CPU, each core dual-threaded), with memory big enough to hold them all, I can run a one-hour job in about 6 minutes. More importantly, when I run a daily job over the past 400 days for backtesting, I can complete it in about 40 hours, as opposed to giving up because it would take two weeks on a single CPU.
Here's a top snapshot showing all 8 CPUs churning at close to 100% utilization. Note the 0% idle, system, wait-states etc. With all the data needed for this particular job fitting into physical memory, all this computing power goes straight into the CPU-bound user processes.
And here's a view of one of these parallel tasks showing 10 processes making progress in parallel. Note that since 10 processes are using only 8 available CPUs, the average percentage utilization per-process is somewhat lower than 100%.
Data size
My data is growing by leaps and bounds, and at over 16 GB it is already bigger than all my photo, video, and music collections combined.
The image on the right shows a multi-layer rendition of disk space usage, courtesy of a nice Linux application called "filelight".
When you hover your mouse over any of the colored sectors, filelight shows the name of the directory and the space it and all its child directories consume, in both absolute and relative terms, plus the total number of files. Clicking on a directory drills down into its children, adding a new outer concentric layer.
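The core computation behind a tool like filelight, recursive per-directory disk usage where every directory's total includes all of its descendants, can be sketched in a few lines (this is my own illustration, not filelight's code):

```python
# Sketch of filelight-style accounting: total bytes per directory,
# including everything below it.
import os

def directory_sizes(root):
    """Return {directory: total bytes of it and all its descendants}."""
    sizes = {}
    # Walk bottom-up so each child's total exists before its parent needs it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = sum(
            os.path.getsize(os.path.join(dirpath, f))
            for f in filenames
            if os.path.isfile(os.path.join(dirpath, f))   # skip broken links
        )
        total += sum(sizes.get(os.path.join(dirpath, d), 0) for d in dirnames)
        sizes[dirpath] = total
    return sizes

sizes = directory_sizes(".")
for path, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{nbytes:>12,d}  {path}")
```

The bottom-up walk is the key trick: it turns what looks like a recursive problem into a single pass, since every subtree total is ready before its parent sums it.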
Thanks for the free software
My new computer is running Ubuntu 9.04 (Jaunty Jackalope) with a 64-bit SMP Linux kernel. I'm using R [wikipedia article], Weka [wikipedia article], perl+PDL [wikipedia article], and several lesser-known machine learning tools to build and evaluate models.
The savings from not having to pay the "operating system tax" are not negligible: probably less than $200 for the operating system alone, but about $800 once "office productivity software" is included. It seems remarkable that this essential software cost can now amount to two-thirds of the price of a new high-end computer.
But the really big win doesn't come from the cost savings; it comes from the "free as in freedom" part: the flexibility to tinker, and the ability to run state-of-the-art analytical software; being able to improve the code and get support from the community that shares it and keeps improving it; the ability to look at the source code, modify the parts an end user cares about, and send occasional patches to the maintainers for inclusion in future releases.
I also can't help but comment on the upgrade cycle and its effect on speed.
As time passes, more and more software gets installed on a typical Windows computer, and every new piece seems to add an automatic updater and/or a component that runs permanently as part of the start-up process. Virus-signature database files keep growing and growing, taking ever longer to scan all the files on the system. No wonder the average Windows computer gets slower and slower over time.
Not so in the free software world: while there may be some performance regressions (unintentional, obviously), I find that in almost all cases a distro or software upgrade on my Linux system makes the same hardware run faster over time.
As always, this isn't intended as investment advice. It merely reflects my own thinking and actions at the time of writing. In the immortal words of John Maynard Keynes "When the facts change, I change my mind. What do you do, sir?"
Every investor should make his own decisions based on his risk tolerance, comfort zones, convictions, and understanding.
Any feedback is welcome.