Sunday, June 14, 2015

Parallel and a new laptop

I am thinking about a new laptop. For one thing, a 1366×768 resolution is starting to feel impractically small. Secondly, I want faster computations and more memory.
Regarding CPU speed, my current laptop has a lowly Celeron 877. From what I see of my computer's activity, under R it is mostly a single core doing the work. That means that even though there are two cores, the single core CPU mark of 715 (from cpubenchmark.net) is what I have available. A bit of checking shows that the current batch of processors mainly offers more cores. For instance, the highest rated common CPU, an Intel Core i7-4710HQ, has a CPU mark of 7935 and a single core mark of 1870. That is about 2.5 times faster for one core, but its top rating comes mostly from having four cores. The same is true down the line: four cores is common, yet single core speed has not improved that much. Unless I can actually use those extra cores, what is the gain? Hence I am wondering: can I do something with extra cores for real world R computations? That I can investigate.

Easy approach: parallel

A bit of browsing shows that the parallel package is the easy way to use multiple cores: think of using mclapply() rather than lapply(). In many situations this is straightforward. Cross validation, for instance, only carries the small upfront cost of partitioning the data into chunks, and trying different settings for a machine learning problem is similar.
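
To make the switch concrete, here is a minimal sketch with made-up data (not the protein data below): four-fold cross validation of a linear model, one fold per worker.

library(parallel)

# minimal sketch with made-up data: 4-fold cross validation of a linear model,
# one fold per worker; the only real change is lapply() becoming mclapply().
# mclapply() relies on forking, so it runs in parallel on Linux/OS X, not Windows.
set.seed(1)
dat <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
folds <- sample(rep(1:4, length.out = nrow(dat)))

cv_rmse <- mclapply(1:4, function(k) {
  fit <- lm(y ~ x1 + x2, data = dat[folds != k, ])
  pred <- predict(fit, newdata = dat[folds == k, ])
  sqrt(mean((dat$y[folds == k] - pred)^2))
}, mc.cores = 2)
unlist(cv_rmse)
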
To give this a certain real world setting, data was taken from the UCI machine learning repository: the Physicochemical Properties of Protein Tertiary Structure Data Set, which has 45730 rows and 9 variables. A bit of plotting shows this figure for 2000 randomly selected rows. It seems the problem is not so much which variables to use but rather their interactions. This was also suggested by the poor performance of linear regression.
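
Reading the data could look roughly like this sketch; the exact file location on the UCI server (CASP.csv) is an assumption, and train, with the response RMSD in the first column, is the object used in the code further on.

# sketch; the CASP.csv location on the UCI server is assumed, not verified
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv"
train <- read.csv(url)   # response (RMSD) in column 1, nine predictors after it

# quick look at 2000 randomly selected rows
pairs(train[sample(nrow(train), 2000), ], pch = '.')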

Random forest in parallel

Even though nine variables is a bit low for random forest, I elected to use it as the first technique. The main parameters to tune are nodesize and the number of variables to try at each split (mtry). Hence I wrapped this in mclapply, not even using cross validation, and took care not to nest the mclapply calls. The result was heavy memory usage, which in hindsight may be obvious: each of the worker instances gets a complete copy of the data set. The net effect was that I ran out of RAM and data was swapped, which cannot be good for performance. It may also explain comments I have read that the caret package uses too much memory. A decent set of hardware for machine learning, including a four core processor, would create four instances of the same data. Perhaps adding another 4 GB of memory and an SSD rather than a HDD would serve me just as well as a new laptop...
library(parallel)       # provides mclapply()
library(randomForest)

# grid of tuning settings to try
tol <- expand.grid(mtry=1:3,
    nodesize=c(3,5,10))

# fit one random forest per row of the grid; train holds the response in
# column 1 and the predictors in the remaining columns
bomen <- mclapply(seq_len(nrow(tol)), function(i)
          randomForest(
              y=train[,1],
              x=train[,-1],
              ntree=50,
              mtry=tol$mtry[i],
              nodesize=tol$nodesize[i])
)
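
One knob that might ease the memory pressure is mclapply()'s mc.cores argument, which caps how many forked workers are alive at once. Forked children share memory copy-on-write in principle, but R tends to touch the pages anyway, so fewer simultaneous workers should mean fewer effective copies of the data; two workers is just an illustrative choice.

# same call, but with at most two workers running at any one time
bomen <- mclapply(seq_len(nrow(tol)), function(i)
          randomForest(
              y=train[,1],
              x=train[,-1],
              ntree=50,
              mtry=tol$mtry[i],
              nodesize=tol$nodesize[i]),
          mc.cores=2
)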

Final thoughts

New hardware could also bring GPU computing into the picture, but that seems not so easy. It is unclear to me whether CUDA or OpenCL is preferable, and neither seems particularly easy to use. Then again, I could minimize hardware altogether, buy a Chromebook with a decent screen and do my stuff in the cloud. For now though, I will continue to investigate how extra cores can help me.

6 comments:

  1. "Perhaps adding another 4 GB of memory and an SSD rather than a HDD would serve me just as well as a new laptop"

    Nope. In your situation it is past time you upgrade. Even a single-threaded process will see a 2-10x speed up going from an old Celeron to a Broadwell/Skylake. The only questions are whether to go with 2 or 4 cores, 8 or 16 GB of RAM, and the size of the SSD (though that at least will be upgradable in the future). If you are doing serious computation there is no reason to skimp on hardware; it is far cheaper than your time.

    Replies
    1. If you want to do serious computation on a laptop you are looking at something like a "gamer" laptop, which run about $3000US and usually run Windows 7. You will probably want to dual-boot it with Linux, and then you're likely to find hassles with GPU drivers.

      Then there's the bandwidth issue. High-performance computing requires a reliable high-speed connection. That usually means *wired*; a coffee shop WiFi isn't going to do the job.

      My advice is to forget about doing HPC on a laptop and have a custom-built workstation made. I got my present rig about 2 years ago. It has a near top-end ATI GPU, 32 GB of RAM, a 128 GB SSD, a terabyte hard drive and an 8-core AMD processor. That cost about $1500US. If I had wanted to spend the money I could have gotten a 12-core Intel processor for $2600.

    2. Oh, yeah - if you do go the gamer laptop route you'll want a chiller pad - those things draw a lot of current and generate a lot of heat in a confined space.

    3. First of all thank you for your comment. I agree my time should be more precious than my computer. However, I am not using this laptop professionally. At work there is a quite different setup. For professional use I would probably follow M Edward Borasky and get a custom made workstation.
      So, in this non-professional usage, there are relatively few occasions where my current laptop is not sufficient, mostly when looking at data mining and big data. In addition, it seems that next year might bring improvements in processors from both Intel and AMD. Buying this year means not buying next year. Hence the idea that I can work with this one a bit longer.

  2. Look into Teraproc. They make a cloud instance of R with RStudio that is already highly tuned for multi-core and GPU computing. It runs off of Amazon, but they auto-configure everything--even a small cluster if you want it, including making all the connections. Buying a super gaming rig like the one mentioned above costs about $2600--but renting one costs about 50 cents an hour.

    Replies
    1. Budgeting for cloud computing is a non-trivial exercise. Just about everyone I know who's run big jobs in the cloud at "50 cents an hour" has run up huge bills because they either accidentally left something running or underestimated the required resources to successfully complete a task.

      Remember, you pay for the run whether it was successful or not. You pay for the storage space even when it's not running and you pay for the data transferred in and out.

      Spend the money - get a workstation. Mine typically last five years although I kept the old one running for seven. Look at it this way:

      1. Five years is (365+366+365+365+365)*24 hours - that's 43824 hours.
      2. At 50 cents an hour that's $21912US! Put that on a credit card and watch the interest flow from your wallet to the bank.
