decision tree classifier in CUDA.. some doubts

Hello everyone,

I am doing my project in field of Data mining where I am using CUDA technology to speed up the classification of data sets. The aim is to implement parallel version of decision tree classifier used in the classification in order to achieve higher performance compared to primitive sequential implementation techniques. I have implemented it but not getting improved performance in terms of speed. I also used visual profiler but could not figure out what is going wrong.

Here are some implementation details :

** Kernel 1 : The Decision tree classifier basically executed recursively and builds a tree structure based on some criteria called as Attribute selection method. I implemented this attribute selection method parallely, that means this method is executed along multiple threads where number of threads is equal to number of attributes. Attribute selection method I used is ID3.

**Kernel 2 : Also in the algorithm there are steps where we need to count total number of instances per class. This counting, I implemented in parallel so number of threads is equal to number of instances. Here in this case I used atomicAdd() to increment counter variable.

** Depending upon the results from attribute selection method, the dataset is further partitioned into multiple subsets and same algorithm is called recursively for each subset. I implemented this partitioning in two ways. Initially whole dataset is copied from host memory to device memory.
1. Partitioning is carried out on CPU and every time subset is copied to device memory
2. Partitioning is carried out on GPU by calling kernel functions with execution configuration as <<1,1>>

**The datasets I used downloaded from UCI machine learning repository :
Poker Hand Dataset : 11 attributes and two datasets with 25010 and 1,000,000 instances
SPECT Heart Dataset : 23 attributes and 267 instances
Optical Recognition of Handwritten Digits Dataset : 65 attributes and 3823 instances

**Results :

for GPU Implementation 1
Poker Hand (25010 inst.) 2.86 sec
Poker Hand (1000000 inst.) 162.83 sec
SPECT Heart 0.23 sec
Optical Recognition of Handwritten Digits 1.14 sec

for GPU Implementation 2
Poker Hand (25010 inst.) 10.18 sec
Poker Hand (1000000 inst.) 490 sec
SPECT Heart 0.18 sec
Optical Recognition of Handwritten Digits 6.34 sec

for CPU Implementation
Poker Hand (25010 inst.) 0.22 sec
Poker Hand (1000000 inst.) 8.85 sec
SPECT Heart 0.003 sec
Optical Recognition of Handwritten Digits 0.22 sec

** System Configuration :
GPU - GeForce GTX 280
CUDA Version 3.1
CPU - Dual Core AMD Opteron
RAM - 3.5 GB
OS - Windows XP SP2
Visual Studio 2008

I have certain doubts in my implementation in CUDA which I wanted to clarify with you such as

  1. I use global memory to accommodate large dataset, can I use shared memory for this purpose? As per my knowledge the latency for accessing global memory is more than that of shared memory.
  2. Is use of atomicAdd() affecting performance?
  3. Is recursive nature of algorithm affecting performance?
  4. It seems that second implementation is better than first one but results are different.

I would be very thankful if you could guide me for alternative changes in the design of algorithm.

At least at a high level, a decision tree is not a good match for GPUs. I think you’ll have trouble beating your CPU implementation because of the recursive tree-based structure of the build (training) algorithm. SVMs and neural networks are two machine learning/data mining algorithms that are much more naturally suited for GPUs because they involve mostly matrix operations.

With that said, perhaps you could have success with the parallelizing the classification step of the decision tree, as you can run many test examples in parallel. Additionally, take a look at some of the work on implementing kd-trees on GPUs for some inspiration on how to accelerate the build/training step. I’m sure you can at least improve your implementation, and maybe even beat the cpu implementation.

good luck!

Disclaimer: I’m not familiar with the field and only have a vague idea of what you are doing.

How many attributes (=threads) do you have? Unless there are tens of thousands of these, you should find another way to extract more parallelism from the problem.

Also it seems to me that, with the current implementation, you will hardly get any coalesced memory accesses. This will need a detailed look at what you are doing though.

To answer your specific questions:


Yes, shared memory has both better bandwidth and latency. Given it’s limited size it might be difficult to employ it in your problem. Again this would need a detailed look.

Yes, atomicAdd() is slow. It got quite a bit faster on compute capability 2.x devices though.

A recursive algorithm itself should not be a problem (given that you have an 1.x device that does not support recursion, I assume you have implemented it by hand?). It might however lead to bad, uncoalesced memory access patterns.

Do any of the results coincide with the CPU implementation?

It seems to me that this is a problem that does not fit very well to CUDA. You might find some improvements by moving to a compute capability 2.x device where the cache for global memory and faster atomics might mitigate the problems.

Overall I think you need to find a better way to extract useful parallelism, which I understand is difficult if you are new to CUDA. Maybe you can show us some code?

I have one more doubt…
here I shown the results that I compared between complete CUDA application and CPU implementation.CUDA implementation includes data transferring between host and device memory several times. So I guess, that transferring time is much more. Because today I just measured the time required for kernel execution only and corresponding CPU implementation. I got good results.
My question is, whenever we compare GPU and CPU application, should we compare only parallel part or complete application? I mean, should we take into account data transferring between memories or just compare kernel part and its serial version?

Did you insert a call to cudaThreadSynchronize() before stopping the timer? If not, you’ve only measured the time to launch the kernel, not its full execution time.

OK I got it, thanks for this correction. Now I put cutStopTimer() after cudaThreadSynchronize(). But is it acceptable if I compare only kernel execution time with the part of serial algorithm that I implemented in parallel in this kernel? That was my question in last reply…


Think of what your overall goal is. If you need the results on the CPU, then include the transfer time. If you are able to directly use them for whatever you need them for on the GPU, don’t.

There is a facility in C to measure time with clock(). I used this for both serial CPU version and parallel GPU version and got much more difference in execution time of both the versions. As compared to CUDA timer cutCreateTimer(), this cloak() timer shows much more difference…
Question: Which timer is more accurate or I mean should be preferred?