I am doing my project in field of Data mining where I am using CUDA technology to speed up the classification of data sets. The aim is to implement parallel version of decision tree classifier used in the classification in order to achieve higher performance compared to primitive sequential implementation techniques. I have implemented it but not getting improved performance in terms of speed. I also used visual profiler but could not figure out what is going wrong.
Here are some implementation details :
** Kernel 1 : The Decision tree classifier basically executed recursively and builds a tree structure based on some criteria called as Attribute selection method. I implemented this attribute selection method parallely, that means this method is executed along multiple threads where number of threads is equal to number of attributes. Attribute selection method I used is ID3.
**Kernel 2 : Also in the algorithm there are steps where we need to count total number of instances per class. This counting, I implemented in parallel so number of threads is equal to number of instances. Here in this case I used atomicAdd() to increment counter variable.
** Depending upon the results from attribute selection method, the dataset is further partitioned into multiple subsets and same algorithm is called recursively for each subset. I implemented this partitioning in two ways. Initially whole dataset is copied from host memory to device memory.
1. Partitioning is carried out on CPU and every time subset is copied to device memory
2. Partitioning is carried out on GPU by calling kernel functions with execution configuration as <<1,1>>
**The datasets I used downloaded from UCI machine learning repository :
Poker Hand Dataset : 11 attributes and two datasets with 25010 and 1,000,000 instances
SPECT Heart Dataset : 23 attributes and 267 instances
Optical Recognition of Handwritten Digits Dataset : 65 attributes and 3823 instances
for GPU Implementation 1
Poker Hand (25010 inst.) 2.86 sec
Poker Hand (1000000 inst.) 162.83 sec
SPECT Heart 0.23 sec
Optical Recognition of Handwritten Digits 1.14 sec
for GPU Implementation 2
Poker Hand (25010 inst.) 10.18 sec
Poker Hand (1000000 inst.) 490 sec
SPECT Heart 0.18 sec
Optical Recognition of Handwritten Digits 6.34 sec
for CPU Implementation
Poker Hand (25010 inst.) 0.22 sec
Poker Hand (1000000 inst.) 8.85 sec
SPECT Heart 0.003 sec
Optical Recognition of Handwritten Digits 0.22 sec
** System Configuration :
GPU - GeForce GTX 280
CUDA Version 3.1
CPU - Dual Core AMD Opteron
RAM - 3.5 GB
OS - Windows XP SP2
Visual Studio 2008
I have certain doubts in my implementation in CUDA which I wanted to clarify with you such as
- I use global memory to accommodate large dataset, can I use shared memory for this purpose? As per my knowledge the latency for accessing global memory is more than that of shared memory.
- Is use of atomicAdd() affecting performance?
- Is recursive nature of algorithm affecting performance?
- It seems that second implementation is better than first one but results are different.
I would be very thankful if you could guide me for alternative changes in the design of algorithm.