In the SDK, there is a sample project called cppIntegration. I have calculated the time taken by the kernel function and the normal C++ function using this method.
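(The original timing code isn't quoted here, but a common way to measure this is with CUDA events for the kernel and a host clock for the CPU loop. The sketch below is a minimal, hypothetical example of that approach, not the cppIntegration code itself; `addOne` is just a stand-in kernel.)

```cuda
// Hypothetical timing sketch: CUDA events for the kernel, std::chrono for the CPU loop.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 16;                 // same tiny size as cppIntegration
    int h[n] = {0};
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // GPU timing with events (measures only the kernel, not the copies)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    addOne<<<1, n>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    // CPU timing of the equivalent loop
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)
        h[i] += 1;
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    printf("GPU kernel: %f ms, CPU loop: %f ms\n", gpuMs, cpuMs);

    cudaFree(d);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```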
Not every example in the SDK has to run faster on the GPU. ;-) In this case, the amount of processed data is very small (an array of 16 elements), so even with a GeForce GTX 285 the computation is slower on the GPU than on the CPU. I am not sure about the exact reason; maybe it's because the kernel call itself is more time consuming than the (small) kernel computation. Maybe someone else can give a better and deeper explanation.
Ditto Sarnath. Use the GPU whenever you process very large vectors or matrices, especially when you need to perform algorithms with complexity worse than O(n), e.g. matrix multiplication. cppIntegration has complexity O(n) and is performed on a 16-element vector; there the GPU is useless, because even moving the data from the CPU to the GPU and launching the kernel costs much more time than the computation on the CPU.
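To see the overheads directly, you can time a host-to-device copy of a 16-element vector and an empty kernel launch. This is a rough micro-benchmark sketch of my own (not from the SDK); the exact numbers depend on your system, but both are typically far longer than the CPU needs to process 16 elements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}   // empty kernel: measures pure launch cost

int main()
{
    const int n = 16;
    float h[n] = {0};
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the host-to-device copy of the tiny vector
    cudaEventRecord(start);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float copyMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, stop);

    // Time a kernel that does no work at all
    cudaEventRecord(start);
    dummy<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float launchMs = 0.0f;
    cudaEventElapsedTime(&launchMs, start, stop);

    printf("H2D copy of 16 floats: %f ms, empty kernel launch: %f ms\n",
           copyMs, launchMs);

    cudaFree(d);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```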
I have one more question regarding this. I have a small data set, say an array of 100 elements, but inside the FOR loop there are about 50 more calculations and lots of arithmetic operations like +, -, /, *, square roots, etc. So is it OK to port this kind of code to the GPU with CUDA?
If you choose to expose the parallelism only in terms of the 100 elements, then you have just 100 threads, which is far too few to keep a GPU busy.
If you choose to expose the parallelism present in the FOR loop as well as across the 100 elements, then you have FOR_LOOP_ITERATIONS*100 threads, which could be a good number for the GPU (see the sketch after this reply).
The main criterion for parallelizing FOR loops is that there must NOT be any dependence between iterations. If an iteration depends on the results of previous iterations, the iterations cannot be expressed as parallel threads.
You may want to think about this and choose the parallel perspective that will feed the GPU elephant correctly!
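To make the two options concrete, here is a hypothetical sketch. The actual computation from the question isn't shown in the thread, so `sqrtf(in[i] * j + 1.0f)` is just stand-in arithmetic, and `FOR_LOOP_ITERATIONS = 50` is an assumption based on the "about 50 calculations" mentioned above. The first kernel uses one thread per element; the second flattens the independent inner iterations into the grid as well.

```cuda
#include <cuda_runtime.h>

#define N 100                    // number of elements (from the question)
#define FOR_LOOP_ITERATIONS 50   // hypothetical inner-loop count

// Option 1: one thread per element; each thread runs the whole inner loop.
// Only 100 threads in flight, far too few to occupy a GPU.
// Launch e.g.: perElement<<<(N + 127) / 128, 128>>>(out, in);
__global__ void perElement(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float acc = 0.0f;
    for (int j = 0; j < FOR_LOOP_ITERATIONS; ++j)
        acc += sqrtf(in[i] * j + 1.0f);   // stand-in arithmetic
    out[i] = acc;
}

// Option 2: one thread per (element, iteration) pair, valid only because
// the iterations are independent of each other. 100*50 = 5000 threads.
// Launch e.g.: perIteration<<<N, FOR_LOOP_ITERATIONS>>>(partial, in);
__global__ void perIteration(float *partial, const float *in)
{
    int i = blockIdx.x;    // element index, 0..N-1
    int j = threadIdx.x;   // iteration index, 0..FOR_LOOP_ITERATIONS-1
    if (i < N && j < FOR_LOOP_ITERATIONS)
        partial[i * FOR_LOOP_ITERATIONS + j] = sqrtf(in[i] * j + 1.0f);
    // the per-element results would then be combined in a separate reduction
}
```

Note that option 2 only pays off if the per-iteration results really can be computed independently and combined afterwards; if each of the 50 calculations feeds the next one, you are back to option 1.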
I am running into a similar issue here. I have several arrays (6 of them), but they are all 3x3 arrays, so computation on the GPU seems to be VERY slow compared to the CPU. The results match the CPU, but it is much, much slower.
Am I better off not using CUDA here? Or are there other techniques I can use to speed up the CUDA version to the point that it can outperform the CPU?