I’ve been looking at this off and on for about a week, and have gotten a test program to run a standard C++ approach on a dataset and then a CUDA version for benchmarking purposes. Sadly, the C++ version runs faster :(
What bothers me, though, is that the CUDA version takes almost the same time whether it runs on the card or under device emulation. Of course, the standard C++ version wouldn’t change, but if I have 256 threads per block on a 2-GPU card, shouldn’t it run faster than it would under device emulation? Is this likely because the algorithm I’m using simply runs faster on the CPU, and the overhead of starting all of those threads is minimal? Or does it mean that I’m doing something wrong and the card is never actually being used?
Running only 256 threads, on the GPU and probably even on the CPU, is going to be within the margin of error for whatever timing you’re doing. A lot of my CUDA kernels spawn tens of millions of threads, so that should give you some idea of what a normal workload is.
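To check whether the card is being used at all, and to time a workload large enough to measure, something along these lines might help. It is only a sketch under assumptions: `scale_kernel` is a placeholder and not your algorithm, `CUDA_CHECK` and `blocks_for` are hypothetical helpers, and the array size of about 4 million elements is an arbitrary choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel (an assumption, not the original algorithm):
// each thread scales one element.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Number of blocks needed to cover n elements, rounding up.
int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

int main()
{
    const int n = 1 << 22;      // about 4 million elements (assumed size)
    const int threads = 256;    // threads per block, as in the question
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    scale_kernel<<<blocks_for(n, threads), threads>>>(dev, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());          // catches a launch that never started
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));  // waits until the kernel has finished

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Copy the result back so the output proves the kernel really ran.
    CUDA_CHECK(cudaMemcpy(host.data(), dev, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    std::printf("kernel time: %.3f ms, host[0] = %.1f\n", ms, host[0]);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dev));
    return 0;
}
```

If the kernel really executed on the device, `host[0]` comes back as 2.0. If the launch failed, the check after the launch reports it instead of letting the program carry on silently. Note that the events time only the kernel, so the host-to-device and device-to-host copies would need to be timed separately for a fair comparison with the CPU version.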
More details from your side would be appreciated. For example, how big a portion of your application are you moving to the GPU? What kind of processing is it? What do the kernels look like? Are you running into badly incoherent (divergent) branches or uncoalesced memory accesses? I’ve heard that there are some applications that don’t get much of a speedup compared to a well-tuned CPU implementation.
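In case the terms are unfamiliar, here is a rough illustration of what I mean by uncoalesced accesses and incoherent branches. The kernels and the `strided_index` helper are made up for this post and are not taken from anyone's real code; the stride of 32 and the array size are assumptions as well.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the element that thread i reads in the strided kernel.
__host__ __device__ int strided_index(int i, int stride, int n)
{
    return (int)(((long long)i * stride) % n);
}

// Coalesced: neighbouring threads read neighbouring addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: neighbouring threads read addresses `stride` elements apart,
// so a single warp touches many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[strided_index(i, stride, n)];
}

// Incoherent branch: even and odd threads of the same warp take different
// paths, so the warp has to execute both paths one after the other.
__global__ void branch_divergent(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) data[i] = sinf(data[i]);
        else                      data[i] = cosf(data[i]);
    }
}

int main()
{
    const int n = 1 << 22;   // assumed array size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaError_t err = cudaMemset(in, 0, n * sizeof(float));

    copy_coalesced<<<blocks, threads>>>(in, out, n);
    copy_strided<<<blocks, threads>>>(in, out, n, 32);
    branch_divergent<<<blocks, threads>>>(out, n);

    // Launch errors show up in cudaGetLastError, execution errors on sync.
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Timing each kernel on its own, or running them under a profiler, should show whether patterns like these matter for your code. How large the difference is depends on the hardware, so I would not assume a particular number.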