I’ve been looking at this off and on for about a week, and I’ve gotten a test program that runs a standard C++ implementation on a dataset and then a CUDA version, for benchmarking purposes. Sadly, the C++ version runs faster. :(
What bothers me, though, is that even under device emulation the two runtimes stay almost identical. The standard C++ version wouldn’t change, of course, but with 256 threads per block on a 2 GPU card, shouldn’t the real hardware run faster than device emulation does? Is this likely because the algorithm runs faster on a standard CPU anyway, and the overhead of launching all those threads is minimal? Or does it mean I’m doing something wrong, and the card is never actually being used?
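For reference, here’s a stripped-down sketch (not my actual code; the kernel and sizes are placeholders) of the kind of check I could add to rule out a silently failing launch, since kernel launch errors aren’t reported unless you ask for them, and to time the kernel on the device with events rather than on the host:

```cuda
// Sketch only: verify the kernel really launches, and time it with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // trivial placeholder work
}

int main() {
    const int n = 1 << 20;        // placeholder dataset size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // 256 threads per block
    cudaEventRecord(stop);

    // A failed launch is silent unless you check for it explicitly.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the launch check reports an error (or the event time is essentially zero), the card probably isn’t doing any work at all.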
Thanks for any tips.