Newbie: Am I actually running on the GPU? Device emulation doesn't change my runtimes

I’ve been looking at this off and on for about a week, and have gotten a test program to run a standard C++ approach on a dataset and then a CUDA version for benchmarking purposes. Sadly, the C++ version runs faster :(

What bothers me, though, is that even under device emulation, the two runtimes stay almost the same. Of course, the standard C++ version wouldn’t change, but if I have 256 threads per block on a dual-GPU card, shouldn’t it run faster on the hardware than it does under device emulation? Is it likely that the algorithm I’m using on the card simply runs just as fast on a standard CPU, and the overhead of starting all of those threads is minimal? Or does it mean that I’m doing something wrong, and the card is never actually being used?

Thanks for any tips.

Running 256 threads on the GPU, and probably even on the CPU, is going to be within the margin of error of whatever timing you’re doing. A lot of my CUDA kernels spawn tens of millions of threads, so that should give you some idea of what a normal workload looks like.
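Also make sure the timing itself is sound: kernel launches are asynchronous, so a host-side timer can stop before the kernel has actually finished. Here is a minimal, self-contained sketch using CUDA events (the kernel is a throwaway placeholder, not your algorithm, and the block count is just a representative number):

#include <cstdio>
#include <cuda_runtime.h>

// Throwaway placeholder kernel: each thread writes its global index.
__global__ void placeholderKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main() {
    const int threadsPerBlock = 256;
    const int numBlocks = 4096;                 // pick something representative of your workload
    const int n = threadsPerBlock * numBlocks;

    int *d_out = 0;
    cudaMalloc((void **)&d_out, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    placeholderKernel<<<numBlocks, threadsPerBlock>>>(d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

If your current benchmark times the whole program (including context setup and memory copies) rather than just the kernel, that alone can hide any GPU speedup.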

Hm, when I say 256, I guess I really meant per block.

In one case, the number of blocks is around 11,000, I think.
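If my arithmetic is right, that works out to roughly 11,000 × 256 ≈ 2.8 million threads in total. The launch itself is just the usual form, something like this (kernel and argument names changed):

myKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);   // numBlocks ≈ 11,000, threadsPerBlock = 256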

To check whether the GPU is actually being used, you can try something like the following in your code:

__global__ void myalgo() {
    // ... your kernel code ...

#ifdef __DEVICE_EMULATION__
    // __DEVICE_EMULATION__ is only defined when you compile with -deviceemu,
    // so if this message shows up, the kernel is running on the CPU, not the GPU.
    printf("Not enough cores!\n");
#endif

    // ... your kernel code ...
}
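Another check, on the host side right after the kernel launch, is to ask the runtime whether the launch and execution actually succeeded. A rough sketch (the launch line in the comment is only illustrative; use your own):

// Host code, immediately after the kernel launch, e.g. myalgo<<<blocks, threads>>>(...):
cudaError_t err = cudaGetLastError();          // catches launch failures (bad configuration, no device, ...)
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // wait for the kernel; cudaDeviceSynchronize on newer toolkits
if (err != cudaSuccess)
    printf("kernel execution error: %s\n", cudaGetErrorString(err));

If either call reports an error, the kernel never really ran on the card.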

More details from your side would be appreciated. For example, how big a portion of your application are you moving to the GPU? What kind of processing is it? What do the kernels look like? Are you running into heavily divergent branches or uncoalesced memory accesses? I’ve heard that there are some applications that don’t give much speedup compared to a well-tuned CPU implementation.
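To make the coalescing point concrete, here is an illustrative sketch (kernel and array names are made up, not from your code). In the first kernel, neighbouring threads read neighbouring addresses, which the hardware can combine into a few wide memory transactions; in the second they do not, and the scattered reads can easily erase any GPU advantage:

__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                    // thread i reads element i: coalesced
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];     // neighbouring threads read addresses far apart: uncoalesced
}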