I’ve just written my first kernel.
I took a function I use elsewhere that is somewhat CPU-heavy, and made it into a kernel.
I benchmarked both versions over 1000 runs: the old CPU version takes about 2 seconds, while the CUDA-enabled version takes 30 seconds to complete.
I was like O_o?
The function is somewhat “control-code-intensive”, so maybe it’s just better suited to be run on a CPU, but I really didn’t expect it to execute that much slower.
How/where do I begin looking for the bottle-neck?
What are the common n00b-mistakes that I could have made?
It’s not really relevant to me if this function is well suited for a cuda device, I’m just trying to learn the tricks :)
Examine your memory access pattern carefully. Uncoalesced memory reads and writes can easily slow performance by a factor of 10 or more.
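To make that concrete, here is a hedged sketch (kernel and variable names are made up) contrasting a coalesced access pattern with a strided one that defeats coalescing:

```cuda
// Coalesced: consecutive threads read consecutive floats, so the
// hardware can combine a warp's accesses into a few wide transactions.
__global__ void scale_coalesced(float *out, const float *in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// Uncoalesced: each thread walks its own contiguous chunk, so at any
// instant consecutive threads touch addresses far apart and each
// access tends to become its own memory transaction.
// (Assumes n is a multiple of blockDim.x, for brevity.)
__global__ void scale_strided(float *out, const float *in, float s, int n)
{
    int chunk = n / blockDim.x;
    int start = threadIdx.x * chunk;
    for (int i = start; i < start + chunk; ++i)
        out[i] = s * in[i];
}
```

Both kernels do identical arithmetic; only the mapping of threads to addresses differs, and that alone can account for an order-of-magnitude gap.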
Comparing just a couple of threads launched on the GPU against the CPU. Due to the interleaved execution, a GPU doesn’t really hit full operating efficiency unless you are running ~10,000 independent threads.
Counting host<->device transfer time in your program. Yes, to make a “fair” comparison in the end you may need to include this, but it is really, REALLY slow. So if you are trying to test the performance of a kernel you need to leave it out of the timing. Additionally, the fastest CUDA applications push every step onto the GPU so that host<->device copies are no longer needed except at initialization.
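A minimal timing sketch using CUDA events (the kernel name and launch parameters are placeholders) that brackets only the kernel, leaving the copies out of the measurement:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // not timed

cudaEventRecord(start);
myKernel<<<blocks, threads>>>(d_out, d_in, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // kernel time only

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // not timed
```

Events are recorded on the GPU’s own timeline, so this also avoids the classic mistake of timing an asynchronous launch with a host clock and concluding the kernel took microseconds.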
Poor memory access pattern (mentioned above)
Poor block configuration (i.e. blocks of 1 thread execute in the same time as blocks of 32)
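On the block-configuration point, a sketch of a sensible default launch (kernel name is made up): the hardware schedules threads in warps of 32, so a block of 1 thread still occupies a full warp and wastes 31 lanes.

```cuda
int n = 1 << 20;                    // one million elements, say
int threadsPerBlock = 256;          // a multiple of the warp size (32)
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
myKernel<<<blocks, threadsPerBlock>>>(d_out, d_in, n);
```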
These are the biggest mistakes I’ve seen made in “What! GPU is slower than the CPU!” threads. Just remember, the programming guide is your friend. It has everything in it you need, including many performance guidelines, to write optimal CUDA code. If you don’t learn well from manuals, then check out the FAQ: it has links to other great resources to get the same information in a classroom format.
The “usual” mistake is to include cudaMalloc/cudaMemcpy of the whole big array each time (cudaMalloc is especially expensive). Copying just 3 floats back will only change the timing by ~20 microseconds, so that isn’t so bad.
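A hedged sketch of the fix: hoist cudaMalloc/cudaFree (and the big upload) out of the benchmark loop and reuse the same device buffer across runs. Names here are illustrative only.

```cuda
float *d_data;
cudaMalloc(&d_data, bytes);                    // once, at initialization
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // once

for (int run = 0; run < 1000; ++run) {
    myKernel<<<blocks, threads>>>(d_data, n);  // reuse the same buffer
}
// Copying only a small result back each run is cheap (~20 us for 3 floats).
cudaMemcpy(h_result, d_data, 3 * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_data);                              // once, at shutdown
```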
Coalesced = good. The programming guide has lots of examples and all the rules defining coalescing. They will take a while to sink in, but you will eventually get the hang of it. We’ll always answer questions here, too: especially if you’ve read the guide first and are coming here for clarification.
If you absolutely cannot coalesce (i.e. semi-random access pattern), then there are other ways (textures, constant memory, shared memory staging). I can’t give any rules of thumb because there are so many options and the best must be determined on a case by case basis.
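As one example of the shared-memory-staging option (a sketch; the kernel, the `perm` index array, and the block-local access pattern are hypothetical): load a tile with coalesced reads, then do the irregular accesses out of fast on-chip shared memory instead of global memory.

```cuda
__global__ void stage_then_gather(float *out, const float *in,
                                  const int *perm, int n)
{
    __shared__ float tile[256];     // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];  // coalesced load into shared memory
    __syncthreads();                // tile is complete for the whole block

    if (i < n) {
        // Semi-random access, but confined to the block's tile,
        // now hits shared memory rather than global memory.
        int j = perm[i] % blockDim.x;
        out[i] = tile[j];
    }
}
```

This only helps when the irregular accesses have block-level locality; for truly global scatter/gather, textures or constant memory may fit better, which is why it stays case by case.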
The shared memory space maps to the scratchpad memory of the GPU, and is local to each thread block. The texture memory space uses the GPU’s texture caching and filtering capabilities, and is best utilized with data access patterns exhibiting 2-D locality. More detailed information about GPU architecture and how features of the CUDA model affect application performance is presented