Improving CUDA kernel performance

I’ve just written my first kernel.
I took a function I already use, which is somewhat CPU heavy, and made it into a kernel.
I benchmarked both versions over 1000 runs: the old CPU version takes about 2 seconds, while the CUDA-enabled version takes 30 seconds to complete.
I was like O_o?
The function is somewhat “control-code-intensive”, so maybe it’s just better suited to be run on a CPU, but I really didn’t expect it to execute that much slower.

How/where do I begin looking for the bottleneck?
What are the common n00b mistakes that I could have made?

It’s not really relevant to me if this function is well suited for a cuda device, I’m just trying to learn the tricks :)


Examine your memory access pattern carefully. Uncoalesced memory reads/writes can easily slow performance by a factor of 10 or more.

The common mistakes are:

  1. Comparing just a couple of threads launched on a GPU vs the CPU. Due to the interleaved execution, a GPU doesn’t really hit full operating efficiency unless you are running ~10,000 independent threads.

  2. Counting host<->device transfer time in your program. Yes, to make a “fair” comparison in the end you may need to include this, but it is really, REALLY slow. So if you are trying to test the performance of a kernel, you need to leave it out of the timing. Additionally, the fastest CUDA applications push every step onto the GPU so host<->device copies are no longer needed except at initialization.

  3. Poor memory access pattern (mentioned above)

  4. Poor block configuration (i.e. blocks of 1 thread execute in the same time as blocks of 32)

These are the biggest mistakes I’ve seen made in “What! GPU is slower than the CPU!” threads. Just remember, the programming guide is your friend. It has everything in it you need, including many performance guidelines, to write optimal CUDA code. If you don’t learn well from manuals, then check out the FAQ: it has links to other great resources to get the same information in a classroom format.
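
To make points 1, 2 and 4 concrete, here is a minimal sketch of timing only the kernel with CUDA events, so the host<->device copies stay out of the measurement. The kernel `scale`, the helper `time_kernel_ms` and all sizes are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread scales one element.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Average time of one launch in milliseconds, measured with CUDA events.
// Only kernel execution falls between the two events, no memcpy.
float time_kernel_ms(float *d_data, int n, int runs)
{
    const int threads = 256;                         // a multiple of the 32-thread warp
    const int blocks  = (n + threads - 1) / threads; // enough blocks to cover n

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<blocks, threads>>>(d_data, n, 1.0f);     // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        scale<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                      // wait until all launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / runs;
}

int main()
{
    const int n = 1 << 20;                           // about a million threads
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    printf("average kernel time: %.4f ms\n", time_kernel_ms(d_data, n, 1000));

    cudaFree(d_data);
    return 0;
}
```

Changing `threads` to 1 in this sketch is a quick way to see for yourself what point 4 costs.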

Thanks a lot for your reply :)

  1. This is part of my problem without a doubt. The problem I tried isn’t easily parallelized, but I still didn’t expect such a slowdown. Will move on to a better problem.

  2. I did one large copy H=>D first, and then I copied the result (3 floats) back in each cycle. Maybe I should have tried without.

  3. I have absolutely no idea what a “good access pattern” is, so I will need to read up on that. Is the profiler the tool for examining memory access?

  4. Will keep that in mind too :)

Will read over the entire programming guide once again, only skimmed it first time.


The “usual” mistake is to include cudaMalloc/cudaMemcpy of the whole big array each time (cudaMalloc is especially expensive). Copying just 3 floats back will only change the timing by ~20 microseconds, so that isn’t so bad.
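
As a rough sketch of the pattern that avoids this mistake: allocate and upload once, then only launch the kernel and copy the small result inside the loop. The kernel `step` and the helper `run_iterations` are hypothetical placeholders, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread writes three result values for this iteration.
__global__ void step(const float *in, int n, float *out3, int iter)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out3[0] = in[0];
        out3[1] = in[n - 1];
        out3[2] = (float)iter;
    }
}

// Runs `iterations` launches and leaves the last 3-float result in h_out.
void run_iterations(int n, int iterations, float h_out[3])
{
    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;

    // Done ONCE, outside the loop: allocation and the big host->device copy.
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, 3 * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    for (int it = 0; it < iterations; ++it) {
        step<<<1, 32>>>(d_in, n, d_out, it);
        // Only the tiny result comes back each cycle; this copy also
        // waits for the kernel to finish.
        cudaMemcpy(h_out, d_out, 3 * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
}

int main()
{
    float h_out[3] = {0.0f, 0.0f, 0.0f};
    run_iterations(1 << 20, 1000, h_out);
    printf("%f %f %f\n", h_out[0], h_out[1], h_out[2]);
    return 0;
}
```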

Coalesced = good. The programming guide has lots of examples and all the rules defining coalescing. They will take a while to sink in, but you will eventually get the hang of it. We’ll always answer questions here, too: especially if you’ve read the guide first and are coming here for clarification.
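
A small experiment can help it sink in. The sketch below, with made-up names `copy_with_stride` and `time_copy_ms`, runs the same copy kernel twice: with a stride of 1, neighbouring threads read neighbouring addresses (coalesced); with a stride of 32, each thread in a warp reads from a different region of memory. The writes are coalesced in both cases, so only the read pattern differs. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Thread i reads element (i * stride) % n and writes element i.
// stride == 1 gives consecutive reads across a warp; larger strides scatter them.
__global__ void copy_with_stride(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[i] = in[j];
    }
}

// Average time of one launch in milliseconds for the given stride.
float time_copy_ms(int n, int stride, int runs)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_with_stride<<<blocks, threads>>>(d_in, d_out, n, stride); // warm-up

    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        copy_with_stride<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms / runs;
}

int main()
{
    const int n = 1 << 22;
    printf("stride  1: %.4f ms per launch\n", time_copy_ms(n, 1, 100));
    printf("stride 32: %.4f ms per launch\n", time_copy_ms(n, 32, 100));
    return 0;
}
```

I would expect the stride-32 run to be noticeably slower, but how much slower depends on the GPU generation and its caches, so treat the numbers as a local measurement only.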

If you absolutely cannot coalesce (i.e. semi-random access pattern), then there are other ways (textures, constant memory, shared memory staging). I can’t give any rules of thumb because there are so many options and the best must be determined on a case by case basis.
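
To give one example of shared memory staging (not a rule of thumb, just a common illustration): a matrix transpose naively needs either strided reads or strided writes. Staging a tile in shared memory lets both the global read and the global write run along rows. The names `transpose_tiled` and `transpose_matches_reference`, and the 32x32 tile size, are my own choices for this sketch; error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 32

// in is height x width (row-major); out is width x height (row-major).
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    // +1 column of padding so column-wise reads of the tile avoid bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise global read

    __syncthreads();                                           // tile fully loaded

    x = blockIdx.y * TILE + threadIdx.x;                       // block indices swapped
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise global write
}

// Runs the kernel and compares against a plain host transpose.
bool transpose_matches_reference(int width, int height)
{
    std::vector<float> h_in(width * height), h_out(width * height);
    for (int i = 0; i < width * height; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            if (h_out[c * height + r] != h_in[r * width + c]) return false;
    return true;
}

int main()
{
    // 48 is deliberately not a multiple of TILE, to exercise the bounds checks.
    printf("transpose %s\n", transpose_matches_reference(64, 48) ? "ok" : "MISMATCH");
    return 0;
}
```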

Well, I didn’t do THAT :) At least I know the problem isn’t with memory copying then, as 1000 iterations means ~20 milliseconds of my 30 seconds, so the problem is the kernel, one way or the other.

Ofc. I often answer questions in gentoo’s forums when I can, and I know how much better “treatment” one tends to give to well (in)formed questions. :)

I’m glad to know that there is some flexibility, so that CUDA can be applied to a wide variety of problems. That increases my chances of success :)

Thank you again for your time.

The shared memory space maps to the scratchpad memory of the GPU, and is local to each thread
block. The texture memory space uses the GPU’s texture caching and filtering capabilities, and
is best utilized with data access patterns exhibiting 2-D locality. More detailed information about
GPU architecture and how features of the CUDA model affect application performance is presented