Why is a kernel that does nothing so slow?

I have an application that simply loops and calls a kernel. While trying to improve the speed of my test app, I found that merely changing the size of a local memory array dramatically affects the speed of the kernel. After cutting out all the code except some of the initialization, I could still reproduce the slowness. I would expect a kernel like this not to be influenced by the array size and to run lightning fast, as it effectively does nothing.

What is going on here?

__global__ void kernel_test()
{
    float4 memLocal1[600];     // <--- app slows the larger this value is, e.g. [6] = ~400fps, [600] = ~40fps
    __shared__ float4* ptr1;   // <--- pointer and memory must be in different spaces, e.g. shared and local; if they are the same, it is fast (or generates no code?)
    ptr1 = memLocal1;
}

Well, what I understand about your problem is that you cannot see why you are getting this slowness. I think it occurs because each thread of the kernel is allocating 600 * 16 bytes = 9600 bytes, roughly 9.4 KB (a float4 is 16 bytes). This is significant because you have n threads doing it, so the larger this value, the slower your kernel.
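If you want to confirm the per-thread figure, the compiler records it in the kernel's attributes. A minimal host-side sketch; the main() wrapper and the printf are mine, not from the posts above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_test()
{
    float4 memLocal1[600];   // per-thread local array, as in the kernel above
    __shared__ float4* ptr1;
    ptr1 = memLocal1;        // taking the address keeps the array from being optimized away
}

int main()
{
    // Query the attributes the compiler recorded for this kernel,
    // including the per-thread local memory footprint.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel_test);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}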

What I understand is that each thread can use up to 16 KB of local scoped memory. The amount allocated here is a compile-time figure, so it should not matter whether the value is small or large, as long as it does not exceed 16 KB. This test kernel performs the same whether 1 thread or 1000 threads are configured to run. What concerns me is that a real-time application can be limited to low performance even when no useful work is being performed.

Your memLocal array will be allocated in local memory, which is really just scoped global memory. Global memory is allocated dynamically, not at compile time. Internally, the CUDA runtime (or driver) must be doing the equivalent of a cudaMalloc for the local memory before actually launching the kernel and a cudaFree for it afterwards.

Local memory is never a good idea if you care about performance. If you need that much memory allocated per thread, allocate it once yourself with cudaMalloc and index into it.
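A minimal sketch of that suggestion, assuming a fixed 600-slot float4 scratch area per thread (kernel_scratch and SLOTS_PER_THREAD are my names, not from the posts above):

#include <cuda_runtime.h>

#define SLOTS_PER_THREAD 600

__global__ void kernel_scratch(float4* scratch)
{
    // Each thread indexes its own slice of one big, pre-allocated buffer.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4* mySlice = scratch + tid * SLOTS_PER_THREAD;
    mySlice[0] = make_float4(0.f, 0.f, 0.f, 0.f);   // example access
}

int main()
{
    int blocks = 32, threads = 32;
    float4* scratch;
    // One allocation covering all threads, done once instead of per launch.
    cudaMalloc(&scratch, sizeof(float4) * SLOTS_PER_THREAD * blocks * threads);
    kernel_scratch<<<blocks, threads>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}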

Thank you, I understand what you are saying. This scenario still does not make sense to me.

Sure, global memory is the slowest device memory available, but the kernel shown here does not perform a single read from or write to that memory.

The kernel shown can be run with <<<1,1>>> or <<<32,32>>> or other values and it performs the same.

Sure, run-time allocation may be slow, but I am allocating a single array. The confusing part is why the performance of this tiny dummy kernel varies linearly with the declared size of the array.
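One way to measure this directly is to time a batch of launches with CUDA events. A minimal sketch; the LOCAL_SIZE macro is my addition so the array size can be varied between builds:

#include <cstdio>
#include <cuda_runtime.h>

#ifndef LOCAL_SIZE
#define LOCAL_SIZE 600   // rebuild with e.g. -DLOCAL_SIZE=6 to compare
#endif

__global__ void kernel_test()
{
    float4 memLocal1[LOCAL_SIZE];
    __shared__ float4* ptr1;
    ptr1 = memLocal1;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a batch of launches, since a single empty launch is very short.
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
        kernel_test<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1000 launches with float4[%d]: %.2f ms\n", LOCAL_SIZE, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}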

The code shown should compile and run as described, but a real kernel might use local scoped (global) memory to store a recursion stack while using shared memory to cache the top levels of that stack to improve performance. (In a real implementation, each thread would use its own portion of a shared memory array, unless the stack were maintained by only some of the threads.) I am just giving a real example of why such code could exist. A hedged sketch of that pattern follows.
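To make the motivation concrete, here is a sketch with the top CACHE_DEPTH stack levels in shared memory and the rest overflowing into a per-thread local array. All names, sizes, and the push/pop logic are illustrative, not the real application:

#include <cuda_runtime.h>

#define STACK_DEPTH 600   // deep stack lives in local (global-backed) memory
#define CACHE_DEPTH 4     // top few levels cached in fast shared memory
#define THREADS     64

__global__ void kernel_stack(const float4* input, float4* output)
{
    float4 stackLocal[STACK_DEPTH];                    // per-thread overflow area
    __shared__ float4 stackTop[THREADS][CACHE_DEPTH];  // per-thread cached top levels

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int depth = 0;

    // Push: the hot top levels stay in shared memory; deeper
    // entries spill into the local (global-backed) array.
    float4 v = input[gid];
    if (depth < CACHE_DEPTH)
        stackTop[tid][depth] = v;
    else
        stackLocal[depth - CACHE_DEPTH] = v;
    ++depth;

    // Pop and write a result, just to keep the data live.
    --depth;
    output[gid] = (depth < CACHE_DEPTH) ? stackTop[tid][depth]
                                        : stackLocal[depth - CACHE_DEPTH];
}

int main()
{
    float4 *in, *out;
    cudaMalloc(&in,  sizeof(float4) * THREADS);
    cudaMalloc(&out, sizeof(float4) * THREADS);
    kernel_stack<<<1, THREADS>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}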

As for performing cudaMalloc and cudaFree outside the kernel, I regularly do this, as I'm sure it is bad practice to hold on to video memory when it is not needed. I typically allocate scratch buffers just before kernel execution and free them directly afterwards, and I find no performance penalty for this. I will, however, manually allocate some global memory in place of the local memory shown in this code, just to see how it performs.