Why is a kernel that does nothing so slow?

I have an application that simply loops and calls a kernel. While trying to improve the speed of my test app, I found that merely changing the size of a local memory array dramatically affects the speed of the kernel. After cutting out all the code except some of the initialization, I could still reproduce the slowness. I would expect a kernel like this not to be influenced by the array size and to run lightning fast, as it effectively does nothing.

What is going on here?

__global__ void kernel_test()
{
    float4 memLocal1[600];     // <--- app slows the larger this value is, e.g. [6] = ~400fps, [600] = ~40fps
    __shared__ float4* ptr1;   // <--- pointer and memory must be in different spaces, e.g. shared and local; if they are the same, it is fast (or generates no code?)
    ptr1 = memLocal1;
}

Well, what I understand about your problem is that you cannot see why you are getting this slowness. I think it occurs because each thread of the kernel is allocating 600 * 16 bytes = 9600 bytes, roughly 9.4 KB (a float4 is 16 bytes). This is significant because you have n threads doing it, so the larger this value, the slower your kernel.
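If you want to confirm the per-thread figure, the compiler records it in the kernel's attributes. A minimal host-side sketch; the main() wrapper and the printf are mine, not from the posts above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_test()
{
    float4 memLocal1[600];   // per-thread local array, as in the kernel above
    __shared__ float4* ptr1;
    ptr1 = memLocal1;        // taking the address keeps the array from being optimized away
}

int main()
{
    // Query the attributes the compiler recorded for this kernel,
    // including the per-thread local memory footprint.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel_test);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}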

What I understand is that each thread can use up to 16 KB of local scoped memory. The amount allocated here is a compile-time figure, so it should not matter whether the value is small or large, as long as it does not exceed 16 KB. This test kernel performs the same whether 1 thread or 1000 threads are configured to run. What concerns me is that a real-time application can be limited to low performance even when no useful work is being performed.

Your memLocal array will be allocated in local memory, which is really just scoped global memory. Global memory is allocated dynamically, not at compile time. Internally, the CUDA runtime (or driver) must be doing the equivalent of a cudaMalloc for the local memory before actually launching the kernel and a cudaFree for it afterwards.

Local memory is never a good idea if you care about performance. If you need that much memory allocated per thread, allocate it once yourself with cudaMalloc and index into it.
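A minimal sketch of that suggestion, assuming a fixed 600-slot float4 scratch area per thread (kernel_scratch and SLOTS_PER_THREAD are my names, not from the posts above):

#include <cuda_runtime.h>

#define SLOTS_PER_THREAD 600

__global__ void kernel_scratch(float4* scratch)
{
    // Each thread indexes its own slice of one big, pre-allocated buffer.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4* mySlice = scratch + tid * SLOTS_PER_THREAD;
    mySlice[0] = make_float4(0.f, 0.f, 0.f, 0.f);   // example access
}

int main()
{
    int blocks = 32, threads = 32;
    float4* scratch;
    // One allocation covering all threads, done once instead of per launch.
    cudaMalloc(&scratch, sizeof(float4) * SLOTS_PER_THREAD * blocks * threads);
    kernel_scratch<<<blocks, threads>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}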

Thank you, I understand what you are saying. This scenario still does not make sense to me.

Sure, global memory is the slowest device memory available, but the kernel shown here does not perform a single read from or write to that memory.

The kernel shown can be run with <<<1,1>>> or <<<32,32>>> or other values and it performs the same.

Sure, run-time allocation may be slow, but I am allocating a single array. The confusing part is why the performance of this tiny dummy kernel varies linearly with the declared size of the array.
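One way to measure this directly is to time a batch of launches with CUDA events. A minimal sketch; the LOCAL_SIZE macro is my addition so the array size can be varied between builds:

#include <cstdio>
#include <cuda_runtime.h>

#ifndef LOCAL_SIZE
#define LOCAL_SIZE 600   // rebuild with e.g. -DLOCAL_SIZE=6 to compare
#endif

__global__ void kernel_test()
{
    float4 memLocal1[LOCAL_SIZE];
    __shared__ float4* ptr1;
    ptr1 = memLocal1;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a batch of launches, since a single empty launch is very short.
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
        kernel_test<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1000 launches with float4[%d]: %.2f ms\n", LOCAL_SIZE, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}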

The code shown should compile and run as described, but a real kernel might use local scoped (global) memory to store a recursion stack while using shared memory to cache the top levels of that stack to improve performance. (In a real implementation, each thread would use its own portion of a shared memory array, unless the stack were maintained by only some of the threads.) I am just giving a real example of why such code could exist. A hedged sketch of that pattern follows.
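To make the motivation concrete, here is a sketch with the top CACHE_DEPTH stack levels in shared memory and the rest overflowing into a per-thread local array. All names, sizes, and the push/pop logic are illustrative, not the real application:

#include <cuda_runtime.h>

#define STACK_DEPTH 600   // deep stack lives in local (global-backed) memory
#define CACHE_DEPTH 4     // top few levels cached in fast shared memory
#define THREADS     64

__global__ void kernel_stack(const float4* input, float4* output)
{
    float4 stackLocal[STACK_DEPTH];                    // per-thread overflow area
    __shared__ float4 stackTop[THREADS][CACHE_DEPTH];  // per-thread cached top levels

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int depth = 0;

    // Push: the hot top levels stay in shared memory; deeper
    // entries spill into the local (global-backed) array.
    float4 v = input[gid];
    if (depth < CACHE_DEPTH)
        stackTop[tid][depth] = v;
    else
        stackLocal[depth - CACHE_DEPTH] = v;
    ++depth;

    // Pop and write a result, just to keep the data live.
    --depth;
    output[gid] = (depth < CACHE_DEPTH) ? stackTop[tid][depth]
                                        : stackLocal[depth - CACHE_DEPTH];
}

int main()
{
    float4 *in, *out;
    cudaMalloc(&in,  sizeof(float4) * THREADS);
    cudaMalloc(&out, sizeof(float4) * THREADS);
    kernel_stack<<<1, THREADS>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}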

As for performing cudaMalloc and cudaFree outside the kernel, I regularly do this, as I'm sure it is bad practice to hold on to video memory when it is not needed. I typically allocate scratch buffers just before kernel execution and free them directly afterwards, and I find no performance penalty for this. I will, however, manually allocate some global memory in place of the local memory shown in this code, just to see how it performs.