Fastest memory for variable data

I am currently writing a CUDA application that is combined with OpenGL, and the kernel should run about 60 times per second.

In my first approach I stored the data in constant memory, and the kernel took about 1.5 ms to execute. But I had not taken into account that the data are variable. So in my second approach I put the data into global memory and ran into a big performance problem: the kernel took about 36 ms.

My kernel configuration is 80x60 blocks with 8x8 threads each. Every thread accesses the same variable data.

Therefore, I put the data into shared memory.

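__global__ void kernel()
{
	// Dynamically sized shared memory; the size is passed at kernel launch.
	extern __shared__ ViewingSystem vSysMem[];

	// Only thread (0,0) of each block copies the data
	// (vSys: pointer to the data in global memory, declared elsewhere).
	if (!threadIdx.x && !threadIdx.y)
		vSysMem[0] = *vSys;

	// Wait until the copy is visible to all threads of the block.
	__syncthreads();

	...
}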

ViewingSystem is a struct with some floats, float3s and ints.

With this, the execution time decreased to 13 ms. This is, however, still too slow.

My question: what is the best type of memory if ALL threads access the same variable data? What about using a texture? I would have to regenerate the texture each frame, but texture accesses are fast.
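
Constant memory is read-only only from the kernel's point of view: the host can overwrite a `__constant__` symbol between kernel launches with `cudaMemcpyToSymbol`. The following is a minimal sketch of such a per-frame update, assuming the struct fits into the 64 KB of constant memory. The fields of `ViewingSystem`, the `render` kernel and the `renderFrame` helper are hypothetical placeholders, since the post does not show them, and whether this gets back to the 1.5 ms timing would have to be measured.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the ViewingSystem struct from the post;
// the real field names and layout are not given there.
struct ViewingSystem {
    float3 eye;
    float  focalLength;
    int    width;
    int    height;
};

// One copy in constant memory. Kernels can only read it, but the host
// may overwrite it between kernel launches.
__constant__ ViewingSystem cVSys;

// Illustrative kernel: every thread reads the same constant-memory struct.
__global__ void render(float* out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cVSys.width && y < cVSys.height)
        out[y * cVSys.width + x] = cVSys.eye.x + cVSys.focalLength;
}

// Hypothetical per-frame host routine: refresh the constants, then launch.
cudaError_t renderFrame(const ViewingSystem& hostVSys, float* dOut)
{
    cudaError_t err = cudaMemcpyToSymbol(cVSys, &hostVSys, sizeof(ViewingSystem));
    if (err != cudaSuccess) return err;

    dim3 block(8, 8);  // 8x8 threads per block, as in the post
    dim3 grid((hostVSys.width  + block.x - 1) / block.x,
              (hostVSys.height + block.y - 1) / block.y);
    render<<<grid, block>>>(dOut);
    return cudaGetLastError();
}
```

With `width = 640` and `height = 480`, this launch produces the 80x60 grid of 8x8 blocks described above.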

[quote name=‘kdahm666’ post=‘587203’ date=‘Sep 12 2009, 01:53 AM’]

[codebox]if (!threadIdx.x && !threadIdx.y)
	vSysMem[0] = *vSys;

[/codebox]

Based on your configuration of 64 threads per block, 63 threads (threadIdx.x != 0 and threadIdx.y != 0) would execute the statement "vSysMem[0] = *vSys;" simultaneously.

This is a conflict.

No :) The statement is equivalent to:

if (threadIdx.x == 0 && threadIdx.y == 0)

This means that only the first thread of each block will write.
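
The equivalence can be checked empirically. The sketch below counts how many threads of a launch enter the branch; `countWriters` and `writersPerLaunch` are hypothetical names introduced only for this illustration, and the code assumes a device that supports `atomicAdd` on global memory.

```cuda
#include <cuda_runtime.h>

// Counts how many threads of a launch pass the test from the post.
__global__ void countWriters(int* count)
{
    // !threadIdx.x is true only when threadIdx.x == 0, likewise for y,
    // so the branch is taken by thread (0,0) of each block and no other.
    if (!threadIdx.x && !threadIdx.y)
        atomicAdd(count, 1);
}

// Hypothetical helper: launches gridX * gridY blocks of 8x8 threads and
// returns the number of threads that entered the branch, or -1 on error.
int writersPerLaunch(int gridX, int gridY)
{
    int* dCount = nullptr;
    int hCount = -1;
    if (cudaMalloc(&dCount, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(dCount, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(dCount);
        return -1;
    }

    countWriters<<<dim3(gridX, gridY), dim3(8, 8)>>>(dCount);

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        hCount = -1;
    cudaFree(dCount);
    return hCount;
}
```

The count equals the number of blocks, so there is one writer per block and no write conflict on the block's shared memory.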

Exactly. My idea was that the first thread of each block writes the necessary information to shared memory.

All other threads of the block just read the information.
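
For reference, a self-contained sketch of this one-writer, many-readers pattern is shown below. It is not the original kernel: the `ViewingSystem` fields, the `renderShared` kernel and the `renderFrameShared` helper are hypothetical, and the global-memory pointer is passed as a kernel argument here because the post does not show how `vSys` is declared.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the ViewingSystem struct from the post.
struct ViewingSystem {
    float3 eye;
    float  focalLength;
    int    width;
    int    height;
};

__global__ void renderShared(const ViewingSystem* vSys, float* out)
{
    __shared__ ViewingSystem sVSys;   // one copy per block

    // Thread (0,0) performs the block's only read of the struct from global memory.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        sVSys = *vSys;

    // Every thread reaches this barrier; afterwards the copy is visible to all.
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < sVSys.width && y < sVSys.height)
        out[y * sVSys.width + x] = sVSys.eye.x + sVSys.focalLength;
}

// Hypothetical per-frame host routine: upload the struct to global memory, then launch.
cudaError_t renderFrameShared(const ViewingSystem& hostVSys,
                              ViewingSystem* dVSys, float* dOut)
{
    cudaError_t err = cudaMemcpy(dVSys, &hostVSys, sizeof(ViewingSystem),
                                 cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;

    dim3 block(8, 8);
    dim3 grid((hostVSys.width  + block.x - 1) / block.x,
              (hostVSys.height + block.y - 1) / block.y);
    renderShared<<<grid, block>>>(dVSys, dOut);
    return cudaGetLastError();
}
```

The barrier sits outside the `if`, so all 64 threads of a block reach it; placing `__syncthreads()` inside the branch would be an error.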