Local memory loads limited to 64b gld transfers for CC > 1.1 and < 2.0?

Hello Folks,

I have a question about memory transactions from global to local memory on devices with compute capability between 1.2 and 2.0.

Analyzing the following OpenCL code with the NVIDIA Visual Profiler shows that only 64b gld transfers are used. I wonder why no 128b gld transfers are used, as I don't see why there should be a restriction. So the general question is: why don't I get 128b gld transfers for this example code:

__kernel void BufferAlignment( __global float* buf )
{
    //  buf points to a buffer of 32 * 4 bytes, that's 32 floats,
    //  which in principle could be loaded in a single 128b global load
    int lid0 = get_local_id(0);

    __local float locBuffer[32];

    // each work-item copies one float from global to local memory
    locBuffer[ lid0 ] = buf[ lid0 ];
}

The kernel is enqueued with enqueueNDRangeKernel for a single work-group (thread block) of 32 work-items.
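For illustration, the launch looks roughly like this (a sketch using the OpenCL C++ bindings; context, program and buffer setup are omitted, and the function and variable names are placeholders, not from my actual code):

#include <CL/cl.hpp>

// Sketch: launch one work-group of 32 work-items (= one warp = two half-warps).
// 'queue' and 'kernel' are assumed to be created elsewhere.
void launchBufferAlignment(cl::CommandQueue& queue, cl::Kernel& kernel)
{
    cl::NDRange globalSize(32);  // 32 work-items in total
    cl::NDRange localSize(32);   // all of them in a single work-group
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize);
}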

The profiler now tells me that two 64b gld transactions are used instead of one 128b gld. Can you tell me why?

Here is my first attempt at a theory: each half-warp of 16 threads requests 16 floats, i.e. 64 bytes. The devices I use have local memory with 16 banks, so all 16 writes into local memory can happen at the same time once the 64 bytes are loaded. The second half-warp then does the same.

But why doesn't it work like this: a 128b gld loads all 32 floats at once, and each half-warp then writes its portion?

Can anybody help me understand this, please?

Thanks.


Loads and stores are always done with half-warp granularity, at least on CC < 2.0. Not sure about 2.0+.
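That would also explain the profiler output, if I remember the CC 1.2/1.3 coalescing rules correctly: each half-warp issues its own request, and 16 threads * 4 bytes is only 64 bytes, so the hardware shrinks the 128-byte segment down to a 64-byte transaction. If the goal is to actually see 128b transactions, one option (a sketch I have not profiled; the kernel name is just illustrative, and the buffer would now have to hold 32 float2 values, i.e. 64 floats) is to let each work-item load a float2, so a half-warp touches 16 * 8 = 128 contiguous bytes:

// Sketch: same copy, but each work-item loads a float2 (8 bytes).
// A half-warp of 16 work-items then reads 16 * 8 = 128 contiguous bytes,
// filling a whole 128-byte segment, so the transaction should not be
// reduced to 64b on CC 1.2/1.3.
__kernel void BufferAlignmentFloat2( __global float2* buf )
{
    int lid0 = get_local_id(0);

    __local float2 locBuffer[32];

    locBuffer[ lid0 ] = buf[ lid0 ];
}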
