__global__ void func(..., float2 * m, int w, ...)
{
...
for (i += n; i < w; i += n)
{
float2 t = m[i];
...
Loading 8 bytes in a loop sounds simple. But when I inspected the -ptx output, to my surprise, instead of a single 8-byte load (ld.global.v2.f32) the compiler generated two 4-byte load instructions (ld.global.f32).
Playing around with the code, I managed to make nvcc do what I want. The following code works as expected, with a single 8-byte load instruction in the loop:
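For reference, this is roughly how the PTX can be inspected (a sketch: the file name kernel.cu is an assumption, and nvcc must be on the PATH):

```shell
# Emit PTX instead of building a binary.
nvcc -ptx kernel.cu -o kernel.ptx
# A vectorized load appears as ld.global.v2.f32;
# the split case appears as two separate ld.global.f32 instructions.
grep "ld.global" kernel.ptx
```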
__global__ void func(..., float2 * m, int w, ...)
...
for (i += n, m += n; i < w; i += n, m += n)
{
float2 t = *m;
...
It depends on when you use t.x and t.y. If there is a lot of code produced between the usages of t.x and t.y, the compiler may decide to save a register and not load the 2nd half until later. It is unfortunate that this breaks coalescing, but one could imagine circumstances where there are enough computations between the two memory reads that there is actually a latency-hiding benefit to split them.
I’ve found that using the volatile keyword on the variable receiving the load:
for (i += n; i < w; i += n)
{
volatile float2 t = m[i];
will usually convince the compiler to load the full type in one read.
As for the issue with stores, I’ve never seen the compiler break them up before.
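Putting the volatile workaround into a complete kernel, a minimal sketch might look like this (the kernel name, the grid-stride indexing, and the dummy computation are illustrative assumptions, not from the original post; only the load pattern matters):

```cuda
// Sketch: volatile on the destination variable to encourage a single
// vectorized ld.global.v2.f32 instead of two scalar loads.
__global__ void func(float2 *m, float *out, int w)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n   = blockDim.x * gridDim.x;        // grid-stride step
    float acc = 0.0f;
    for (int i = tid; i < w; i += n)
    {
        volatile float2 t = m[i];            // hint: load both halves at once
        acc += t.x * t.y;                    // dummy use of both components
    }
    if (tid < w)
        out[tid] = acc;
}
```

Whether the compiler honors the hint can still vary between nvcc versions, so it is worth re-checking the PTX after any toolkit upgrade.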
I assume that you are accessing global memory. Personally, I think this is an example of the problem that CUDA does not recognize the address space of the pointer from the kernel argument alone. Could you try declaring the pointer explicitly, like this:
__device__ float2 *m;
__global__ void func(..., float2 * m_arg, int w, ...)
{
m = m_arg;  // __device__ qualifies the declaration above; it is not a valid cast qualifier
...
for (i += n; i < w; i += n)
{
float2 t = m[i];
...