Code ULFs on 480, not 295, ideas?

My code has been running on CUDA 3.0 and the 195 drivers pretty much since their respective first beta versions were released, without any problems.
Today I got my first GTX480 cards and my code dies with an unspecified launch failure. I looked in the programming guide hoping to find “Things to keep in mind when migrating from GTX2xx → GTX4xx”, but didn’t find anything interesting.

Has something similar hit anyone else?
When they get freed up, I'm gonna take one 295 and put it in the new computer, to rule out the other hardware.

One kernel in hoomd exhibited this behavior. I haven’t had time to look further into it yet.

I remember the maximum number of registers per thread was reduced (to 63 on Fermi, down from 128), but your code probably uses fewer registers than that. Also, did you include the PTX version in the exe file?
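For reference, embedding PTX alongside the compiled SASS is done with nvcc's -gencode flags; a hypothetical build line (file names made up) that targets GT200 and also embeds compute_13 PTX for the driver to JIT on newer architectures might look like:

```shell
# Hypothetical: emit sm_13 machine code plus compute_13 PTX so the driver
# can JIT-compile the kernel for architectures that didn't exist at build time.
nvcc -c kernel.cu -o kernel.o \
     -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_13,code=compute_13
```

Without the embedded PTX, a binary built only for sm_13 has no code the driver can run or JIT on a GTX 480.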

Out of bounds shared memory accesses will cause a ULF on 480 but not 295.

(so if you’re doing any of the stupid nonsense to get full 16k of shmem on GT200, yep, that’s what’s doing it. I told you guys not to do it!)
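To illustrate the failure mode, here's a hypothetical sketch (kernel name and sizes made up) of the kind of off-by-one that GT200 silently tolerated but that now ULFs:

```cuda
// Launched with n * sizeof(float) bytes of dynamic shared memory.
extern __shared__ float buf[];

__global__ void oobShared(int n)
{
    int i = threadIdx.x;
    if (i <= n)          // off-by-one: i == n is one element past the end
        buf[i] = 0.0f;   // ULF on GTX 480; went unnoticed on GTX 295
}
```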

That's a request from the wishlist thread, I think :) Very nice feature…

Btw, another possible cause of failure: kernel parameters are now passed in constant memory. I don't know what happens if you were already using all of constant memory; probably a failure.

ptxas info : Used 57 registers, 320+0 bytes lmem, 4328+16 bytes smem, 52 bytes cmem[0], 256 bytes cmem[1], 8 bytes cmem[14]

So I'm not using all shared memory. Getting close on the registers, though.
Not sure about the other numbers; am I using all my constant memory? There are an awful lot of parameters to the kernel, so I can't rule that out.

Also did you include ptx version in exe file
I’m not sure what that means?

Out of bounds shared memory accesses will cause a ULF on 480 but not 295.

I'm gonna guess this is it, then. I'll be testing some more to verify things are right, but if I remove the one code path that uses a lot of shared memory and replace it with a non-shared version, the ULF goes away.

Sorry, I thought you got an error with precompiled kernels.

I’m not doing anything of the sort in hoomd. The kernel that fails only uses 24 bytes of dynamically allocated shared memory for storing commonly used coefficients.

The kernels that do not fail store float, float, float2 in shared memory.

The kernels that do fail store float, float, float4 in shared memory.

Both are addressed the way the programming guide recommends in section B.2.3 (it's actually the same kernel, templated by parameter type and computation):

extern __shared__ char array[];

...

float *array0 = (float*)(&array[0]);

float *array1 = (float*)(&array[4]);

float2 *array2 = (float2*)(&array[8]);  // kernel works fine

// float4 *array2 = (float4*)(&array[8]); // kernel ULFs

In the general kernel, each array (array0, array1, array2) will be N elements long, but even this degenerate case with a single float, float, float4 causes ULFs. And yes, I triple-checked that the correct value is being set for shared memory in the kernel launch configuration.

I find that if I rearrange the order of the arrays to float4, float, float, the ULF goes away!

extern __shared__ char array[];

...

float4 *array2 = (float4*)(&array[0]);

float *array0 = (float*)(&array[16]);

float *array1 = (float*)(&array[20]);

Is there some new memory-alignment rule for vector loads from shared memory? I don't recall seeing that in the guide. For that matter, I am following the guide's recommendation to the letter for assigning dynamically allocated shared-memory arrays, except with floats and float4s instead of shorts, floats, and ints.

Do you need a minimal repro case?

I think we hit that internally the other day, so it should be fixed in an upcoming release.

I take back that this was caused by the shared memory.
If I replace my "old" 295 with a 480, I get strange memory issues.
If I put the 480s in a computer with all-new hardware, everything works.

Haven’t dug into the details since the fix is easy for me.