typecasting and uncoalesced memory

I have a simple kernel that is causing uncoalesced memory loads, and they are dramatically slowing down my program. I know that I currently have problems due to loading ints. In the following code, left and right are float arrays and consist is a float.

I am using CUDA 2.0.

const int x = __mul24(blockIdx.x,blockDim.x) + threadIdx.x;
const int y = __mul24(blockIdx.y,blockDim.y) + threadIdx.y;
const int i = __mul24(y,width)+x;

	int dl = (int)left[i];
	int dlr = (int)right[i-dl];

	left[i] = 0;

So I tried changing the code to inline some of the calculations so that I am not storing to ints:

if(abs(left[i] - right[i-left[i]])>consist)

but this generates a compiler error since left[i] is a float. So I tried typecasting:

if(abs(left[i] - right[i-(int)left[i]])>consist)

But this still has the same number of uncoalesced loads as above. Just to make sure I had identified the location of the problem I tried:

if(abs(left[i] - right[i])>consist)

Resulting in 0 uncoalesced loads.

I then also tried type casting as mentioned in the programming guide by using:

if(abs(left[i] - right[i-__float_as_int(left[i])])>consist)

This resulted in an unspecified kernel launch error.

Why would inline typecasting result in uncoalesced memory reads? And does anybody have any suggestions on how to do this without having this problem? And why would I get an error with __float_as_int? If I do __float_as_int(5.0f) I do not get any errors.


Typecasting does not cause the uncoalesced reads; you had them when the cast was out of line too. Subtracting some semi-random value from the index does.
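To make the point about the index concrete, here is a rough host-side sketch in plain C++ (so it runs without a GPU). It assumes the strict rule that I believe applies to compute capability 1.0/1.1 devices: within a half-warp of 16 threads, thread k must read element base + k, with base a multiple of 16 elements. It also assumes 4-byte floats and a suitably aligned array. The helper names half_warp_indices and is_coalesced_strict are made up for illustration and are not part of CUDA.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CUDA API): the element indices that 16
// consecutive threads, starting at linear index first_i, would read from
// `right` for the access pattern right[i - (int)left[i]].
std::vector<long> half_warp_indices(const std::vector<float>& left, long first_i) {
    std::vector<long> idx;
    for (long k = 0; k < 16; ++k) {
        long i = first_i + k;
        // Same value conversion as (int)left[i]: truncates toward zero.
        idx.push_back(i - static_cast<long>(left[i]));
    }
    return idx;
}

// Simplified strict coalescing check (an assumption about older hardware):
// thread k must read element base + k, and base must be a multiple of 16.
bool is_coalesced_strict(const std::vector<long>& idx) {
    if (idx.empty() || idx[0] % 16 != 0) return false;
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (idx[k] != idx[0] + static_cast<long>(k)) return false;
    }
    return true;
}
```

With left all zeros the half-warp reads right[i] and passes the check. A single nonzero entry in left breaks the sequence, and even a constant nonzero offset breaks the alignment. Whether the cast is written inline or stored in an int first makes no difference to the indices.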

Review the Global Memory section of the Performance Guidelines chapter of the CUDA Programming Guide. Any of the figures showing “examples of non-coalesced global memory access patterns” could be happening in your code.
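On the __float_as_int question: that intrinsic reinterprets the bits of the float rather than converting its value, so it is not a substitute for (int). The sketch below shows the difference on the host in plain C++, using memcpy as a stand-in for the device intrinsic. It assumes 32-bit IEEE 754 floats, and the names value_cast and bits_as_int are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Value conversion: what (int)x does. Truncates toward zero.
int value_cast(float x) {
    return static_cast<int>(x);
}

// Bit reinterpretation: a host-side stand-in (assumption) for what
// __float_as_int does on the device. The 32 bits are copied unchanged.
std::int32_t bits_as_int(float x) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

For 5.0f the value conversion gives 5, while the bit pattern 0x40A00000 read as an integer is 1084227584. Calling __float_as_int(5.0f) on its own is harmless, but using a number that large in right[i - ...] would index far outside the array, which is probably what produced the unspecified launch failure.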