Local memory race condition? memory getting overwritten causing strange results

Ok experts, lets see if you can help with this then, here’s my code:

[codebox]device RGBColour reflectiveShade(ShadeRec& sr)

{

RGBColour L;

L = specularShade(sr);  // direct illumination



float3 wo, wi;	

RGBColour fr;

wo = -sr.ray.d;

Ray reflected_ray[3];

fr = reflectiveF(sr, wo, wi);

reflected_ray[0].o = sr.hitPoint;

reflected_ray[0].d = wi; 

RGBColour temp;

temp.r = 0; temp.g = 0; temp.b = 0;

for(int i = 0; i < vp.maxDepth -1; i++)

{

	sr = hitObjects(reflected_ray[i]);    

	

	if (sr.hitAnObject) 

	{

		sr.ray = reflected_ray[i];	

		wo = -sr.ray.d;

		fr = reflectiveF(sr, wo, wi); 

		reflected_ray[i+1].o = sr.hitPoint;

		reflected_ray[i+1].d = wi;

	}

	else

	{

		reflected_ray[i+1].d.x = 0.0f;

	}

}	

for(int i = 0; i < vp.maxDepth; i++)

{

	if(reflected_ray[i].d.x != 0.0f)

	{

		sr = hitObjects(reflected_ray[i]);    

		

		if (sr.hitAnObject) 

		{

			sr.ray = reflected_ray[i];	

			temp = specularShade(sr);  // direct illumination

			wo = -sr.ray.d;

			fr = reflectiveF(sr, wo, wi); 

			temp += max_to_one(fr * temp * (dot(sr.normal, wi)));

			L += max_to_one(fr * temp * (dot(sr.normal, wi)));

		}

		

	}

}		

return (L);	

}[/codebox]

I’m trying to store rays on a “stack” with an array of Rays being placed in local memory for each thread. But it seems each thread is using the same memory space for the ray array and its buggering up the values - this doesn’t happen in the emuDebug mode, so it obviously a race condition. How do I get the ray array to stay local to the thread and not global to all threads? Anyone got any tips/ advice?

Without your global function there is no way to tell what is wrong. There is nothing depending on threadIdx and blockIdx in this code.

Here’s the global function. I didn’t use threadIdx or blockIdx because I thought (and it should be) that reflected_ray[3] is local to each thread.

[codebox]global void render(int* g_odata)

{

RGBColour L;

Ray ray;

float2 pp;

ray.d.x = 0;

int x = blockIdx.x*blockDim.x + threadIdx.x;

int y = blockIdx.y*blockDim.y + threadIdx.y;

ray.o = camera.eye;



pp.x = x - 0.5 * vp.hres + 0.5; 

pp.y = y - 0.5 * vp.vres + 0.5;

ray.d = camera.getDirection(pp);

L = tracer.traceRay(ray);



g_odata[y*vp.hres+x] = rgbToInt(L.r * 255, L.g * 255, L.b * 255);

//__syncthreads();

}

//–ptxas-options=-v

void setupRaytracer(int pbo_in, int pbo_out, int width, int height, float cameraEye[3], float rotate[3], int rcMoveX, int rcMoveY, int prev_rcMoveX, int prev_rcMoveY, int move)

{

int* out_data;

dim3 block(16, 16, 1);

dim3 grid(width / block.x, height / block.y, 1);

CUDA_SAFE_CALL(cudaGLMapBufferObject( (void**)&out_data, pbo_out));

build(width, height, cameraEye, rotate, rcMoveX, rcMoveY, prev_rcMoveX, prev_rcMoveY, move);

render<<< grid, block>>>(out_data);

CUDA_SAFE_CALL(cudaGLUnmapBufferObject(pbo_out));

}[/codebox]

Well, that looks indeed good. And your array should be local.
So it goes wrong somewhere else.
Are you sure that the problem lies where you think it does? Outputting some intermediate values often really helps in determining where the error lies, I once spent 1,5 weeks looking for an error in a piece of code, where it turned out to the error was just before…

Thanks, but debugging won’t help in my case it seems because the emulation versions run perfectly and give the right output! Which makes me believe it is something to do with memory (reflected_ray[3]) being accessed concurrently on the device as in emulation mode everything is serialised, so this problem won’t become apparent. Can anyone else shed some light?

Can anyone help? Mister Anderson? You seem to be a genius in the field of CUDA!

Sorry, I wish I could help. My limited CUDA knowledge only covers those parts that I’ve used, and local memory is not one of them.

All I can really add to this discussion is to confirm that your use of local memory is correct and there really shouldn’t be any problems with it. The only thing that stands out as odd is that you loop over vp.maxDepth, but declare reflected_ray to be an array of length 3: are you sure that vp.maxDepth is <= 3???

You might also try decreasing the complexity of your code until you reach the simplest possible example that still exhibits the problem. That is often a helpful problem solving technique.

Well, I can tell you that if you output intermediate values from your kernel (in release mode) you might find out what is going wrong. The fact that emulation mode works ok does not exclude other errors from happening.

Hello, thanks for you input guys, I’ve tried using a shared array of [64][3] Rays, but I get the same problem, even though each thread still accesses its own set of rays:

[codebox]device RGBColour reflectiveShade(ShadeRec& sr)

{

RGBColour L;

L = specularShade(sr);  // direct illumination



float3 wo, wi;	

RGBColour fr;

wo = -sr.ray.d;

__shared__ Ray reflected_ray[64][3];

fr = reflectiveF(sr, wo, wi);

reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][0].o = sr.hitPoint;

reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][0].d = wi; 

RGBColour temp;

temp.r = 0; temp.g = 0; temp.b = 0;

for(int i = 0; i < vp.maxDepth -1; i++)

{

	sr = hitObjects(reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i]);    

	

	if (sr.hitAnObject) 

	{

		sr.ray = reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i];	

		wo = -sr.ray.d;

		fr = reflectiveF(sr, wo, wi); 

		reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i+1].o = sr.hitPoint;

		reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i+1].d = wi;

	}

	else

	{

		reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i+1].d.x = 0.0f;

	}

}	

for(int i = 0; i < vp.maxDepth; i++)

{

	if(reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i].d.x != 0.0f)

	{

		sr = hitObjects(reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i]);    

		

		if (sr.hitAnObject) 

		{

			sr.ray = reflected_ray[threadIdx.y*blockDim.x + threadIdx.x][i];	

			temp = specularShade(sr);  // direct illumination

			wo = -sr.ray.d;

			fr = reflectiveF(sr, wo, wi); 

			temp += max_to_one(fr * temp * (dot(sr.normal, wi)));

			L += max_to_one(fr * temp * (dot(sr.normal, wi)));

		}

		

	}

}		

return (L);	

}[/codebox].

By the way, how am I meant to output intermediary values in release mode? Obviously printf’s won’t work and breakpoints won’t…

Usually when something is working in emu but not in release, there’s some host-mem pointer being passed to the kernel.
Apart from that, how do you figure that the same local-mem is being used for all kernels, or that there is a problem with the local-mem to begin with?

To output intermediary values, write them to an array, dma them down into host-mem and print them from there.

But I’m pretty sure there no host pointer mishaps as I have several other shade functions and they work fine. Further more, if I choose to use one ray (i.e. no array of rays) then the image gets rendered correctly, so what does this say? Perhaps not a problem with concurrent writing but a problem with arrays?

And also, to output intermediary values to an array, does that mean passing a global array to all my functions? I have several before the reflectiveShade() function is reached and it would be a pain in the ass to code this in (and then to find out the values are wrong, which I know they are).

Ok, I’ve found a bizarre solution - If I set the second loop bound to be vp.maxDepth - 1, instead of vp.maxDepth then it works fine for a maxDepth of 3. If I change it to 4 or 5 or higher, then I get artifacts again. If I change it lower, then it is fine… weird

As MisterAnderson pointed out, if maxDepth is greater than 3 then you’ll exceed the array bounds, and threads may clobber each other’s data if the local memory storage happens to be adjacent to each other.

As for your shared memory implementation, it looks like your block is 16x16, meaning that threadIdx.y*blockDim.x + threadIdx.x can be as large as 255, but the array is only declared to be [64][3]. Exceeding the bounds will probably clobber the data of another block running on the same multiprocessor.

In emulation, the program might exceed the array bounds and silently succeed even though the behavior is undefined. It doesn’t have to be a race condition.

Double check all array accesses. Try smaller blocks (say 8x8=64) or bigger arrays (e.g. [256][6] if possible) and see if the problem goes away. My hunch is that’s where the problem lies.

Ok, thank you Jamie K. I did indeed alter the array bounds when I altered max depth and the shared memory example was using a block size of 8 x 8, so that wasn’t the problem. Anyway, I think I have fixed this now, although I am running into other weird ray tracing problems…