CUDA crash and black screen after atomicAdd

Hey guys,

I’ve got a problem with CUDA. I want to program a simple raytracer that traces spheres and does the calculation on the GPU. All the pixels are laid out in the grid, and each block manages the samples for one specific pixel. After the hit function call, the color value of the sample should be added to the global pixel array. The problem is: with 64 samples at 400x400 it works fine, but if I raise the sample count to 256, CUDA crashes and the screen goes black for a short amount of time. After that I get the message: “The display driver was restored after an error.”

Here’s the source for the kernel main function:

__global__ void
raytrace_gpu(SSphere* spheres, int num_objects, SColorInt* pixels, SCameraParameter* cam_params)
{
	// read sample index and number of samples
	int sample = threadIdx.x;
	int num_samples = blockDim.x;

	// read pixel indices
	int pixel_x = blockIdx.x;
	int pixel_y = blockIdx.y;

	// detect screen resolution
	int res_x = gridDim.x;
	int res_y = gridDim.y;

	// construct ray
	SRay ray;
	SPoint2D sp, pp;
	sp.x = sample / num_samples; sp.y = sample / num_samples;
	pp.x = cam_params->s * (pixel_x - 0.5 * res_x + sp.x);
	pp.y = cam_params->s * (pixel_y - 0.5 * res_y + sp.y);
	ray.o = cam_params->eye;
	ray.d = get_ray_direction(pp, cam_params->u, cam_params->v, cam_params->w, cam_params->d);

	// raytrace
	SShadeRec sample_hit = hit_objects(ray, spheres, num_objects);
	sample_hit.color = col_div_float(sample_hit.color, (float)num_samples);

	// compute sample value
	SColorInt sample_color = {sample_hit.color.r * 255, sample_hit.color.g * 255, sample_hit.color.b * 255};

	// add sample value to pixel
	atomicAdd(&pixels[pixel_y * res_x + pixel_x].r, sample_color.r);
	atomicAdd(&pixels[pixel_y * res_x + pixel_x].g, sample_color.g);
	atomicAdd(&pixels[pixel_y * res_x + pixel_x].b, sample_color.b);
}

If I don’t execute the atomicAdd function calls, everything works fine with 256 samples.

Does anyone know why this happens, or has anyone encountered the same problem?

Thanks for help.

Daniel

Looks to me like this code does not work as intended within a block.
sample and num_samples are integers, and because sample is always smaller than num_samples (threadIdx.x < blockDim.x),
sp.x and sp.y will both be zero (integer division).
Also, pixel_x and pixel_y are the same for all threads within a block, so pp.x and pp.y are also the same…
etc…
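
A quick fix would be to do the division in floating point, e.g.

	sp.x = (sample + 0.5f) / (float)num_samples;
	sp.y = (sample + 0.5f) / (float)num_samples;

though that still places every sample on the pixel diagonal (sp.x == sp.y), so you probably want a proper 2D sample pattern.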

For the record, I don’t think this is such a good idea. If you’re running 256 threads per block, then all (3×) 256 atomic global memory accesses for a single block will be serialized, which results in a huge drop in performance.

N.

Yes, you’re right about the sample points; I haven’t fixed that yet, but I don’t think that’s the problem.

How can I manage the task without accessing the global pixel array? Is there a way to do the same with shared variables?

GPUs with compute capability 1.2 and higher support atomics on shared memory.

Maybe you could reorganize your grid so that a single thread calculates all samples for a pixel; that avoids a reduction scheme when blending the samples.

Or you could leave it as it is and perform the reduction in shared memory. Either way, I would stay away from using atomics if possible :)
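
For illustration, a minimal sketch of the shared-memory reduction (assuming 256 threads per block, which must be a power of two, and float r/g/b members in SColor; adapt to your real types):

__global__ void
raytrace_gpu_reduce(SSphere* spheres, int num_objects, SColor* pixels, SCameraParameter* cam_params)
{
	__shared__ float red[256], green[256], blue[256];	// one slot per sample/thread

	int sample = threadIdx.x;

	// ... construct the ray and trace the sample exactly as before ...
	SShadeRec sample_hit = hit_objects(ray, spheres, num_objects);

	red[sample]   = sample_hit.color.r;
	green[sample] = sample_hit.color.g;
	blue[sample]  = sample_hit.color.b;
	__syncthreads();

	// tree reduction: halve the number of active threads each step
	for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
		if (sample < stride) {
			red[sample]   += red[sample + stride];
			green[sample] += green[sample + stride];
			blue[sample]  += blue[sample + stride];
		}
		__syncthreads();
	}

	// thread 0 writes the averaged pixel; no atomics needed
	if (sample == 0) {
		int idx = blockIdx.y * gridDim.x + blockIdx.x;
		pixels[idx].r = red[0] / blockDim.x;
		pixels[idx].g = green[0] / blockDim.x;
		pixels[idx].b = blue[0] / blockDim.x;
	}
}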

N.

Thanks for your reply.

I’ll try it this way.

Hi,

I tried the method described above: each thread now handles a whole pixel.
The problem remains, though, with the difference that I can now run the program at 400x400 with 256 samples.
If I raise the sample or pixel count, the black-screen problem appears again. Is it a memory problem? Should I organize the threads another way?

I got the following exception:

NVAPI: bLhThunkInit: failed assert: lhThunk.pfnGetDisplayConfigBufferSizes
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnQueryDisplayConfig
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnSetDisplayConfig
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnDisplayConfigGetDeviceInfo
cudaError_enum at 0x02affd28…
cudaError_enum at 0x02affcd4…
cudaError_enum at 0x02affcdc…
cudaError_enum at 0x02affcdc…
cudaError_enum at 0x02affcdc…

Thanks for helping.

Are you checking for errors (cudaGetLastError)?
Might be helpful to know which error is thrown :)

N.

I added some error checks. I used the following code to check errors:

if ((err = cudaGetLastError()) != cudaSuccess) {
	printf("CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__);
	return;
}

I also added it after calling the GPU raytrace function, but the debugger doesn’t report any errors.

But as soon as a CUDA function (like cudaMemcpy) is called after the GPU computation, the screen goes black and the debugger stops there without printing any errors.

Maybe you didn’t allocate enough memory to store the result, or reading past the array bounds on the device?
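
Also keep in mind that kernel launches are asynchronous: the launch returns immediately, and a kernel error (such as an out-of-bounds access) often only surfaces at the next call that synchronizes with the device, e.g. your cudaMemcpy. A small sketch, reusing your error-check style, to make the error show up right after the launch:

	raytrace_gpu<<<grid, block>>>(device_spheres, num_objects, device_pixels, num_samples, device_cam_params);

	err = cudaThreadSynchronize();	// blocks until the kernel has finished
	if (err != cudaSuccess) {
		printf("CUDA kernel error: %s\n", cudaGetErrorString(err));
	}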

N.

The debugger reports the following exception: Access violation while reading at position …

If I comment out the line where I assign the final pixel value (accumulated from all samples) to the global pixel array, the exception doesn’t occur. Seems like an out-of-bounds access or something like that.

But I don’t understand why this happens, because I’m allocating as much memory as I need for the pixels.

Here’s my host code without error handling:

extern "C" void trace_rays(dim3 grid, dim3 block, 

						   SSphere* spheres, int num_objects, SColor* pixels, int num_samples,

						   SCameraParameter* cam_params) {

	cudaError_t err;

	double gpu_time;

	unsigned int timer;

	// create timer

	cutilCheckError( cutCreateTimer(&timer) );

	// copy spheres to gpu

	SSphere* device_spheres;

	cutilSafeCall(cudaMalloc((void**) &device_spheres, num_objects * sizeof(SSphere)));

	cutilSafeCall(cudaMemcpy(device_spheres, spheres, num_objects * sizeof(SSphere), cudaMemcpyHostToDevice));

	// copy pixels to gpu

	SColor* device_pixels;

	int num_pixels = grid.x * grid.y;

	cutilSafeCall(cudaMalloc((void**) &device_pixels, num_pixels * sizeof(SColor)));

	// copy cam params to gpu

	SCameraParameter* device_cam_params;

	//int num_pixels = grid.x * grid.y;

	cutilSafeCall(cudaMalloc((void**) &device_cam_params, sizeof(SCameraParameter)));

	cutilSafeCall(cudaMemcpy(device_cam_params, cam_params, sizeof(SCameraParameter), cudaMemcpyHostToDevice));

	cutilCheckError( cutResetTimer(timer) );

	cutilCheckError( cutStartTimer(timer) );

	// run the gpu raytracer

	raytrace_gpu<<<grid, block>>>(device_spheres, num_objects, device_pixels, num_samples, device_cam_params);

	// synch threads

	cutilSafeCall(cudaThreadSynchronize());

	// print time

	cutilCheckError( cutStopTimer(timer) );

	gpu_time = cutGetTimerValue(timer);

	printf("raytracing time: %f \n", gpu_time); 

	cutilSafeCall(cudaMemcpy(pixels, device_pixels, num_pixels * sizeof(SColor), cudaMemcpyDeviceToHost));

	

	// free allocated memory

	cutilSafeCall(cudaFree(device_spheres));

	cutilSafeCall(cudaFree(device_pixels));

	cutilSafeCall(cudaFree(device_cam_params));

}

And how much space did you allocate for “SColor* pixels”?

N.

I allocate the same space for the pixels array:

SColor* pixels = new SColor[hres * vres];

and the grid has the same size:

dim3 grid(vp.hres, vp.vres, 1);

Can you post the updated kernel code again, and also exactly how you calculate the grid and block dimensions?

N.

Here is the updated kernel code:

__global__ void
raytrace_gpu(SSphere* spheres, int num_objects, SColor* pixels, int num_samples, SCameraParameter* cam_params)
{
	// read pixel indices
	int pixel_x = blockIdx.x;
	int pixel_y = blockIdx.y;

	// detect screen resolution
	int res_x = gridDim.x;
	int res_y = gridDim.y;

	SColor pixel_color = {0, 0, 0};
	SRay ray;
	SPoint2D sp, pp;
	crap sample_hit;

	// shoot the samples
	for (int i = 0; i < num_samples; i++) {
		// construct ray
		sp.x = 0.5; sp.y = 0.5;	// just for testing
		pp.x = cam_params->s * (pixel_x - 0.5 * res_x + sp.x);
		pp.y = cam_params->s * (pixel_y - 0.5 * res_y + sp.y);
		ray.o = cam_params->eye;
		ray.d = get_ray_direction(pp, cam_params->u, cam_params->v, cam_params->w, cam_params->d);

		// raytrace
		sample_hit = hit_objects(ray, spheres, num_objects);

		// add results
		pixel_color = col_add_col(pixel_color, sample_hit.color);
	}

	// average value
	pixel_color = col_div_float(pixel_color, (float)num_samples);
	pixels[pixel_y * res_x + pixel_x] = pixel_color;
}

The blocks have a size of 1 in each dimension, and the pixel coordinates are taken from grid.x and grid.y. Is this a problem? Otherwise the max size of block.x and block.y is 512, which is too little for a high screen resolution. (I guess I could also tile the image, see the sketch below, but I haven’t tried that yet.)
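
Untested sketch of what I mean by tiling (hypothetical 16x16 tiles, one thread per pixel; assumes the resolution is a multiple of 16):

	dim3 block(16, 16, 1);
	dim3 grid(vp.hres / 16, vp.vres / 16, 1);

and in the kernel:

	int pixel_x = blockIdx.x * blockDim.x + threadIdx.x;
	int pixel_y = blockIdx.y * blockDim.y + threadIdx.y;
	int res_x = gridDim.x * blockDim.x;	// resolution no longer equals gridDim alone
	int res_y = gridDim.y * blockDim.y;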

Many thanks for helping, Nico :)

Well, if you comment out the last line, then it’s quite possible that the compiler optimizes the code and throws away a lot of your calculations, so it’s not necessarily the final write to global memory that causes the crash.
Does it also crash when you write a hardcoded color to the output?

N.

Yes, it also crashes if I write a hardcoded color to the global array. Is it a memory problem?

The thing is, if I set the resolution to 400x400 and the samples to 256, everything works fine.

But if I leave the resolution alone and just raise the number of samples to 1024, it crashes…

So I can’t imagine that it’s an overflow error, because I’m only raising the sample count, which has nothing to do with my final pixel array.

Is it a memory problem, maybe?

OK, if I comment the hit_objects() function out, everything works fine…
It must be a memory problem. Any ideas how to debug the code when this occurs?

Sorry, two posts ago I said that it also doesn’t work with a hardcoded color.

That was wrong. If I assign a hardcoded color value to the global pixel array AND do all the hit tests, it works… It only fails when I assign the computed pixel value to the global pixel array. Any ideas?

If the hardcoded value works, then it’s unlikely to be a memory problem; after all, the addresses in global memory are effectively being written to.

Looks like the bug is in one of your other routines. Try replacing the other routines with hardcoded values, or comment parts out until it no longer crashes; that should locate the bug.
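
For example, a hypothetical stand-in to bisect around hit_objects (hardcoded result instead of the real call):

	// sample_hit = hit_objects(ray, spheres, num_objects);	// suspect call, disabled
	sample_hit.color.set_color(1, 0, 0);			// hardcoded result instead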

N.

I tried commenting stuff out, but it seems to me that there’s some kind of instruction limit or something.

If I comment things out it works, but if I then raise the number of samples, it crashes again.

Here is my code for the hit test against a sphere, the function that causes the crash:

__device__ bool
sphere_hit(const SRay ray, float& tmin, crap& hit, SSphere sphere) {
	float		t;
	SVector3D	temp	= point_sub_point(ray.o, sphere.center);
	float		radius	= sphere.radius;
	float		a		= vec_dot_vec(ray.d, ray.d);
	float		b		= vec_dot_vec(temp, ray.d) * 2.0;
	float		c		= vec_dot_vec(temp, temp) - radius * radius;
	float		disc	= b * b - 4.0 * a * c;

	if (disc < 0.0)
		return (false);
	else {
		float e = sqrt(disc);
		float denom = 2.0 * a;
		t = (-b - e) / denom;	// smaller root

		if (t > kEpsilon) {
			tmin = t;
			hit.color.set_color(1, 0, 0);
			hit.normal = vec_div_float((vec_add_vec(temp, vec_mul_float(ray.d, t))), sphere.radius);
			hit.local_hit_point = point_add_vec(ray.o, vec_mul_float(ray.d, t));
			return (true);
		}

		t = (-b + e) / denom;	// larger root

		if (t > kEpsilon) {
			tmin = t;
			hit.color.set_color(1, 0, 0);
			hit.normal = vec_div_float((vec_add_vec(temp, vec_mul_float(ray.d, t))), sphere.radius);
			hit.local_hit_point = point_add_vec(ray.o, vec_mul_float(ray.d, t));
			return (true);
		}
	}

	return (false);
}

If I comment out the return (true) in the two (t > kEpsilon) blocks, it works fine with 1024 samples at 400x400, but crashes with 4096 samples.

On the other hand, if I comment out everything inside the (t > kEpsilon) blocks, I can raytrace 4000x4000(!) with 4096 samples without any problems.

That’s weird. Has anyone here had the same problem, or can anyone help me?