Flaw in 9800 GTX?

After developing a cuda application for a while on an 8600 on my mac laptop, I switched over to the 9800 GTX on my linux desktop. Despite the code working on the 8600’s GPU, a 8800 GT on a mac desktop, and in emulation on all computers, it exhibited extremely strange (and incorrect) behavior on the 9800. At a certain point in the program, I find myself entirely unable to write a new value into a shared memory array. The bug only manifests itself when I perform some initialization before the write attempt. I’ve attached the smallest subset of my code which still exhibits the bug, for your consideration. Am I doing something improper? Is this a compiler or hardware flaw? The bug appears on both of the 9800 GTX cards on the computer in question, so I assume it isn’t due to some manufacturing flaw on the card. This bug has been killing me for many days. Any help would be appreciated.

Thanks,
Jason

#include <stdio.h>

#ifdef DEVICE_EMULATION
#define emu_printf(…) do{ printf(VA_ARGS); } while(0) /* ; /
#define emu_halt_printf(…) do{ printf(VA_ARGS); exit(1); } while(0) /
; /
#else
#define emu_printf(…) while(0) /
; /
#define emu_halt_printf(…) return /
; */
#endif

global void gtx_9900_bug(float *debug_cuda) {

shared unsigned int edge_size_0;
shared unsigned int edge_size_0_warped; // edge_size_0 rounded to the nearest half-warp
shared unsigned int edge_size_1;

shared float messages_out[32];

int desired_message_index = 0;
int process_message_id = 3;

#define MESSAGE_OFFSET edge_size_0

if(threadIdx.x == 0) { // only assign the values once
edge_size_0 = 2;
edge_size_1 = 4;
edge_size_0_warped = 16;
}

__syncthreads();

if(blockDim.x >= edge_size_0_warped + edge_size_1) {
if(threadIdx.x < edge_size_0) {
desired_message_index = threadIdx.x;
process_message_id = 0;
}
else if((threadIdx.x >= edge_size_0_warped) && (threadIdx.x < edge_size_0_warped + edge_size_1)) {
desired_message_index = threadIdx.x - edge_size_0_warped;
process_message_id = 1;
}
else
process_message_id = 3;
}
else if((blockDim.x >= edge_size_0) && (blockDim.x >= edge_size_1)) { // THESE LINES SEEM TO BE NECESSARY TO INDUCE THE BUG, EVEN THOUGH THEY ARE NEVER EXECUTED SINCE BLOCKDIM.X IS LARGE
process_message_id = 3;
}
else { // THESE LINES SEEM TO BE NECESSARY TO INDUCE THE BUG, EVEN THOUGH THEY ARE NEVER EXECUTED SINCE BLOCKDIM.X IS LARGE
emu_halt_printf(“ERROR: blockDim.x = %d, but the edge sizes are %d and %d\n”, blockDim.x, edge_size_0, edge_size_1);
}

__syncthreads();

if(process_message_id < 2) {
messages_out[desired_message_index + MESSAGE_OFFSET * process_message_id] = 0; // THIS LINE SEEMS TO BE NECESSARY TO INDUCE THE BUG
if(blockIdx.x == 0) {
// confirm that this line is actually run - appears starting in the 9th line of the output. The lower-order digits hold the index to which the write is performed
debug_cuda[threadIdx.x + 40] = 550000 + desired_message_index + MESSAGE_OFFSET * process_message_id;
}
}

/* add a delay
int counter;
float counter_var = 1;
for(counter = 0; counter < 10000; counter++) {
if(counter % 3 == 0)
counter_var /= counter;
else if(counter % 3 == 1)
counter_var *= counter;
}
*/

__syncthreads();

if(threadIdx.x < edge_size_0 + edge_size_1) {
messages_out[threadIdx.x] = 80 + threadIdx.x; // THIS WRITE ATTEMPT ONLY SUCCEEDS FOR THREADS 0 AND 1 ON MY 9800 GTX
if(blockIdx.x == 0) {
// confirm that the write to shared memory is actually performed. Either a 77 or a 79 should appear starting in the 5th line of the output, with position corresponding to the index of the attempted write
if(threadIdx.x < edge_size_0) {
debug_cuda[threadIdx.x + 20] = 77;
}
else {
debug_cuda[threadIdx.x + 20] = 79;
}
}
}

__syncthreads();

if((process_message_id < 2) && (blockIdx.x == 0)) {
// read back from shared memory into global memory, to evaluate whether the write to shared memory succeeded. The value put into global memory includes the contents of the associated shared memory location in the high bits, and the index of shared memory accessed in the low bits. These values appear in the beginning of the debug output
debug_cuda[process_message_id * MESSAGE_OFFSET + desired_message_index] = 1000 * messages_out[desired_message_index + MESSAGE_OFFSET * process_message_id] + desired_message_index + MESSAGE_OFFSET * process_message_id;

//debug_cuda[threadIdx.x + 70] = counter_var + 10; // assure that delay code runs

}
}

int main(void) {

float *debug_cuda_copy;
float *debug_cuda;
const int array_size = 1000;

dim3 dim_grid(100);
dim3 dim_block(64);

int i;

int num_gpu_devices, selected_device;
cudaDeviceProp this_device_prop;
cudaError_t this_error;

cudaGetDeviceCount(&num_gpu_devices);
for(i = 0; i < num_gpu_devices; i++) {
cudaGetDeviceProperties(&this_device_prop, i);
printf(“Available device %d: %s\n”, i, this_device_prop.name);
}
selected_device = 0;

if(selected_device < 0)
selected_device = 0;
if(selected_device >= num_gpu_devices)
selected_device = num_gpu_devices - 1;
printf(“Selecting GPU device %d\n”, selected_device);
this_error = cudaSetDevice(selected_device);
if(this_error != cudaSuccess) {
fprintf(stderr, “ERROR: Could not select GPU device %d\n”, selected_device);
exit(1);
}

this_error = cudaMalloc((void**) &debug_cuda, sizeof(float) * array_size);
if(this_error != cudaSuccess) {
fprintf(stderr, “ERROR: Could not allocate GPU memory\n”);
exit(1);
}

cudaMemset(debug_cuda, 0, sizeof(float) * array_size);
debug_cuda_copy = (float*) malloc(sizeof(float) * array_size);

gtx_9900_bug<<<dim_grid, dim_block>>>(debug_cuda);

cudaMemcpy(debug_cuda_copy, debug_cuda, sizeof(float) * array_size, cudaMemcpyDeviceToHost);

printf(“Contents of debug_cuda:\n”);
for(i = 0; i < 100; i++) {
if((i % 5 == 0) && (i > 0))
printf("\n");
printf("%e, “, debug_cuda_copy[i]);
}
printf(”\n\n");

return 0;
}

Thanks, will try to poke at this soon.

Have you managed to make any progress on this problem? I’ve tried restructuring my code in a few different ways, but I can’t seem to work around this bug. The simple changes which eliminate the problem in the test code I posted don’t fix the full version. If I knew what the nature of the difficulty was, I could probably try more effective changes. If nothing else, it would be a bit reassuring to know that this problem occurs on other 9800 GTX cards besides the ones I have plugged into my computer.

Thanks,

Jason

It seems as if this problem was caused by my use of return statements. I experienced a similar problem in a later section of the code, but this time on all of my GPUs. Removing all return statements from error-catching conditionals seems to have solved the problem, even though these return statements should never have actually been executed, and were definitely never executed in device emulation.