Odd error fixed by commenting unrelated line?

Okay, this one has officially stumped me and I feel like a chump. We have a little work server with a GTX 295 and a Tesla C1060 that we’ve been using for months. One of our guys has been seeing his program produce bogus output periodically and, in the course of commenting out every part of his program that didn’t affect the results, boiled it down to a minimal case that fails predictably (in itself a long haul).

The offending kernel simply sets every value to -1. Nothing cosmic. On the GTX 295, it works; on the C1060, it fails STRANGELY: some values are set to 0.

The weird thing to me is that a line which modifies a DIFFERENT array can cause issues. If you comment it out, the problem goes away.

I am mystified. Here’s the offending kernel:

[codebox]#include <stdio.h>
#include <stdlib.h>

// CHANGE ME
#include "/home/vputz/NVIDIA_GPU_Computing_SDK/C/common/inc/cutil.h"
//#include "/opt/cuda/sdk/C/common/inc/cutil.h"

#define device 3
#define grids 2
#define blocksPerGrid 64
#define threadsPerBlock 256
#define streamsPerGrid (blocksPerGrid * threadsPerBlock)
#define iterationsPerOutput 1000

/*
 * Sets every element of values and states to junk values.
 */
__global__ void initialize(double * values, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    states[streamIdx * 3] = 4;
    states[streamIdx * 3 + 1] = 5;
    states[streamIdx * 3 + 2] = 6;
    values[streamIdx] = 5.6;
}

/*
 * Sets every element of values to -1.
 */
__global__ void tick(double * values, int iterations, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while (iterations > 0)
    {
        // IF YOU COMMENT OUT THE NEXT LINE, THE PROGRAM WORKS
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
        --iterations;
    }
}

/*
 * Appends the elements of values to a file.
 */
__host__ void writeValues(char * filename, double * values)
{
    FILE * file;
    int streamIdx;

    file = fopen(filename, "a");
    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        fprintf(file, "%G\n", values[streamIdx]);
    }
    fclose(file);
}

/*
 * Main function.
 */
int main(int argc, char ** argv)
{
    FILE * file;
    double * values_h, * values_d;
    int * states_d;
    int gridIdx;

    cudaSetDevice(device);

    // Allocate memory on the host
    values_h = (double *) malloc(streamsPerGrid * sizeof(double));

    // Allocate memory on the device
    CUDA_SAFE_CALL(cudaMalloc((void **) &values_d, streamsPerGrid * sizeof(double)));
    CUDA_SAFE_CALL(cudaMalloc((void **) &states_d, streamsPerGrid * 3 * sizeof(int)));

    // Create files on the host
    file = fopen("initial.dat", "w");
    fclose(file);
    file = fopen("final.dat", "w");
    fclose(file);

    for (gridIdx = 0; gridIdx < grids; gridIdx++)
    {
        printf("Grid %d\n", gridIdx + 1);

        // This sets EVERY NUMBER in values[...] and states[...] to junk values
        initialize<<<blocksPerGrid, threadsPerBlock>>>(values_d, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("initial.dat", values_h);

        // This sets EVERY NUMBER in values[...] to -1 and it SHOULD leave states[...] unchanged
        tick<<<blocksPerGrid, threadsPerBlock>>>(values_d, iterationsPerOutput, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("final.dat", values_h);
    }

    return 0;
}

[/codebox]

This caught my attention because I have a similar problem. I haven’t had the time to take a detailed look at it, but a suggestion might be to try CUDA 3.0. IIRC there’s some functionality there (presumably cuda-memcheck) that acts similarly to valgrind.

Letharion:

Good idea, and I may check it out, but we’ve actually had a good bit of stability after a few changes. I’m not sure which one made the difference, or whether it’s truly fixed, but the above code now seems to run reliably:

  1. A BIOS update. It should be a given that one keeps the BIOS updated, but the machine had seemed to work earlier, so I was reluctant to change anything.

  2. Fiddling with ACPI in the BIOS. It was giving me problems, so I had disabled it while diagnosing a bad fan, but that produced a lot of “NVRM: Failed to register” errors in the logs. Re-enabling ACPI eliminated those errors, and I haven’t seen any Xid errors recently either.

  3. I was getting “IRQ 10 failed, nobody cares” errors in the logs; the error message suggested booting with the “irqpoll” kernel argument, which I did, and those messages no longer appear.

Between those three, we were able to get the test program above to run reliably, and the full version of the program ran well enough to get good results. The problem is that we no longer really trust the setup, but we’ll keep stress-testing it and see whether anything changes. So maybe the code above was fine all along and it was entirely a system problem.

I really wish there were a way to detect those rather than just watching for random bad output.
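
In the meantime, a host-side check can at least flag bad runs automatically instead of relying on eyeballing the .dat files. A minimal sketch (the helper verifyValues is a name made up here, not part of the original program):

[codebox]
/*
 * Minimal sketch: scan the copied-back array and report every element
 * that differs from the expected value. Exact comparison is fine here,
 * because the kernels assign exact constants (0.0 and -1.0). Returns
 * the number of bad elements so a wrapper can detect failed runs.
 */
__host__ int verifyValues(double * values, double expected)
{
    int streamIdx, bad = 0;

    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        if (values[streamIdx] != expected)
        {
            fprintf(stderr, "values[%d] = %G (expected %G)\n",
                    streamIdx, values[streamIdx], expected);
            bad++;
        }
    }
    return bad;
}
[/codebox]

It would be called right after each cudaMemcpy, e.g. verifyValues(values_h, 0.0) after the initialize pass and verifyValues(values_h, -1.0) after the tick pass.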

You’re decrementing the kernel argument “iterations.”

Kernel arguments are not copied locally per thread; on these (compute 1.x) GPUs they are stored in SHARED memory. So you now have unexpected inter-thread communication and a race, since every thread in the block is modifying the same loop variable.

A good coding practice in CUDA is to leave all kernel arguments read-only, just to avoid this surprise.

Try something like:

__global__ void tick(double * values, int kernel_iterations, int * states)
{
    int streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int iterations = kernel_iterations;  // private per-thread copy; the argument stays read-only

    while (iterations > 0)
    {
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
        --iterations;
    }
}

I did not know that! It makes good sense once explained, mind you, and it’s certainly in the FAQ now that I knew to look for it (and what it would mean). Hm!

Alright, I’d like to resurrect this thread. I’m the author of the original program that gives bogus output.

Yes, kernel arguments are not copied locally per thread; thanks very much for pointing this out. However, the program remains broken even after I corrected this! It appears we might actually be facing a serious bug in CUDA. We have also tried the program on another server (not run by us) and reproduced the same problem.

To recap, the problem is as follows. The attached program is meant to fill the array “values” with 0, then write the contents of “values” to the file “initial.dat”. It is then meant to fill values with -1 and write the contents of “values” to the file “final.dat”. Additionally, there are some (trivial) operations on another array “states” that occur in the kernels.

When I run the program, there are values of -1 in the file “initial.dat”, but there should only be values of 0. Commenting out a copy operation on “states” appears to resolve this, but that makes no sense.

I’d be really, really grateful if anyone wants to try this program. This issue is becoming a major problem. If there is something really obvious that we’ve missed, I’d really like to know.

Thanks a lot!

[codebox]

#include <stdio.h>
#include <stdlib.h>

// Change me
#include "/opt/cuda/2.3/sdk/C/common/inc/cutil.h"

#define device 0
#define grids 2
#define blocksPerGrid 64
#define threadsPerBlock 256
#define streamsPerGrid (blocksPerGrid * threadsPerBlock)
#define iterationsPerOutput 1000

/*
 * Sets every element of values and states to 0.
 */
__global__ void initialize(double * values, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    states[streamIdx * 3] = 0;
    states[streamIdx * 3 + 1] = 0;
    states[streamIdx * 3 + 2] = 0;
    values[streamIdx] = 0.0;
}

/*
 * Sets every element of values to -1.
 */
__global__ void tick(double * values, int iterations, int * states)
{
    int streamIdx, iterationIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    for (iterationIdx = 0; iterationIdx < iterations; iterationIdx++)
    {
        // IF YOU COMMENT OUT THE NEXT LINE, THE PROGRAM WORKS
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
    }
}

/*
 * Appends the elements of values to a file.
 */
__host__ void writeValues(char * filename, double * values)
{
    FILE * file;
    int streamIdx;

    file = fopen(filename, "a");
    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        fprintf(file, "%G\n", values[streamIdx]);
    }
    fclose(file);
}

/*
 * Main function.
 */
int main(int argc, char ** argv)
{
    FILE * file;
    double * values_h, * values_d;
    int * states_d;
    int gridIdx;

    cudaSetDevice(device);

    // Allocate memory on the host
    values_h = (double *) malloc(streamsPerGrid * sizeof(double));

    // Allocate memory on the device
    CUDA_SAFE_CALL(cudaMalloc((void **) &values_d, streamsPerGrid * sizeof(double)));
    CUDA_SAFE_CALL(cudaMalloc((void **) &states_d, streamsPerGrid * 3 * sizeof(int)));

    // Create files on the host
    file = fopen("initial.dat", "w");
    fclose(file);
    file = fopen("final.dat", "w");
    fclose(file);

    for (gridIdx = 0; gridIdx < grids; gridIdx++)
    {
        printf("Grid %d\n", gridIdx + 1);

        // This sets EVERY NUMBER in values[...] and states[...] to zero
        initialize<<<blocksPerGrid, threadsPerBlock>>>(values_d, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("initial.dat", values_h);

        // This sets EVERY NUMBER in values[...] to -1 and it SHOULD leave states[...] unchanged
        tick<<<blocksPerGrid, threadsPerBlock>>>(values_d, iterationsPerOutput, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("final.dat", values_h);
    }

    // Free memory on the host
    free(values_h);

    // Free memory on the device
    CUDA_SAFE_CALL(cudaFree(values_d));
    CUDA_SAFE_CALL(cudaFree(states_d));

    return 0;
}

[/codebox]

Sorry, but not when I run it. It works perfectly. Are you compiling this for compute 1.3? Otherwise I can easily imagine it failing if nvcc demotes the doubles back to float.

It has worked intermittently before, but we’ve just tried it again and it is broken.

All machines use CUDA 2.3 and the only compiler flag is -arch sm_13.

I tried the code too, and I got a warning that says:

ptxas /tmp/tmpxft_00002220_00000000-2_NVMain.ptx, line 77; warning : Double is not supported. Demoting to float

So I changed all the doubles to float, because my card only has compute capability 1.1, and now it works fine.
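
For anyone else stuck below compute capability 1.3, a single typedef keeps that double-to-float swap in one place. A minimal sketch (the type name real and the USE_DOUBLE flag are inventions here, not from the original code):

[codebox]
/*
 * Minimal sketch (not from the original program): pick the floating-point
 * type in one place, so cards below compute capability 1.3 can fall back
 * to float without touching every declaration.
 */
#ifdef USE_DOUBLE
typedef double real;   /* requires compiling with -arch sm_13 or better */
#else
typedef float real;
#endif

__global__ void tick(real * values, int iterations, int * states);
[/codebox]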

avidday, supsurge:

Yep, it’s still broken; now tested on multiple computers, multiple cards, and multiple versions of the SDK. We’re evidently missing something obvious, but it’s driving us nuts; sometimes it’ll run thousands of times in a row without fault, other times it’ll pop up with errors. I can’t see any pattern in it. We’ve tried compiling under the emulator and looking for faults, and I’ve even tried compiling with Ocelot just for fun; no problem either way.

It’s really demoralizing and makes me not want to trust CUDA; this is completely baffling. I keep feeling it has to be some sort of race condition. I do note that our “main” server seems to generate Xid errors periodically, which isn’t great, but this error has shown up on other computers, cards, and clusters.

The code also works as expected in my setup: compiling with -arch sm_13, CUDA 3.0 (beta), Ubuntu 9.10 x64, gcc 4.3.4, driver 195.17, running on a Tesla C1060.

As you say, it makes no sense that commenting out something in the tick function would affect output that should depend only on the initialize function. However, since you are running both kernels TWICE, it seems that in your setup, the second time you run initialize and copy to host memory, the host still holds the data from the first call to tick (no idea why). Can you add a cudaThreadSynchronize after copying the memory to the host? I don’t know whether that can really force the copy to be synchronous; in principle it already should be.
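
Concretely, that suggestion amounts to something like the following after each device-to-host copy (a minimal sketch; note that cudaMemcpy is already blocking for device-to-host transfers, so treat this as a diagnostic rather than a known fix):

[codebox]
/*
 * Minimal sketch of the suggestion above: add an extra synchronize
 * after each device-to-host copy before the host reads the buffer.
 */
CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d,
                          streamsPerGrid * sizeof(double),
                          cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaThreadSynchronize());
writeValues("initial.dat", values_h);
[/codebox]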

Another suggestion is to upgrade to the CUDA 3.0 beta and driver 195.17; it may be a driver problem. But as you point out, the logs indicate the card may be having issues, and it may simply be defective.

It’s rather more curious than that (and we have had it fail on 3.0 with 195.17). We’re still looking at it, but my current pet theory is that some other program can put the GPU or driver into a state where this test program fails; I just can’t reliably reproduce it. I had a case where I ran a program that corrupted display memory, after which the test program failed (where it had worked earlier for 1000+ runs). On finding a card in our server on which it was failing, I removed and reloaded the driver, and it then ran fine for hundreds of runs.

We’re still trying to sort it out; it’s fairly subtle, and the idea that one program could leave the device or driver in a state where the next program fails is a curious one. I may be completely wrong, too.