Odd error fixed by commenting unrelated line?

Okay, this one has officially stumped me and I feel like a chump. We have a little work server with a GTX 295 and a Tesla C1060 that we’ve been using for months. One of our guys has been seeing his program produce bogus output periodically and, in the course of commenting out every part of his program that didn’t affect the results, boiled it down to a minimal case that fails predictably (in itself a long haul).

The offending kernel simply sets every value to -1. Nothing cosmic. On the GTX 295, it works; on the C1060, it fails STRANGELY: some values are set to 0.

The weird thing to me is that a line which modifies a DIFFERENT array can cause issues. If you comment it out, the problem goes away.

I am mystified. Here’s the offending kernel:

[codebox]#include <stdio.h>
#include <stdlib.h>

// CHANGE ME
#include "/home/vputz/NVIDIA_GPU_Computing_SDK/C/common/inc/cutil.h"
//#include "/opt/cuda/sdk/C/common/inc/cutil.h"

#define device 3
#define grids 2
#define blocksPerGrid 64
#define threadsPerBlock 256
#define streamsPerGrid (blocksPerGrid * threadsPerBlock)
#define iterationsPerOutput 1000

/*
 * Sets every element of values and states to junk values.
 */
__global__ void initialize(double * values, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    states[streamIdx * 3] = 4;
    states[streamIdx * 3 + 1] = 5;
    states[streamIdx * 3 + 2] = 6;
    values[streamIdx] = 5.6;
}

/*
 * Sets every element of values to -1.
 */
__global__ void tick(double * values, int iterations, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while (iterations > 0)
    {
        // IF YOU COMMENT OUT THE NEXT LINE, THE PROGRAM WORKS
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
        --iterations;
    }
}

/*
 * Appends the elements of values to a file.
 */
__host__ void writeValues(char * filename, double * values)
{
    FILE * file;
    int streamIdx;

    file = fopen(filename, "a");
    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        fprintf(file, "%G\n", values[streamIdx]);
    }
    fclose(file);
}

/*
 * Main function.
 */
int main(int argc, char ** argv)
{
    FILE * file;
    double * values_h, * values_d;
    int * states_d;
    int gridIdx;

    cudaSetDevice(device);

    // Allocate memory on the host
    values_h = (double *) malloc(streamsPerGrid * sizeof(double));

    // Allocate memory on the device
    CUDA_SAFE_CALL(cudaMalloc((void **) &values_d, streamsPerGrid * sizeof(double)));
    CUDA_SAFE_CALL(cudaMalloc((void **) &states_d, streamsPerGrid * 3 * sizeof(int)));

    // Create files on the host
    file = fopen("initial.dat", "w");
    fclose(file);
    file = fopen("final.dat", "w");
    fclose(file);

    for (gridIdx = 0; gridIdx < grids; gridIdx++)
    {
        printf("Grid %d\n", gridIdx + 1);

        // This sets EVERY NUMBER in values[...] and states[...] to junk values
        initialize<<<blocksPerGrid, threadsPerBlock>>>(values_d, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("initial.dat", values_h);

        // This sets EVERY NUMBER in values[...] to -1 and it SHOULD leave states[...] unchanged
        tick<<<blocksPerGrid, threadsPerBlock>>>(values_d, iterationsPerOutput, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("final.dat", values_h);
    }

    return 0;
}

[/codebox]

This caught my attention because I have a similar problem. I haven’t had the time to take a detailed look at it, but a suggestion might be to try CUDA 3.0. IIRC there’s some functionality there (presumably cuda-memcheck) that acts similarly to valgrind.

Letharion:

Good idea, and I may check it out, but we’ve actually had a good bit of stability after a few changes. I’m not sure which one made the difference, or whether it’s truly fixed, but the above code now seems to run reliably:

  1. A BIOS update. It should be a given that one keeps the BIOS updated, but the machine had seemed to work earlier, so I was reluctant to change anything.

  2. Fiddling with ACPI in the BIOS. It was giving me problems, so I had disabled it while diagnosing a bad fan, but that produced a lot of “NVRM: Failed to register” errors in the logs. Re-enabling ACPI eliminated those errors, and I haven’t seen any Xid errors recently either.

  3. I was getting “IRQ 10 failed, nobody cares” errors in the logs; the error message suggested booting with the “irqpoll” kernel argument, which I did, and those messages no longer appear.

Between those three, we were able to get the test program above to run reliably, and the full version of the program ran well enough to get good results. The problem is that we no longer really trust the setup, but we’ll keep stress-testing it and see whether anything changes. So maybe the code above was fine all along and it was entirely a system problem.

I really wish there were a way to detect those rather than just watching for random bad output.
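
In the meantime, a host-side check can at least flag bad runs automatically instead of relying on eyeballing the .dat files. A minimal sketch (the helper verifyValues is a name made up here, not part of the original program):

[codebox]
/*
 * Minimal sketch: scan the copied-back array and report every element
 * that differs from the expected value. Exact comparison is fine here,
 * because the kernels assign exact constants (0.0 and -1.0). Returns
 * the number of bad elements so a wrapper can detect failed runs.
 */
__host__ int verifyValues(double * values, double expected)
{
    int streamIdx, bad = 0;

    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        if (values[streamIdx] != expected)
        {
            fprintf(stderr, "values[%d] = %G (expected %G)\n",
                    streamIdx, values[streamIdx], expected);
            bad++;
        }
    }
    return bad;
}
[/codebox]

It would be called right after each cudaMemcpy, e.g. verifyValues(values_h, 0.0) after the initialize pass and verifyValues(values_h, -1.0) after the tick pass.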

You’re decrementing the kernel argument “iterations.”

Kernel arguments are not copied locally per thread; on these (compute 1.x) GPUs they are stored in SHARED memory. So you now have unexpected inter-thread communication and a race, since every thread in the block is modifying the same loop variable.

A good coding practice in CUDA is to leave all kernel arguments read-only, just to avoid this surprise.

Try something like:

__global__ void tick(double * values, int kernel_iterations, int * states)
{
    int streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int iterations = kernel_iterations;  // private per-thread copy; the argument stays read-only

    while (iterations > 0)
    {
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
        --iterations;
    }
}

I did not know that! It makes good sense once explained, mind you, and it’s certainly in the FAQ now that I knew to look for it (and what it would mean). Hm!

Alright, I’d like to resurrect this thread. I’m the author of the original program that gives bogus output.

Yes, kernel arguments are not copied locally per thread; thanks very much for pointing this out. However, the program remains broken even after I corrected this! It appears we might actually be facing a serious bug in CUDA. We have also tried the program on another server (not run by us) and reproduced the same problem.

To recap, the problem is as follows. The attached program is meant to fill the array “values” with 0, then write the contents of “values” to the file “initial.dat”. It is then meant to fill values with -1 and write the contents of “values” to the file “final.dat”. Additionally, there are some (trivial) operations on another array “states” that occur in the kernels.

When I run the program, there are values of -1 in the file “initial.dat”, but there should only be values of 0. Commenting out a copy operation on “states” appears to resolve this, but that makes no sense.

I’d be really, really grateful if anyone wants to try this program. This issue is becoming a major problem. If there is something really obvious that we’ve missed, I’d really like to know.

Thanks a lot!

[codebox]

#include <stdio.h>
#include <stdlib.h>

// Change me
#include "/opt/cuda/2.3/sdk/C/common/inc/cutil.h"

#define device 0
#define grids 2
#define blocksPerGrid 64
#define threadsPerBlock 256
#define streamsPerGrid (blocksPerGrid * threadsPerBlock)
#define iterationsPerOutput 1000

/*
 * Sets every element of values and states to 0.
 */
__global__ void initialize(double * values, int * states)
{
    int streamIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    states[streamIdx * 3] = 0;
    states[streamIdx * 3 + 1] = 0;
    states[streamIdx * 3 + 2] = 0;
    values[streamIdx] = 0.0;
}

/*
 * Sets every element of values to -1.
 */
__global__ void tick(double * values, int iterations, int * states)
{
    int streamIdx, iterationIdx;

    streamIdx = blockIdx.x * blockDim.x + threadIdx.x;
    for (iterationIdx = 0; iterationIdx < iterations; iterationIdx++)
    {
        // IF YOU COMMENT OUT THE NEXT LINE, THE PROGRAM WORKS
        states[3 * streamIdx] = states[3 * streamIdx + 1];
        values[streamIdx] = -1;
    }
}

/*
 * Appends the elements of values to a file.
 */
__host__ void writeValues(char * filename, double * values)
{
    FILE * file;
    int streamIdx;

    file = fopen(filename, "a");
    for (streamIdx = 0; streamIdx < streamsPerGrid; streamIdx++)
    {
        fprintf(file, "%G\n", values[streamIdx]);
    }
    fclose(file);
}

/*
 * Main function.
 */
int main(int argc, char ** argv)
{
    FILE * file;
    double * values_h, * values_d;
    int * states_d;
    int gridIdx;

    cudaSetDevice(device);

    // Allocate memory on the host
    values_h = (double *) malloc(streamsPerGrid * sizeof(double));

    // Allocate memory on the device
    CUDA_SAFE_CALL(cudaMalloc((void **) &values_d, streamsPerGrid * sizeof(double)));
    CUDA_SAFE_CALL(cudaMalloc((void **) &states_d, streamsPerGrid * 3 * sizeof(int)));

    // Create files on the host
    file = fopen("initial.dat", "w");
    fclose(file);
    file = fopen("final.dat", "w");
    fclose(file);

    for (gridIdx = 0; gridIdx < grids; gridIdx++)
    {
        printf("Grid %d\n", gridIdx + 1);

        // This sets EVERY NUMBER in values[...] and states[...] to zero
        initialize<<<blocksPerGrid, threadsPerBlock>>>(values_d, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("initial.dat", values_h);

        // This sets EVERY NUMBER in values[...] to -1 and it SHOULD leave states[...] unchanged
        tick<<<blocksPerGrid, threadsPerBlock>>>(values_d, iterationsPerOutput, states_d);
        CUDA_SAFE_CALL(cudaThreadSynchronize());
        CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d, streamsPerGrid * sizeof(double), cudaMemcpyDeviceToHost));
        writeValues("final.dat", values_h);
    }

    // Free memory on the host
    free(values_h);

    // Free memory on the device
    CUDA_SAFE_CALL(cudaFree(values_d));
    CUDA_SAFE_CALL(cudaFree(states_d));

    return 0;
}

[/codebox]

Sorry, but not when I run it. It works perfectly. Are you compiling this for compute 1.3? Otherwise I can easily imagine it failing if nvcc demotes the doubles back to float.

It has worked intermittently before, but we’ve just tried it again and it is broken.

All machines use CUDA 2.3 and the only compiler flag is -arch sm_13.

I tried the code too, and I got a warning that says:

ptxas /tmp/tmpxft_00002220_00000000-2_NVMain.ptx, line 77; warning : Double is not supported. Demoting to float

So I changed all the doubles to float, because my card only has compute capability 1.1, and now it works fine.
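
For anyone else stuck below compute capability 1.3, a single typedef keeps that double-to-float swap in one place. A minimal sketch (the type name real and the USE_DOUBLE flag are inventions here, not from the original code):

[codebox]
/*
 * Minimal sketch (not from the original program): pick the floating-point
 * type in one place, so cards below compute capability 1.3 can fall back
 * to float without touching every declaration.
 */
#ifdef USE_DOUBLE
typedef double real;   /* requires compiling with -arch sm_13 or better */
#else
typedef float real;
#endif

__global__ void tick(real * values, int iterations, int * states);
[/codebox]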

avidday, supsurge:

Yep, it’s still broken; now tested on multiple computers, multiple cards, and multiple versions of the SDK. We’re evidently missing something obvious, but it’s driving us nuts; sometimes it’ll run thousands of times in a row without fault, other times it’ll pop up with errors. I can’t see any pattern in it. We’ve tried compiling under the emulator and looking for faults, and I’ve even tried compiling with Ocelot just for fun; no problem either way.

It’s really demoralizing and makes me not want to trust CUDA; this is completely baffling. I keep feeling it has to be some sort of race condition. I do note that our “main” server seems to generate Xid errors periodically, which isn’t great, but this error has shown up on other computers, cards, and clusters.

The code also works as expected in my setup: compiling with -arch sm_13, CUDA 3.0 (beta), Ubuntu 9.10 x64, gcc 4.3.4, driver 195.17, running on a Tesla C1060.

As you say, it makes no sense that commenting out something in the tick function would affect output that should depend only on the initialize function. However, since you are running both kernels TWICE, it seems that in your setup, the second time you run initialize and copy to host memory, the host still holds the data from the first call to tick (no idea why). Can you add a cudaThreadSynchronize after copying the memory to the host? I don’t know whether that can really force the copy to be synchronous; in principle it already should be.
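
Concretely, that suggestion amounts to something like the following after each device-to-host copy (a minimal sketch; note that cudaMemcpy is already blocking for device-to-host transfers, so treat this as a diagnostic rather than a known fix):

[codebox]
/*
 * Minimal sketch of the suggestion above: add an extra synchronize
 * after each device-to-host copy before the host reads the buffer.
 */
CUDA_SAFE_CALL(cudaMemcpy(values_h, values_d,
                          streamsPerGrid * sizeof(double),
                          cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaThreadSynchronize());
writeValues("initial.dat", values_h);
[/codebox]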

Another suggestion is to upgrade to the CUDA 3.0 beta and driver 195.17; it may be a driver problem. But as you point out, the logs indicate the card may be having issues, and it may simply be defective.

It’s rather more curious than that (and we have had it fail on 3.0 with 195.17). We’re still looking at it, but my current pet theory is that some other program can put the GPU or driver into a state where this test program fails; I just can’t reliably reproduce it. I had a case where I ran a program that corrupted display memory, after which the test program failed (where it had worked earlier for 1000+ runs). On finding a card in our server on which it was failing, I removed and reloaded the driver, and it then ran fine for hundreds of runs.

We’re still trying to sort it out; it’s fairly subtle, and the idea that one program could leave the device or driver in a state where the next program fails is a curious one. I may be completely wrong, too.