CUDA code randomly works, and returns wrong results

Hi. Is it possible that I am having problems due to faulty hardware (specifically the motherboard)?

I checked and my CUDA toolkit is correctly installed (SDK version, drivers, etc.). I just bought a new video card this week, with compute capability 3.5 (Kepler), but my program still works only intermittently, and when it does run, it returns wrong results. I have an OpenMP version of the code (which works correctly) and have checked the “translation” many times; apparently there is nothing wrong, and I really cannot find any kernel errors. A sample CUDA code that sums two arrays, each of size 105000000, works correctly, though. I would appreciate any enlightenment.

I would suspect your code first, before focusing on HW or system/infrastructure. That’s just a general statement; obviously I have no knowledge of your code. But the CUDA sum of two arrays working correctly is a reasonable test of the system.

Good practices here are to make sure your code uses proper CUDA error checking (google that phrase, take the first hit, apply it to your code) and also run your code with cuda-memcheck or compute-sanitizer. If any errors are reported by any of that, start your debug focus there.
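For reference, the kind of error-checking macro that search turns up looks roughly like the sketch below (the exact form on the page you land on may differ in details):

```cuda
// Sketch of a typical CUDA error-checking macro; details may differ
// from the version found via the search mentioned above.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheckErrors(msg)                                       \
    do {                                                           \
        cudaError_t __err = cudaGetLastError();                    \
        if (__err != cudaSuccess) {                                \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n",     \
                    msg, cudaGetErrorString(__err),                \
                    __FILE__, __LINE__);                           \
            exit(1);                                               \
        }                                                          \
    } while (0)

// Usage: check after every CUDA API call and kernel launch, e.g.
//   kernel<<<grid, block>>>(...);
//   cudaCheckErrors("kernel launch failed");
//   cudaDeviceSynchronize();
//   cudaCheckErrors("kernel execution failed");
```

On recent toolkits the sanitizer is invoked as `compute-sanitizer ./my_app`; on older toolkits it was `cuda-memcheck ./my_app`.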

Thank you for the fast reply.

Actually, with any CUDA call I make, the program just closes randomly. So if I try to run cuda-memcheck, it may or may not work. But I am confident that it is not a memory problem (in the sense of total device storage), as I need much less than 1 GB of memory (about 0.007 GB).

There are any number of possibilities, such as improper use of managed memory, stack corruption in your program, and many others, that we are unlikely to be able to zero in on with a sequence of questions.

Another possible approach would be to strip your code down to a minimal reproducer and share that code here, if you wish. The community may spot something for you. Even if you don’t post it here, this is often a good debugging practice to narrow the scope of your focus.

If your claim is that the program closes randomly on any CUDA call you make, then it should only be necessary to write about 5 lines of code to see if that holds. If those 5 lines work reliably, keep adding more of your code until the problem appears. That is just one possible approach based on what you’ve shared so far.
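A near-minimal test along those lines could look like this sketch (a hypothetical starting point, not taken from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int *d_buf = nullptr;
    // About the simplest possible CUDA calls: allocate and free device
    // memory, printing the returned status each time.
    cudaError_t err = cudaMalloc(&d_buf, 1024 * sizeof(int));
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    err = cudaFree(d_buf);
    printf("cudaFree: %s\n", cudaGetErrorString(err));
    return 0;
}
```

If that runs reliably, grow it incrementally toward your real program until the crash reappears.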

If you decide to provide an example, I would be sure to do the following for the best possible help:

  1. Provide a minimal but complete code. I should be able to copy, paste, compile, and run the code you post without having to change anything or add anything.
  2. Test to see that the code you post actually demonstrates the problem in your case.
  3. Provide a complete description of how you compile the code.
  4. Provide a complete description of your environment: The exact GPU model, the host operating system, the CUDA version and the driver version.
  5. Provide the actual commands you use to run the code.

Much of this can be done simply by copying and pasting an appropriate portion of a console session. You will find many examples on these forums if you poke around.

If you want to follow my instructions, I’ll take a look. Otherwise perhaps someone else will be able to help you. Good luck!


Yes, please take a look.

If you follow my instructions, I will take a look. So far you have not followed my instructions. You’re welcome to do as you wish of course, perhaps someone else will be able to help you.

Oh I see, you mean providing complete code, etc. Actually it is a very big program, with a GUI programmed with Dear ImGui on Love2D. I think I will have to wait for other answers; maybe I will find the problem myself. Unless you could install Dear ImGui for Love2D on Windows, then it is possible. Very grateful anyway.

Ok, so I am posting part of the code, to check if someone can help me find whether the problem lies there. I am associating each Cartesian index with one position (x,y,z) in a 3D grid (I heard this is considered bad practice, but I have read some papers stating that it makes no difference at all):

__global__ void bounds(float *pnn,float *pn,float *p,float *np,float* vEE,int *tp,int X,int Y,int Z,float lambda,float tau_T){	
	int i = blockIdx.x*blockDim.x + threadIdx.x + 1; // starts at 1
	int j = blockIdx.y*blockDim.y + threadIdx.y + 1; // starts at 1
	int k = blockIdx.z*blockDim.z + threadIdx.z + 1; // starts at 1
	
	float S1, S2, S3, S4, S5, S6;
	
	if ((i < X-1) && (j < Y-1) && (k < Z-1)){

		// ... (remaining computation of S1..S6 elided) ...

		pnn[flat(i,j,k,X,Y)] = (S1 + S2 - S3 + tau_T*(S4 + S5))/S6;
		
		// exchange section
		
		__syncthreads();
		np[flat(i,j,k,X,Y)] = p[flat(i,j,k,X,Y)];
		__syncthreads();
		p[flat(i,j,k,X,Y)] = pn[flat(i,j,k,X,Y)];
		__syncthreads();
		pn[flat(i,j,k,X,Y)] = pnn[flat(i,j,k,X,Y)];
	}
}
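The `flat` helper is not shown above; it is just the usual row-major 3D-to-1D index mapping. Assuming that layout (the original definition is not posted, so this is an illustration only), it would be something like:

```cpp
#include <cassert>

// Typical row-major 3D -> 1D index mapping. Shown as a plain function here;
// in the CUDA code it would be a __device__ function or a macro.
// NOTE: this exact definition is an assumption -- the original flat() is not shown.
inline int flat(int i, int j, int k, int X, int Y) {
    return i + j * X + k * X * Y;
}
```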

Hi.

Just posting to say that the problem was solved. It was a host problem related to file streaming, nothing related to CUDA or the hardware.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.