Problem with two seemingly unrelated sections in my code

fdls2011 · November 10, 2017, 6:04am

Sorry for the confusing topic title; I really don’t know how to summarize my problem. I have the following kernel.

__global__ void myKernel(float *tss_in, int *seeds_in, int *pConfsSum_out, int *nConfsSum_out, int *pConfsAll, int *nConfsAll, double *distMtx,
	const int numSeeds, const int numTrain, const int numP, const int numPLabeled, const int tsLen, const int sLen, const int nextTsIdx){

	extern __shared__ float array[];
	float *ts1 = (float*)array;
	float *ts2 = (float*)&ts1[tsLen];
	
	int tid = threadIdx.x;
	int blockId = blockIdx.x;
	int base = blockId * tsLen;
	int blockSize = blockDim.x;
	
	int numIters = ceil((double)tsLen / blockSize);
	int *pConfs, *nConfs;
	double *distVec;
	bool isValid;
	int idx, numThis, start;
	int elms[3];
	double term1, term2, s1, s1_2, s2, s2_2, mu2, sigma2, corr,
		mu1[MAX_ITERS], sigma1[MAX_ITERS], dotPr[MAX_ITERS], nnCorr[MAX_ITERS];

	for (int i = 0; i < MAX_ITERS; i++)
		mu1[i] = sigma1[i] = dotPr[i] = nnCorr[i] = 0;
	for (int i = 0; i < 3; i++)
		elms[i] = 0;

	for (int i = 0; i < numIters; i++){
		numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
		if (tid < numThis){
			start = i * blockSize + tid;
			ts1[start] = tss_in[nextTsIdx * tsLen + base + start];
		}
	}
	__syncthreads();

	for (int i = 0; i < numIters; i++){
		start = i * blockSize + tid;	
		isValid = start < tsLen - sLen + 1;
		if (isValid){	
			s1 = s1_2 = 0;
			for (int k = 0; k < sLen; k++){
				term1 = ts1[start + k];
				s1 += term1;
				s1_2 += term1 * term1;
			}
			mu1[i] = s1 / sLen;
			sigma1[i] = s1_2 / sLen > mu1[i] * mu1[i] ? sqrt(s1_2 / sLen - mu1[i] * mu1[i]) : 1;
		}
	}

	for (int j = 0; j < numTrain; j++){

		for (int i = 0; i < numIters; i++){
			numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
			if (tid < numThis){
				start = i * blockSize + tid;
				ts2[start] = tss_in[j * tsLen + start];
			}
		}
		__syncthreads();

		for (int i = 0; i < numIters; i++){
			start = i * blockSize + tid;
			isValid = start < tsLen - sLen + 1;
			if (isValid){
				//initiation
				s2 = s2_2 = 0;
				for (int k = 0; k < sLen; k++){
					/*term2 = ts2[k];
					s2 += term2;
					s2_2 += term2 * term2;*/
				}

			}
		}
		__syncthreads();

	}

	for (int i = 0; i < numSeeds; i++){
		

		numIters = ceil((double)tsLen / blockSize);
		for(int w = 0; w < numIters; w++){
			start = w * blockSize + tid;
			if (start < tsLen){
				idx = (blockId * tsLen + start) * numTrain;
				pConfs = &pConfsAll[idx];
				nConfs = &nConfsAll[idx];
				//for (k = 0; k < numTrain; k++){
				for(int a = 0; a < numTrain; a++){
					pConfs[a] = 100;
					nConfs[a] = 100;
				}
			}
		}
		__syncthreads();

		if (blockId + tid == 0 && i == 0){
			int cnt = 0;
			for (int w = 0; w < numTrain; w++){
				for (int c = 0; c < tsLen; c++){
					if (pConfsAll[w * c * numTrain + 70] != 100){
						printf("s = %d, c = %d, val = %d\n", w, c, pConfsAll[w * c * numTrain + 70]);
						cnt++;
					}
					if (cnt == 100)
						break;
				}
				if (cnt == 100)
					break;
			}
		}
	}
}

The array pConfsAll is allocated with numTrain * tsLen * numTrain * sizeof(int) bytes. In this particular test case the grid size is numTrain and the block size is tsLen (although both can be set to smaller values).

Lines 100-114 (counting the kernel signature as line 1, i.e. the printf block near the end of the kernel) are intended for testing. I expected nothing to be printed, since all values in pConfsAll are supposed to be set to 100. When the loop body in lines 69-71 was commented out, nothing was printed, as expected. What baffles me is that when that loop body was included as functional code, something did get printed, indicating that certain values in pConfsAll had not been set. This surprises me, since it seems to me that lines 69-71 have nothing to do with lines 100-114.

Could someone help me with this problem? Thank you!

njuffa · November 10, 2017, 7:41am

Have you run under control of cuda-memcheck, with all checkers enabled? This should find most out-of-bounds accesses and a decent portion of race conditions.

Beyond that, I would suggest using standard debugging techniques, e.g. tracing variable values back from the point of failure. Check for any instances of undefined C++ behavior. Your code could contain a simple design bug. Familiarize yourself with the CUDA debugger, if you haven’t done so yet.

Debugging with printf() can be valuable by creating a log of variable settings etc. Be aware that the ring buffer for device-side printf() is of limited size and can overflow when presented with too much data, making the output misleading since some data will be missing. As I recall, the buffer size is adjustable with a CUDA API call.
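
The API call in question is most likely cudaDeviceSetLimit() with cudaLimitPrintfFifoSize. The following is a minimal sketch, not taken from the code in this thread: chattyKernel and the 16 MB buffer size are arbitrary choices made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread prints one line.
__global__ void chattyKernel()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    // Query the current size of the device-side printf() buffer.
    size_t fifoBytes = 0;
    cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize);
    printf("printf FIFO before: %llu bytes\n", (unsigned long long)fifoBytes);

    // The limit has to be changed before any kernel that uses printf() is
    // launched. 16 MB is an arbitrary value picked for this sketch.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize,
                                         16 * 1024 * 1024);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    chattyKernel<<<64, 256>>>();

    // Device-side output is flushed at synchronization points such as this.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Even with a larger buffer, limiting the output to a single thread or a small index range, as the test block in the kernel above does, keeps the log readable.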

fdls2011 · November 10, 2017, 8:37am

Thanks for the answer. I tried to run cuda-memcheck using the original grid size (numTrain) and block size (tsLen). Nothing seemed to have changed. However, after I changed the grid size to 1 while keeping the previous block size, the kernel crashed with “unspecified launch failure” and I got the following results from cuda-memcheck.

========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00001428 in myKernel(float*, int*, int*, int*, int*, int*, double*, int, int, int, int, int, int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x6004596f8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x19339f) [0x19ac95]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x2422]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x19a5]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xeeb3]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xed63]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xd241]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xddd9]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x2a9195) [0x2b0a8b]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x10f5]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xddde]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x1a4558) [0x1abe4e]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x82f6]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x8fe8]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xbb76]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x5caf]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x6066]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x1e7f7]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xde14]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= ERROR SUMMARY: 3 errors

Could you please tell me what this might be due to? Thanks!

njuffa · November 10, 2017, 8:52am

Unspecified launch failure typically means the kernel hit an out-of-bounds memory access and was killed. cuda-memcheck gives you the details: Invalid __global__ read of size 4.
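
Because kernel launches are asynchronous, such a failure is reported by a later API call, which is why the log above shows it at cudaThreadSynchronize. A minimal sketch of checking for it on the host follows; the CUDA_CHECK macro and scaleKernel are hypothetical names, not part of the code in this thread, and cudaDeviceSynchronize() is used in place of the deprecated cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel with a bounds check on the global index.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *d_data = NULL;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```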

If you use a debug build, cuda-memcheck can tell you what line in the code contains this out-of-bounds global memory read.
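
As a rough illustration of that workflow, the sketch below contains a deliberate out-of-bounds read. The file name oob_demo.cu, the kernel readPastEnd, and the shift value are all made up for this example; the nvcc flags shown in the comment (-G, -g, -lineinfo) are standard options.

```cuda
// Build and run (the file name is hypothetical):
//   nvcc -G -g -o oob_demo oob_demo.cu        full debug build
//   nvcc -lineinfo -o oob_demo oob_demo.cu    optimized build with line info
//   cuda-memcheck ./oob_demo
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a deliberate bug: the read index is shifted
// far beyond the end of `in`.
__global__ void readPastEnd(const int *in, int *out, int n, int shift)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + shift];   // out of bounds when i + shift >= n
}

int main()
{
    const int n = 256;
    int *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));

    // The shift of 1 << 26 elements is arbitrary; it only needs to point
    // well outside the allocation.
    readPastEnd<<<1, n>>>(d_in, d_out, n, 1 << 26);

    // Report whatever error the runtime returns after the kernel has run.
    cudaError_t err = cudaDeviceSynchronize();
    printf("after kernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With line information compiled in, the cuda-memcheck report for the invalid read should name the source file and line of the faulting access instead of only an instruction offset like the 0x00001428 in the log above.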
