Problem with two seemingly unrelated sections in my code

Sorry for the confusing topic title; I really don’t know how to summarize my problem. I have the following kernel.

__global__ void myKernel(float *tss_in, int *seeds_in, int *pConfsSum_out, int *nConfsSum_out, int *pConfsAll, int *nConfsAll, double *distMtx,
	const int numSeeds, const int numTrain, const int numP, const int numPLabeled, const int tsLen, const int sLen, const int nextTsIdx){

	extern __shared__ float array[];
	float *ts1 = (float*)array;
	float *ts2 = (float*)&ts1[tsLen];
	
	int tid = threadIdx.x;
	int blockId = blockIdx.x;
	int base = blockId * tsLen;
	int blockSize = blockDim.x;
	
	int numIters = ceil((double)tsLen / blockSize);
	int *pConfs, *nConfs;
	double *distVec;
	bool isValid;
	int idx, numThis, start;
	int elms[3];
	double term1, term2, s1, s1_2, s2, s2_2, mu2, sigma2, corr,
		mu1[MAX_ITERS], sigma1[MAX_ITERS], dotPr[MAX_ITERS], nnCorr[MAX_ITERS];

	for (int i = 0; i < MAX_ITERS; i++)
		mu1[i] = sigma1[i] = dotPr[i] = nnCorr[i] = 0;
	for (int i = 0; i < 3; i++)
		elms[i] = 0;

	for (int i = 0; i < numIters; i++){
		numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
		if (tid < numThis){
			start = i * blockSize + tid;
			ts1[start] = tss_in[nextTsIdx * tsLen + base + start];
		}
	}
	__syncthreads();

	for (int i = 0; i < numIters; i++){
		start = i * blockSize + tid;	
		isValid = start < tsLen - sLen + 1;
		if (isValid){	
			s1 = s1_2 = 0;
			for (int k = 0; k < sLen; k++){
				term1 = ts1[start + k];
				s1 += term1;
				s1_2 += term1 * term1;
			}
			mu1[i] = s1 / sLen;
			sigma1[i] = s1_2 / sLen > mu1[i] * mu1[i] ? sqrt(s1_2 / sLen - mu1[i] * mu1[i]) : 1;
		}
	}

	for (int j = 0; j < numTrain; j++){

		for (int i = 0; i < numIters; i++){
			numThis = (i == numIters - 1) ? tsLen - (i * blockSize) : blockSize;
			if (tid < numThis){
				start = i * blockSize + tid;
				ts2[start] = tss_in[j * tsLen + start];
			}
		}
		__syncthreads();

		for (int i = 0; i < numIters; i++){
			start = i * blockSize + tid;
			isValid = start < tsLen - sLen + 1;
			if (isValid){
				//initiation
				s2 = s2_2 = 0;
				for (int k = 0; k < sLen; k++){
					/*term2 = ts2[k];
					s2 += term2;
					s2_2 += term2 * term2;*/
				}

			}
		}
		__syncthreads();

	}

	for (int i = 0; i < numSeeds; i++){
		

		numIters = ceil((double)tsLen / blockSize);
		for(int w = 0; w < numIters; w++){
			start = w * blockSize + tid;
			if (start < tsLen){
				idx = (blockId * tsLen + start) * numTrain;
				pConfs = &pConfsAll[idx];
				nConfs = &nConfsAll[idx];
				//for (k = 0; k < numTrain; k++){
				for(int a = 0; a < numTrain; a++){
					pConfs[a] = 100;
					nConfs[a] = 100;
				}
			}
		}
		__syncthreads();

		if (blockId + tid == 0 && i == 0){
			int cnt = 0;
			for (int w = 0; w < numTrain; w++){
				for (int c = 0; c < tsLen; c++){
					if (pConfsAll[w * c * numTrain + 70] != 100){
						printf("s = %d, c = %d, val = %d\n", w, c, pConfsAll[w * c * numTrain + 70]);
						cnt++;
					}
					if (cnt == 100)
						break;
				}
				if (cnt == 100)
					break;
			}
		}
	}
}

The array pConfsAll is allocated with numTrain * tsLen * numTrain * sizeof(int) bytes. In this particular test case the grid size is numTrain and the block size is tsLen (although both can be set to smaller values).

Lines 100-114 of the kernel (the printf() check near the end) are intended for testing. I expected nothing to be printed, since all values in pConfsAll are supposed to be set to 100. When the loop body in lines 69-71 was commented out, nothing was printed, as expected. What baffles me is that when that loop body was included as live code, something did get printed, indicating that some values in pConfsAll were not 100. This surprises me, since it seems to me that lines 69-71 have nothing to do with lines 100-114.

Could someone help me with this problem? Thank you!

Have you run the program under cuda-memcheck, with all checkers enabled? This should find most out-of-bounds accesses and a decent portion of race conditions.

Beyond that, I would suggest using standard debugging techniques, e.g. tracing variable values back from the point of failure. Check for any instances of undefined C++ behavior. Your code could also contain a simple design bug. Familiarize yourself with the CUDA debugger, if you haven’t done so yet.
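One way to trace an index back from the point of failure is to validate it against the element count of the array just before the access. The following is only a minimal sketch: indexInBounds, guardedCopy, and countBadIndices are hypothetical names that do not appear in the original kernel, and the one-block launch assumes count is no larger than the maximum block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical debugging helper (not part of the original kernel).
// Returns true if idx is a valid element index for an array of `count`
// elements; otherwise reports which thread produced the bad index.
__device__ bool indexInBounds(long long idx, long long count,
                              const char *name, int line)
{
    if (idx >= 0 && idx < count)
        return true;
    printf("%s: index %lld outside [0, %lld) at line %d, block %d, thread %d\n",
           name, idx, count, line, blockIdx.x, threadIdx.x);
    return false;
}

// Example kernel: each thread copies one element, guarded by the check.
// `offset` is only there to provoke out-of-range indices on purpose.
__global__ void guardedCopy(const int *in, int *out, int count, int offset,
                            int *numBad)
{
    long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (indexInBounds(idx, count, "in", __LINE__))
        out[idx] = in[idx];
    else
        atomicAdd(numBad, 1);   // count rejected accesses instead of faulting
}

// Host wrapper: launches one block of `count` threads and returns how many
// threads computed an out-of-range index, or -1 if a CUDA call failed.
int countBadIndices(int count, int offset)
{
    int *in = nullptr, *out = nullptr, *numBad = nullptr;
    int bad = -1;
    if (cudaMalloc(&in, count * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&out, count * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&numBad, sizeof(int)) == cudaSuccess &&
        cudaMemset(in, 0, count * sizeof(int)) == cudaSuccess &&
        cudaMemset(numBad, 0, sizeof(int)) == cudaSuccess) {
        guardedCopy<<<1, count>>>(in, out, count, offset, numBad);
        if (cudaDeviceSynchronize() == cudaSuccess)
            cudaMemcpy(&bad, numBad, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(in);
    cudaFree(out);
    cudaFree(numBad);
    return bad;
}
```

With 64 threads and an offset of 3, the last three threads compute indices 64, 65, and 66, so the wrapper should report three rejected accesses. The same kind of guard could be placed in front of each global-memory access in a kernel under investigation.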

Debugging with printf() can be valuable for creating a log of variable values. Be aware that the buffer for device-side printf() is of limited size and can overflow when presented with too much data, which makes the output misleading because some data will be missing. As I recall, the buffer size is adjustable with a CUDA API call (cudaDeviceSetLimit() with cudaLimitPrintfFifoSize).
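For reference, the runtime call that adjusts this buffer is cudaDeviceSetLimit() with cudaLimitPrintfFifoSize. A minimal sketch follows; setPrintfBufferBytes is a hypothetical helper name, and the size to request is whatever suits your output volume.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: request a larger device-side printf() buffer.
// Returns the size that is actually in effect afterwards, or 0 on failure.
// The limit is a request; the driver may adjust the value it applies.
size_t setPrintfBufferBytes(size_t requestedBytes)
{
    if (cudaDeviceSetLimit(cudaLimitPrintfFifoSize, requestedBytes) != cudaSuccess)
        return 0;

    size_t actualBytes = 0;
    if (cudaDeviceGetLimit(&actualBytes, cudaLimitPrintfFifoSize) != cudaSuccess)
        return 0;
    return actualBytes;
}
```

The call has to be made before launching the kernel whose output you want to capture, for example setPrintfBufferBytes(8u << 20) to ask for 8 MB.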

Thanks for the answer. I tried to run cuda-memcheck using the original grid size (numTrain) and block size (tsLen). Nothing seemed to have changed. However, after I changed the grid size to 1 while keeping the previous block size, the kernel crashed with “unspecified launch failure” and I got the following results from cuda-memcheck.

========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00001428 in myKernel(float*, int*, int*, int*, int*, int*, double*, int, int, int, int, int, int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x6004596f8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x19339f) [0x19ac95]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x2422]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x19a5]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xeeb3]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xed63]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xd241]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xddd9]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x2a9195) [0x2b0a8b]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x10f5]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xddde]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuD3D10GetDevices + 0x1a4558) [0x1abe4e]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x82f6]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x8fe8]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xbb76]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x5caf]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x6066]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x1e7f7]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0xde14]
=========     Host Frame:G:\库\Documents\Visual Studio 2013\Projects\shGpu\shGpu\test.exe [0x20700]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x12774]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x70d51]
=========
========= ERROR SUMMARY: 3 errors

Could you please tell me what this might be due to? Thanks!

An unspecified launch failure usually means that the kernel performed an out-of-bounds memory access and was killed. cuda-memcheck gives you the details: an invalid __global__ read of size 4 by thread (0,0,0) in block (0,0,0).
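Because a failure of this kind is only reported by a later API call (in the log above it surfaces at cudaThreadSynchronize, which is deprecated in favor of cudaDeviceSynchronize), it helps to check the status right after every launch. The following is a sketch of one common pattern, not code from the original program; cudaOk, CUDA_OK, fillKernel, and runFill are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed runtime call with file and line.
inline bool cudaOk(cudaError_t err, const char *call, const char *file, int line)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s:%d: %s failed: %s\n",
            file, line, call, cudaGetErrorString(err));
    return false;
}
#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Example kernel: sets every element of out[0..n) to `value`.
__global__ void fillKernel(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the last block may have surplus threads
        out[i] = value;
}

// Fills hostOut[0..n) via the GPU; returns false if any CUDA call failed.
// Assumes n > 0.
bool runFill(int *hostOut, int n, int value)
{
    int *devOut = nullptr;
    if (!CUDA_OK(cudaMalloc(&devOut, n * sizeof(int))))
        return false;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(devOut, n, value);

    bool ok = CUDA_OK(cudaGetLastError())       // invalid launch configuration
           && CUDA_OK(cudaDeviceSynchronize())  // faults raised while the kernel ran
           && CUDA_OK(cudaMemcpy(hostOut, devOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost));
    cudaFree(devOut);
    return ok;
}
```

Checking cudaGetLastError() and cudaDeviceSynchronize() separately distinguishes a launch that never started from a kernel that faulted while running.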

If you use a debug build (for example, one compiled with nvcc -G, or with -lineinfo in an optimized build), cuda-memcheck can tell you which line of the code contains this out-of-bounds global memory read.