memory access error

Heba_Khaled · January 9, 2013, 9:48pm

my program calls the kernel in a for loop, I make sure that the max. data passed to be shared in less than 15K and the global data to be less 1GB in each iteration oft he kernel execution. the data at the device i then copied from the global memory to shared for each block to start the execution… the first iteration is the only one that gets results from the device kernel, those results are not all right… I think the problem is in the memory accessing inside the kernel any suggestions??

const int GLOBAL_MAX_BYTES = 900000; // global size
const int SHARED_MAX_BYTES = 15900;  // shared size

__constant__ char *query = "MPKIIEAIYENGVFKPLQKVDLKEGEKIRILLKKIDVEKFIMAKLPEEKIRELERRFEDENLY"; //hardcoded
__constant__ const int queryLength = 63; //hardcoded
const int hostQueryLength = 63;

__global__ void Smith_Waterman_Kernel(char *records, int *lengths, int *offset, int *results, int maxRecLen)
{
	int bid = blockIdx.x;
	int tid;
	int idx = 0;

	int i, j;
	
	extern __shared__ char allShared[];// for passing multiple arrays of shared memory
	__shared__ char *rec;
	__shared__ int *dk_1;
	__shared__ int *dk_2;

	rec = allShared;
	dk_1 = (int*)(allShared + maxRecLen);
	dk_2 = dk_1 + queryLength;

	int recLen = lengths[bid];
	tid = threadIdx.x * recLen;
	
	for (int x = threadIdx.x; x < recLen; x += blockDim.x)
		rec[x] = records[offset[bid] + x];

	__syncthreads();
	
	int similar = 0 ;
	int end = queryLength + recLen - 2 - (threadIdx.x * 2);
	for(int k=0; k <= queryLength + recLen - 2; k++)
	{
		idx = k - threadIdx.x;
		int value=0;
		similar = 0;
		if((idx < 0) || (idx > end)) ;
		else
		{
			i = tid / recLen;
			j = tid % recLen;

			// fetch and compare
			char ch1 = query[i];
			char ch2 = rec[j];
			if (ch1 == ch2)
				similar = 2;			

			int upValue = calc(i - 1, j, dk_1, recLen); // another simple kernel that calculats the value given 3 paramertes
			int leftValue = calc(i, j - 1, dk_1, recLen);
			int diagonalValue = calc(i - 1, j - 1, dk_2, recLen);

			value = max2Values(upValue - 1, max2Values(leftValue - 1, diagonalValue + similar));// get the max of two values
			results[bid] = max2Values(results[bid], value);
		}//else

		__syncthreads();
		if (idx < 0 || idx > end)
			continue;

		// swap dk_1 and dk_2
		dk_2[threadIdx.x] = dk_1[threadIdx.x];
		dk_1[threadIdx.x] = value;

		int d = i + j;
		//increment the tid
		if(d < recLen - 1)
			tid++;
		else
			tid += recLen;
		// Synch
		__syncthreads();
	}
}

int main()
{
   // some code 
int it = 0;
	char *currentDBChar = host_1D_Array_pre;
	float total_elapsed=0.0;

	while (it < *numSeq)
	{
		int globalBytesSum = 0;
		int sharedBytes = 0;
		int recordsBytes = 0;
		int maxRecordLength = 0;
		float elapsedTime=0.0;

		int begin = it;		

		while (it < *numSeq)
		{
		int recordLength = protein_length_pre[it];
		int globalBytes = recordLength + 3 * 4; //the recordlength+ passing 3 int arrays 

		int blocksCountTillNow = it - begin;
		int temp = (recordLength + min(hostQueryLength, recordLength) * 4 * 2)*(blocksCountTillNow + 1);
		
		if (temp > SHARED_MAX_BYTES || globalBytesSum + globalBytes > GLOBAL_MAX_BYTES)
			break;

			recordsBytes += recordLength;
			sharedBytes = temp;
			globalBytesSum += globalBytes;
			maxRecordLength = recordLength;
			it++;
		}

		// Call the kernel

		int blocksCount = it - begin;
		int recordsCount = blocksCount;
		int threadsCount = hostQueryLength;
		
		char *dev_records;
		cudaMalloc(&dev_records, recordsBytes * sizeof(char));
		cudaMemcpy(dev_records, currentDBChar, recordsBytes * sizeof(char), cudaMemcpyHostToDevice);
		currentDBChar += recordsBytes;

		int *dev_lengths;
		cudaMalloc(&dev_lengths, recordsCount * sizeof(int));
		cudaMemcpy(dev_lengths, protein_length_pre + begin, recordsCount * sizeof(int), cudaMemcpyHostToDevice);

		int *dev_offsets;
		cudaMalloc(&dev_offsets, recordsCount * sizeof(int));
		cudaMemcpy(dev_offsets, protein_Offset_pre + begin, recordsCount * sizeof(int), cudaMemcpyHostToDevice);

		int *host_results = (int*)malloc(recordsCount * sizeof(int));
		int *dev_results;
		cudaMalloc(&dev_results, recordsCount * sizeof(int));

				//Calling the kernel
		Smith_Waterman_Kernel(dev_records, dev_lengths, dev_offsets, dev_results, maxRecordLength);

	cudaMemcpy(host_results, dev_results, recordsCount * sizeof(int), cudaMemcpyDeviceToHost);
		free(host_results);
		cudaFree(dev_records);
		cudaFree(dev_lengths);
		cudaFree(dev_offsets);
		cudaFree(dev_results);

	}
       printf("Total Elapsed time : %f ms
" ,total_elapsed);
	//Free Host Memory
   return 0;
}

njuffa · January 9, 2013, 10:03pm

cuda-memcheck can pinpoint out-of-bounds memory accesses in kernels.

Lev · January 9, 2013, 10:17pm

Why not to debug your program to see what is the problem?

Lev · January 9, 2013, 10:21pm

What does it mean? for(int k=0; k end)) ;

Heba_Khaled · January 9, 2013, 11:08pm

for(int k=0; k end)) ; was just a typing error … I modified it above…
and yes I debugged the code but it didn’t go inside some threads.

Lev · January 9, 2013, 11:47pm

I suggest to use old cuda version with emulation mode, so you can check all threads and debug your algorithm.

Lev · January 10, 2013, 12:36am

Also consider cudaprintf to get something from running kernel. Btw, maybe something like windows timer kills the kernel. Also check error codes from functions.

kbam · January 10, 2013, 11:54pm

Should test for errors when launching the kernel and when copying data to/from the device.
[url]Programming Guide :: CUDA Toolkit Documentation
[url]CUDA Runtime API :: CUDA Toolkit Documentation

Also the manual says “Any call to a global function must specify the execution configuration for that call” though possibly it defaults to something if not set.
[url]Programming Guide :: CUDA Toolkit Documentation

njuffa · January 11, 2013, 3:15am

The missing launch configuration in the posted code is presumably the fault of the forum software. It has a bug that causes it to remove text that follows a “less than” character. In this case it removed everything up to and including the matching “greater than” character, i.e. the entire launch configuration. The same issue affects the “less than” / “greater than” bracketing for #include. If you have a comparison that includes “less than” in your code it tends to remove the entire code after this character.

Sorry for this big inconvenience, I know from my own experience how confusing / annoying it can be. I have reported the various aspects of this issue more than once but it seems a fix hasn’t been deployed yet.

kbam · January 11, 2013, 5:35am

Thanks Njuffa, I had experienced this with another post but thought it was a copy paste problem.
However I notice that some people appear to have managed to work round it, is there a established easy work around ?

Cheers

njuffa · January 11, 2013, 6:06am

I am not aware of a workaround. If someone knows how to work around this please let us know. I have gone as far and inverted a comparison in the code from “less than” to “greater equal” to get code posted in the forums. The problems with “less than” happen regardless of whether I type text by hand or cut & paste.

Heba_Khaled · January 12, 2013, 8:52pm

thank you all very much… I’m working on the solution hopefully I can make it run correctly :). and yes the missing lunch configuration is from the forum. I already configured grid size, block size and shared memory in my global function call.

Topic		Replies	Views
One question regarding shared memory CUDA Programming and Performance	5	1234	April 24, 2013
MMU fault when debugging .. any clue CUDA Programming and Performance	0	882	April 27, 2013
Memory management issues Global and Shared memory management CUDA Programming and Performance	12	3868	March 2, 2009
Memory Checker detected 12 access violations. error = access violation on load (global memory) CUDA Programming and Performance	6	1582	January 21, 2018
shared memory problems CUDA Programming and Performance	12	4981	October 2, 2010
17x drop in Cuda performance When each thread operate on subset of kernel input data CUDA Programming and Performance	7	1683	April 16, 2012
an illegal memory access was encountered CUDA Programming and Performance	7	63657	November 10, 2017
Maximum number of instruction inside a Kernel CUDA Programming and Performance	9	2814	October 13, 2009
CUDA kernels keep on crashing CUDA Programming and Performance	6	3643	October 27, 2008
Global memory "leak" after register spill? CUDA Programming and Performance cuda , kernel	16	62	September 3, 2024

memory access error

Related topics