Couldn't write a value to shared memory Error: "unspecified launch failure"

shtv · June 2, 2010, 12:00pm

Hello!

Below i’m placing some little code. When i uncomment one line:

unsorted_thread=1;

i get an error: “unspecified launch failure”.

Why isn’t this code correct?

__global__ void check_order(elem *g_elems, sum* g_sums, int n_real,int n,int num_blocks,int num_blocks2){

	const int threads_num=blockDim.x; // number of threads in each block

	const int bid=blockIdx.x; // given block's number

  const int thid=threadIdx.x; // thread's number in given block

	const int thread_elems_num=n/threads_num; // number of elements in ech thread

	const int begin=bid*n+thid*thread_elems_num;

	extern __shared__ int absolute_shared[];

	int* shared=(int*)&absolute_shared[0];

	int thread_elems[MAX_REGISTERS_PER_THREAD];

	//

	for(int i=0;i<thread_elems_num;++i){

		thread_elems[i]=g_elems[begin+i].val;

	}

	if(bid==num_blocks-1 && thid==threads_num-1)

		thread_elems[thread_elems_num]=INT_MAX;

	else

		thread_elems[thread_elems_num]=g_elems[begin+thread_elems_num].val;

	int unsorted_thread=0;

	for(int i=0;i<thread_elems_num-1 && i<MAX_REGISTERS_PER_THREAD-1;++i){

		if(thread_elems[i]>thread_elems[i+1]){

//			unsorted_thread=1;

			break;

		}

	}

	__syncthreads();

	shared[thid]=unsorted_thread;

return;

}

avidday · June 2, 2010, 12:29pm

The code probably is correct. The block size you are using probably isn’t. Uncommenting that 1 line probably uses an extra register and puts your execution parameters out of limits. This is discussed in Chapter 4 of the programming guide. You can see the number of registers your kernel uses by passing the -verbose argument to ptxas during compilation.

shtv · June 2, 2010, 1:46pm

–verbose says nothing about block size and registers. When i uncomment this line and add (to if-block) “else” with the same line, it works. So there’s a problem with registers. How can i solve this? I want rewrite values from global memory to registers (thread_elems), next compare themselves (these registers) and write result to unsorted_thread. At the end i rewrite result from unsorted_thread to shared memory. I call kernel with one block, two threads and shared memory size=16336.

And, maybe helpful:

#define MAX_REGISTERS_PER_THREAD 9

SPWorley · June 2, 2010, 2:12pm

Run the code in Ocelot to find the exact line and cause of the error. Nexus and cuda-gdb might also work.

A likely cause: you’re not providing enough dynamic shared memory for the number of threads you’re using per block. You need to specify threadcount*sizeof(int) as your shared memory size at the kernel invocation. Perhaps you forgot this, or specified just threadcount?

Next, don’t overanalyze that one line you comment out. You’re underestimating the optimizations the compiler can perform. with that commented out, your entire program likely reduces by successive dead code reduction down to a single line or two.

Finally, I assume this isn’t really your program, since in practice it never writes anything to global memory so the whole kernel is just a no-op.

shtv · June 2, 2010, 2:41pm

I resized shared memory size (from 16336 to threadcount*sizeof(int)) but still got the same error.
It is my program, but the rest of code i commented and didn’t paste there, so you can’t see writing to global memory. (I must commented next code lines while error detecting.) But that i commented can’t be a problem’s reason, yes? OK, i’ll restore it.

I’ll try with cuda-gdb. Thanks! But are there any other ideas to solve this problem?

// I have entire project at git repository:

[url=“GitHub - shtv/pbGPUQuicksort: Like a name.”]http://github.com/shtv/pbGPUQuicksort[/url] .

avidday · June 2, 2010, 3:01pm

It most certainly does:

avidday@cuda:~/code/Jacobi$ nvcc -arch=sm_13 -Xptxas "--verbose" jacobi.cu -o jacobi

ptxas info	: Compiling entry function '_Z10JORKernelRPKfS0_S0_Pf' for 'sm_13'

ptxas info	: Used 10 registers, 808+16 bytes smem, 12 bytes cmem[0], 12 bytes cmem[1]

ptxas info	: Compiling entry function '_Z10JORKernelCPKfS0_S0_Pf' for 'sm_13'

ptxas info	: Used 8 registers, 32+16 bytes smem, 12 bytes cmem[0], 8 bytes cmem[1]

ptxas info	: Compiling entry function '_Z12JORResidualRPKfS0_S0_Pf' for 'sm_13'

ptxas info	: Used 9 registers, 800+16 bytes smem, 12 bytes cmem[0], 4 bytes cmem[1]

ptxas info	: Compiling entry function '_Z12JORResidualCPKfS0_S0_Pf' for 'sm_13'

ptxas info	: Used 6 registers, 32+16 bytes smem, 12 bytes cmem[0]

Right there you have the per thread register, shared memory and constant memory usage for each kernel. If you read the section of the programming guide I referred you to, you can calculate the block size limit for each kernel you need to launch.

shtv · June 2, 2010, 3:05pm

Yes, you’re right.

ptxas info	: Compiling entry function '_Z11check_orderP4elemP3sumiiii'

ptxas info	: Used 9 registers, 36+0 bytes lmem, 32+16 bytes smem, 4 bytes cmem[1]

Topic		Replies	Views
Causes of unspecified launch failure CUDA Programming and Performance	8	8883	July 9, 2009
Unspecified Launch Failures CUDA Programming and Performance	4	6952	March 23, 2007
Unspecified Launch Failure Memory Read CUDA Programming and Performance	7	552	August 30, 2018
shared memory bug. CUDA Programming and Performance	2	2970	July 26, 2010
unspecified launch failure simple volume initialization fails CUDA Programming and Performance	9	6221	August 26, 2007
"unspecified launch failure" - ERROR CUDA Programming and Performance	9	14323	July 19, 2011
"unspecified launch failure" runtime failure CUDA Programming and Performance	6	3329	May 9, 2009
Unspecified launch failure strange error, please help CUDA Programming and Performance	13	16304	December 31, 2007
CUDA error: unspecified launch failure CUDA Programming and Performance	0	1440	May 10, 2011
Shared memory limits and cudaError_enum How to precisely determine how much of the shared memory is CUDA Programming and Performance	5	2825	April 29, 2009

Couldn't write a value to shared memory Error: "unspecified launch failure"

Related topics