extern __shared__ variable: runtime error at kernel return


I get a runtime error at the return of a kernel that uses an extern shared variable.

This is not a useful program as such; it is just code to test how the different memory types should be used.

Here are my definitions:

extern __device__ __shared__ int tab[];

__global__ static void sort (int *table)
{
	int start;

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx < 16) {
		start = shell[idx];
		shellKn (start);
	}

	// if last loop, copy the tab content in table
	if (idx == BLCKNB*BLCKTH - 1) {
		for (int i = 0; i < size; i++)
			table[i] = tab[i];
	}
}

As I explained, the error occurs after the return line. It must be due to a problem when the shared variable's memory is freed.

Note: BLCKNB*BLCKTH WILL definitely be equal to 16 (the error does not come from that).

I have searched on different websites and in PDFs, including the programming guide, but I can't figure out why I get this error.

Thanks for your help!

Shared memory variables are strictly block scope and must be declared inside __global__ functions.

Also the concept of extern in this context is meaningless. All CUDA declarations are file scope only. You cannot declare device variables extern, and there is no linking of device code that could make externally declared symbols work, even if it was permitted in the language.

Indeed, if you want to declare a variable in shared memory as an external array, the only way to do this is:

extern __shared__ float shared[];

I’m quoting the programming guide.

As far as the __device__ qualifier is concerned, even if I remove it, I get the same error (which is not surprising).

Any other idea?

Yes, but it is where you are trying to declare it that is wrong. The declaration must be inside the __global__ function to which it is attached.

This is wrong and won’t work:

extern __device__ __shared__ int tab[];

__global__ static void sort (int *table)
{
	...
}
This is the correct scope (and it changes the meaning of extern in the process):

__global__ static void sort (int *table)
{
	extern __shared__ int tab[];
	...
}
The difference between the two is subtle but important.
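Applied to the code in question, a sketch of the corrected kernel with the declaration moved inside (shell, shellKn, size, BLCKNB, and BLCKTH are taken from the original post and assumed to be defined elsewhere):

```cuda
__global__ static void sort (int *table)
{
	extern __shared__ int tab[];   // declared inside the kernel, sized at launch

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx < 16) {
		int start = shell[idx];    // shell/shellKn as in the original code
		shellKn (start);
	}

	// last thread copies the tab content into table
	if (idx == BLCKNB*BLCKTH - 1) {
		for (int i = 0; i < size; i++)
			table[i] = tab[i];
	}
}

// Launched with the shared-memory size as the third argument, e.g.:
// sort<<<BLCKNB, BLCKTH, size * sizeof(int)>>>(device_result);
```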

I'm afraid you are misinformed. nvcc does support extern shared declarations outside the kernel, and the programming guide does state that it is possible. While I agree that it makes more sense to declare it inside the kernel, I find myself copying and pasting old code that has it outside even now: proof https://codeblue.umich.edu/hoomd-blue/trac/…uteThermoGPU.cu - compiles and works 100% correctly with CUDA 3.0.

To the OP:

Are you certain that you are requesting the proper amount of extern shared memory in your kernel launch:

kernel<<<grid, threads, shared_mem_bytes>>>(...)

This is a common mistake that I often make myself, and it leads to runtime errors or incorrect behavior in device code that uses extern shared memory arrays.
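For illustration, a minimal self-contained sketch of the pattern (the names and size here are invented for the example, not taken from the thread):

```cuda
#define N 16

__global__ void copyThroughShared (int *out)
{
	extern __shared__ int buf[];            // size determined at launch time
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < N) {
		buf[threadIdx.x] = idx * 2;         // stage data in shared memory
		out[idx] = buf[threadIdx.x];
	}
}

// The third execution-configuration parameter reserves
// N * sizeof(int) bytes of shared memory per block for the extern array:
// copyThroughShared<<<1, N, N * sizeof(int)>>>(d_out);
```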

So the problem does not come from the declaration or the use of this variable? (Note: the execution gives the right values; the problem is just the exception on return.)

I'm still working in emulation mode (so sad :() and I have chosen:

#define BLCKNB 64

#define BLCKTH 128

sort<<<BLCKNB, BLCKTH>>>(device_result);

I don't know. I didn't take the time to try to understand what your code is doing; I just offered up a common cause of errors with this type of shared memory…

And clearly you do have this error. Without a third parameter in the execution configuration, zero bytes will be allocated for the extern shared array, meaning all of your blocks will be writing data to random memory locations and may or may not produce correct results. On a GTX 480, this produces an unspecified launch failure.
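One way to surface this kind of failure explicitly is to check the error state right after the launch; a hedged sketch (sharedBytes is a placeholder for whatever size the extern array actually needs):

```cuda
size_t sharedBytes = 16 * sizeof(int);   // placeholder size for the extern array

sort<<<BLCKNB, BLCKTH, sharedBytes>>>(device_result);

cudaError_t err = cudaGetLastError();    // catches launch-configuration errors
if (err == cudaSuccess)
	err = cudaThreadSynchronize();       // catches errors during execution (CUDA 3.x API)
if (err != cudaSuccess)
	printf("CUDA error: %s\n", cudaGetErrorString(err));
```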

What a shame :"> I first put a third parameter in my kernel call and removed it since I was not using shared memory.
But I forgot to add it back when I chose to use a shared variable. :verymad:

It's working great now, thanks a lot!