New dynamic shared memory allocation in CUDA 5?

Hi all,
I’m having trouble migrating my code from CUDA 4.2 to 5.0. The dynamically allocated shared memory specified in the kernel call doesn’t seem to be available to the device functions. I declare a shared structure with an internal array using the allocated shared memory. When trying to access this array from within a device function, I get cuda-memcheck errors.
Has anything changed in the way dynamic shared memory is specified? Are there any new restrictions on which functions can use pointers to this memory?

Many thanks for any hints.

Below is an example that reproduces the problem. CentOS 6.3, x86_64, CUDA 5.0, driver 304.51.

/* */
#include “gpu.h”

device shared cuda_model model;

device void populate_model(cuda_model* m){
if (threadIdx.x < 16){
m->matrix[threadIdx.x] = threadIdx.x;

global void compute(){

extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;

if (threadIdx.x == 0){
	for (int i = 0; i < 16; i++){
		printf("	%d

", model.matrix[i]);




host void run_kernel(){

/* main.cpp */
#include “gpu.h”

int main(int argc, char* argv){

printf("kernel completed


return 0;


/* gpu.h */
#ifndef GPU_H_
#define GPU_H_

void run_kernel();

struct cuda_model{
int* matrix;

#endif /* GPU_H_ */

[sasha@gpudev Debug]$ cuda-memcheck ./Test
kernel completed
========= Invalid global read of size 8
========= at 0x00000090 in /home/sasha/cuda-workspace/Test/Debug/…/*)
========= by thread (15,0,0) in block (0,0,0)
========= Address 0x01000000 is out of bounds
========= Device Frame:/home/sasha/cuda-workspace/Test/Debug/…/ (compute(void) : 0x128)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/ (cuLaunchKernel + 0x3dc) [0xc9d3c]
========= Host Frame:/usr/local/cuda-5.0/lib64/ [0x11d54]
========= Host Frame:/usr/local/cuda-5.0/lib64/ (cudaLaunch + 0x182) [0x38152]
========= Host Frame:./Test [0x96d]
========= Host Frame:./Test [0x9f9]
========= Host Frame:/lib64/ (__libc_start_main + 0xfd) [0x1ecdd]
========= Host Frame:./Test [0x789]

The following looks dangerous:

extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;

In device code, all memory accesses must be aligned to natural boundaries, i.e. shorts are aligned on 2-byte boundaries, ints and floats are aligned on 4-byte foundaries, doubles are aligned on 8-byte boundaries. Converting a pointer to one type into a pointer to a type with tighter alignment requirement (as you do here) is therefore asking for trouble, as proper alignment for access throuhg the second pointer cannot be guranteed. In general, I would suggest to declare the extern shared object to be of the widest type used, and sort the various elements in order of descending width.

What happens when you switch to “extern shared int shared_data;”

The same happens with the int version. And yes, I’m aware of the type issue, and in the actual project code I cover for that by adjusting the offsets. There, I need to accommodate a variety of types from short to double resident in shared memory, hence the lowest common denominator for the extern array…

Have you been able to reproduce the problem? I tried to make the code really compact, so that it could be tested easily…


If you believe this to be a bug in CUDA, please file a bug report through the registered developer website.

Well, I’m not sure it’s a bug, since there may have been changes in the API as of CUDA 5.0 release. Can someone from Nvidia confirm that the syntax I’m using is correct? I’d rather fix the issue in my code and move on than wait for the bug report outcome.


Can you try compiling your application with “-O2 -lineinfo” instead of with “-G”, and then running cuda-memcheck on it ?