Hi all,
I’m having trouble migrating my code from CUDA 4.2 to 5.0. The dynamically allocated shared memory specified in the kernel call doesn’t seem to be available to the device functions. I declare a shared structure with an internal array using the allocated shared memory. When trying to access this array from within a device function, I get cuda-memcheck errors.
Has anything changed in the way dynamic shared memory is specified? Are there any new restrictions on which functions can use pointers to this memory?
Many thanks for any hints.
Below is an example that reproduces the problem. CentOS 6.3, x86_64, CUDA 5.0, driver 304.51.
extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;
populate_model(&model);
if (threadIdx.x == 0){
for (int i = 0; i < 16; i++){
printf(" %d
extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;
In device code, all memory accesses must be aligned to natural boundaries, i.e. shorts are aligned on 2-byte boundaries, ints and floats on 4-byte boundaries, and doubles on 8-byte boundaries. Converting a pointer to one type into a pointer to a type with a stricter alignment requirement (as you do here) is therefore asking for trouble, as proper alignment for access through the second pointer cannot be guaranteed. In general, I would suggest declaring the extern shared object with the widest type used, and sorting the various elements in order of descending width.
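As a rough sketch of that suggestion: the kernel name layout_demo, the host helper run_layout_demo, and the element counts are all made up for illustration and are not taken from the code in question. It assumes a buffer holding n doubles, n ints, and n shorts, declared as double and carved up widest first so that every sub-array starts on a boundary suitable for its type.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Declared with the widest type that will live in the buffer (double here),
// so the base address also satisfies the alignment of every narrower type.
extern __shared__ double smem[];

// Hypothetical kernel: carves n doubles, n ints and n shorts out of one buffer.
__global__ void layout_demo(int n, int *out)
{
    double *d = smem;                              // 8-byte elements first
    int    *i = reinterpret_cast<int *>(d + n);    // byte offset 8n, a multiple of 4
    short  *s = reinterpret_cast<short *>(i + n);  // byte offset 12n, a multiple of 2

    for (int k = threadIdx.x; k < n; k += blockDim.x) {
        d[k] = 1.0 * k;
        i[k] = 2 * k;
        s[k] = static_cast<short>(3 * k);
    }
    __syncthreads();

    // One thread sums all three sub-arrays to show that each is readable.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int k = 0; k < n; k++)
            sum += static_cast<int>(d[k]) + i[k] + s[k];
        *out = sum;
    }
}

// Hypothetical host helper: launches the kernel and returns the sum, or -1 on error.
int run_layout_demo()
{
    const int n = 16;
    // The third launch parameter is the dynamic shared memory size in bytes.
    const size_t bytes = n * (sizeof(double) + sizeof(int) + sizeof(short));
    int *d_out = NULL;
    int h_out = -1;

    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess)
        return -1;
    layout_demo<<<1, 32, bytes>>>(n, d_out);
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}
```

Calling run_layout_demo() from main should return 720 for n = 16 (the sum of k + 2k + 3k for k from 0 to 15). Because the sub-arrays are laid out in descending element width, no padding is needed between them.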
What happens when you switch to “extern __shared__ int shared_data[];”?
The same happens with the int version. And yes, I’m aware of the type issue, and in the actual project code I cover for that by adjusting the offsets. There, I need to accommodate a variety of types from short to double resident in shared memory, hence the lowest common denominator for the extern array…
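For reference, a minimal sketch of what that kind of offset adjustment can look like. It is not the project code: align_up, mixed_demo, run_mixed_demo and the raw_smem byte buffer are hypothetical names, and the example assumes the buffer itself is declared with 8-byte alignment so that rounded-up offsets are enough to place doubles after shorts.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Round offset up to the next multiple of alignment (alignment must be a power of two).
__host__ __device__ inline size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Byte buffer with an explicit 8-byte alignment request on the base address.
extern __shared__ __align__(8) unsigned char raw_smem[];

// Hypothetical kernel: n_short shorts followed by n_double doubles.
// Intended for a single-thread launch, so no synchronization is needed.
__global__ void mixed_demo(int n_short, int n_double, double *out)
{
    size_t offset = 0;
    short *s = reinterpret_cast<short *>(raw_smem + offset);
    offset += n_short * sizeof(short);

    // Pad the offset so the doubles start on an 8-byte boundary.
    offset = align_up(offset, sizeof(double));
    double *d = reinterpret_cast<double *>(raw_smem + offset);

    for (int k = 0; k < n_short; k++)
        s[k] = static_cast<short>(k);
    for (int k = 0; k < n_double; k++)
        d[k] = 0.5 * k;

    *out = d[n_double - 1] + s[n_short - 1];
}

// Hypothetical host helper: returns the kernel result, or -1.0 on error.
double run_mixed_demo()
{
    const int n_short = 3, n_double = 4;
    // The host must reserve the padding as well: 6 bytes of shorts round up to 8.
    const size_t bytes = align_up(n_short * sizeof(short), sizeof(double))
                       + n_double * sizeof(double);
    double *d_out = NULL;
    double h_out = -1.0;

    if (cudaMalloc(&d_out, sizeof(double)) != cudaSuccess)
        return -1.0;
    mixed_demo<<<1, 1, bytes>>>(n_short, n_double, d_out);
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1.0;
}
```

With 3 shorts and 4 doubles, run_mixed_demo() should return 3.5 (d[3] = 1.5 plus s[2] = 2). The same align_up call is used on the host and on the device so that the launch size and the in-kernel offsets cannot drift apart.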
Have you been able to reproduce the problem? I tried to make the code really compact, so that it could be tested easily…
Well, I’m not sure it’s a bug, since there may have been changes in the API as of the CUDA 5.0 release. Can someone from NVIDIA confirm that the syntax I’m using is correct? I’d rather fix the issue in my code and move on than wait for the outcome of a bug report.