New dynamic shared memory allocation in CUDA 5?

Shura58 · November 7, 2012, 12:42am

Hi all,
I’m having trouble migrating my code from CUDA 4.2 to 5.0. The dynamically allocated shared memory specified in the kernel call doesn’t seem to be available to the device functions. I declare a shared structure with an internal array using the allocated shared memory. When trying to access this array from within a device function, I get cuda-memcheck errors.
Has anything changed in the way dynamic shared memory is specified? Are there any new restrictions on which functions can use pointers to this memory?

Many thanks for any hints.

Below is an example that reproduces the problem. CentOS 6.3, x86_64, CUDA 5.0, driver 304.51.

/* gpu.cu */
#include
#include “gpu.h”

device shared cuda_model model;

device void populate_model(cuda_model* m){
if (threadIdx.x < 16){
m->matrix[threadIdx.x] = threadIdx.x;
}
__syncthreads();
}

global void compute(){

extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;
populate_model(&model);

if (threadIdx.x == 0){
	for (int i = 0; i < 16; i++){
		printf("	%d

", model.matrix[i]);
}

	printf("Completed

");
}

}

host void run_kernel(){
compute();
cudaDeviceSynchronize();
}

/* main.cpp */
#include
#include “gpu.h”

int main(int argc, char* argv){

run_kernel();
printf("kernel completed

");

return 0;

}

/* gpu.h */
#ifndef GPU_H_
#define GPU_H_

void run_kernel();

struct cuda_model{
int* matrix;
};

#endif /* GPU_H_ */

cuda-memcheck:
[sasha@gpudev Debug]$ cuda-memcheck ./Test
========= CUDA-MEMCHECK
kernel completed
========= Invalid global read of size 8
========= at 0x00000090 in /home/sasha/cuda-workspace/Test/Debug/…/gpu.cu:8:populate_model(cuda_model*)
========= by thread (15,0,0) in block (0,0,0)
========= Address 0x01000000 is out of bounds
========= Device Frame:/home/sasha/cuda-workspace/Test/Debug/…/gpu.cu:19:compute(void) (compute(void) : 0x128)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9d3c]
========= Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 [0x11d54]
========= Host Frame:/usr/local/cuda-5.0/lib64/libcudart.so.5.0 (cudaLaunch + 0x182) [0x38152]
========= Host Frame:./Test [0x96d]
========= Host Frame:./Test [0x9f9]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
========= Host Frame:./Test [0x789]

njuffa · November 7, 2012, 1:19am

The following looks dangerous:

extern __shared__ short shared_data[];
model.matrix = (int*)shared_data;

In device code, all memory accesses must be aligned to natural boundaries, i.e. shorts are aligned on 2-byte boundaries, ints and floats are aligned on 4-byte foundaries, doubles are aligned on 8-byte boundaries. Converting a pointer to one type into a pointer to a type with tighter alignment requirement (as you do here) is therefore asking for trouble, as proper alignment for access throuhg the second pointer cannot be guranteed. In general, I would suggest to declare the extern shared object to be of the widest type used, and sort the various elements in order of descending width.

What happens when you switch to “extern shared int shared_data;”

Shura58 · November 7, 2012, 1:23am

The same happens with the int version. And yes, I’m aware of the type issue, and in the actual project code I cover for that by adjusting the offsets. There, I need to accommodate a variety of types from short to double resident in shared memory, hence the lowest common denominator for the extern array…

Have you been able to reproduce the problem? I tried to make the code really compact, so that it could be tested easily…

Thanks

njuffa · November 7, 2012, 1:41am

If you believe this to be a bug in CUDA, please file a bug report through the registered developer website.

Shura58 · November 7, 2012, 5:23pm

Well, I’m not sure it’s a bug, since there may have been changes in the API as of CUDA 5.0 release. Can someone from Nvidia confirm that the syntax I’m using is correct? I’d rather fix the issue in my code and move on than wait for the bug report outcome.

Thanks

vyas · November 8, 2012, 6:11pm

Can you try compiling your application with “-O2 -lineinfo” instead of with “-G”, and then running cuda-memcheck on it ?

Topic		Replies	Views
In-function initialization of a shared struct produces a memcheck error CUDA Programming and Performance	0	725	March 13, 2013
Bug in cuda-memcheck? CUDA Programming and Performance	2	1415	March 14, 2013
Memory problem with dynamic array in kernel CUDA Programming and Performance	2	2525	March 11, 2016
CUDA_ERROR_ILLEGAL_ADDRESS CUDA Programming and Performance	6	10842	September 26, 2017
How to implement a generic array in unified memory? CUDA Programming and Performance	7	2027	July 2, 2017
passing an array to a kenel ? CUDA Programming and Performance	9	13168	June 10, 2009
Problem in getting __device__ dynamic array data back to host CUDA Programming and Performance	13	2136	January 18, 2013
Memcheck error accessing heap memory in non-default stream CUDA Programming and Performance	6	777	January 10, 2018
strange behavior with device emulation CUDA Programming and Performance	5	2693	May 20, 2008
How can I configure this problem is it too big to fit in shared memory? CUDA Programming and Performance	7	3737	October 14, 2008

New dynamic shared memory allocation in CUDA 5?

Related topics