global __device__ variables: only cuda-memcheck doesn't like them

I just finished a proof-of-concept port of some numerically intensive code to CUDA; now I'm starting to plan the real production version. This immediately raises an issue: a certain number of global variables and, ideally, some small global arrays in constant memory would be nice to have. As has been discussed here several times, __device__ and __constant__ variables have the scope of their compilation unit and are not truly global. I can actually live with that; however, from hints in this forum and on Stack Overflow, as well as the code of the OpenMM project (SimTK: OpenMM: Project Home), I have come up with a working hack.

The only annoying thing is that cuda-memcheck yells, and I would like to understand why. I attach the full source code, but let me outline the pertinent steps. Clearly, it is nonsensical to make all this effort for just an int, but the same works for a structure that holds, e.g., several global variables, or even pointers to device memory allocated with cudaMalloc and friends, effectively giving you dynamically allocated global arrays (this is used heavily in OpenMM). I have tested the principle outlined below in real working code, and so far I get correct results, but if I run the code under cuda-memcheck it gets killed immediately. I am working with version 3.2 of the toolkit (64-bit Linux), but a variant of the code below compiles and runs correctly even with version 2.3 of the toolkit.
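To give an idea of the structure variant, here is a minimal sketch of the OpenMM-style pattern. The names (GlobalState, gs_gpu, state_init) and the float array are made up for this illustration and are not taken from the attached code:

// Hypothetical struct collecting several "globals", including a pointer
// to device memory that is allocated dynamically from the host.
struct GlobalState {
  int    i;
  float  scale;
  float *data;   // will receive a cudaMalloc'ed pointer
};

static __device__ GlobalState gs_gpu;

void state_init(int i, float scale, size_t n, GlobalState **sp)
{
  GlobalState tmp;
  tmp.i     = i;
  tmp.scale = scale;
  HANDLE_ERROR(cudaMalloc( (void**)&tmp.data, n*sizeof(float) ));
  // Copy the whole struct (device pointer included) to the symbol...
  HANDLE_ERROR(cudaMemcpyToSymbol( gs_gpu, &tmp, sizeof(GlobalState) ));
  // ...and export its device address, just as for the plain int below.
  HANDLE_ERROR(cudaGetSymbolAddress( (void**)sp, "gs_gpu" ));
}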

I have three files (main.cu, this.cu, soso.cu); a fourth source file (mydef.cu) contains the usual error-checking routines and macros and is irrelevant to the following. Of course, there are corresponding *.h files as well. The main routine (in main.cu) calls routines in this.cu and soso.cu. In this.cu I define

static __device__ int i_gpu;

However, I also want to access i_gpu in soso.cu. Hence I use cudaGetSymbolAddress to get the address of i_gpu in this.cu.

void this_init(int value, int **ip)
{
  // Initialize the file-scope __device__ variable from the host...
  HANDLE_ERROR(cudaMemcpyToSymbol( i_gpu, &value, sizeof(int) ));
  // ...and export its device address for use in other compilation units.
  HANDLE_ERROR(cudaGetSymbolAddress( (void**)ip, "i_gpu" ));
  printf("From CPU this : i_gpu address after = %p\n", *ip);
}

The pointer holding this address is passed via the main program to a routine in soso.cu, i.e.

int *ip = NULL;
this_init(i, &ip);
soso_work(ip);

and, voila, soso_work effectively knows i_gpu:

__global__ void soso_kernel(int *ip1)
{
  printf("soso: this addr. = %p\n", ip1);
  printf("soso: this value = %d\n", *ip1);  // read i_gpu through the pointer
  *ip1 *= 2;                                // write i_gpu through the pointer
  printf("soso: this addr. = %p\n", ip1);
  printf("soso: this value = %d\n", *ip1);
}

void soso_work(int *ip1)
{
  soso_kernel<<<1,1>>>(ip1);
  checkCUDAError("soso_kernel:");
  HANDLE_ERROR(cudaThreadSynchronize()); // in case of printfs in kernel code
}

After this call, routines in this.cu see the modified value of i_gpu, as expected. If I run the executable under cuda-memcheck, however, an error is triggered already at the first printf, i.e., even a read appears to be illegal. All of this works analogously with __constant__ instead of __device__, although one could argue that it is not polite to change something declared as constant.
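For completeness, a minimal sketch of the __constant__ variant (this_init_const and i_gpu_const are names invented for this illustration):

static __constant__ int i_gpu_const;

void this_init_const(int value, int **ip)
{
  // __constant__ symbols are written from the host just like __device__ ones...
  HANDLE_ERROR(cudaMemcpyToSymbol( i_gpu_const, &value, sizeof(int) ));
  // ...and their address can be exported the same way. Writing through
  // this pointer from a kernel is the "impolite" part mentioned above.
  HANDLE_ERROR(cudaGetSymbolAddress( (void**)ip, "i_gpu_const" ));
}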

I have the following questions: (i) Am I overlooking something, i.e., is there a small change that would make cuda-memcheck happy? (ii) Does the code trigger a limitation of cuda-memcheck? (iii) Or am I doing something really dangerous that could break with new versions of the toolkit?

Thanks in advance for suggestions, hints and explanations,

Stefan

PS: Compile the full code with

nvcc -arch=sm_20 -g -G -o mytest main.cu this.cu soso.cu mydef.cu

Obviously for compute capabilities < 2.0 one has to switch to cuPrintf or something like this, but the principle remains the same!
global_test.tar.gz (32.6 KB)

According to Appendix B.2.4.4 of the Programming Guide, the address of a device or constant variable obtained through cudaGetSymbolAddress() can only be used in host code.

You might be able to take the address in device code

__device__ int *i_gpu_p = &i_gpu;

and transfer that to the host or to a variable on the device from the other compilation unit.
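For instance (a sketch using cudaMemcpyFromSymbol and your error-checking macro; i_gpu_p is just a suggested name):

// In this.cu: take the address at static initialization time, in device code.
static __device__ int i_gpu;
__device__ int *i_gpu_p = &i_gpu;

// On the host: fetch the pointer value stored in i_gpu_p...
int *ip = NULL;
HANDLE_ERROR(cudaMemcpyFromSymbol( &ip, i_gpu_p, sizeof(int*) ));
// ...and pass ip to kernels in other compilation units as before.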

I have seen this restriction (and admit that I don't understand what it means), but the address obtained by cudaGetSymbolAddress() is identical to the one obtained with your suggestion, i.e., with

__device__ int *i_gpu_p = &i_gpu;

This, and the fact that the program gives the expected result, makes me confident that I have the correct and, in principle, valid device address. I am just puzzled that cuda-memcheck yells about exactly any access to this address. As far as I can see, I may not have the right to use the address (see, e.g., B.2.4.4), but for all practical purposes it seems to be valid.
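In code, the check amounts to something like this (a sketch along the lines of my test, not the exact code from the attachment):

int *ip_symbol = NULL;
int *ip_device = NULL;
// Address as reported by the runtime...
HANDLE_ERROR(cudaGetSymbolAddress( (void**)&ip_symbol, "i_gpu" ));
// ...and the address taken in device code, copied back to the host.
HANDLE_ERROR(cudaMemcpyFromSymbol( &ip_device, i_gpu_p, sizeof(int*) ));
// Both values come out identical here.
printf("symbol address = %p, device address = %p\n",
       (void*)ip_symbol, (void*)ip_device);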

Thank you for the reply!

Stefan