I just finished the proof-of-concept port of some numerically intensive code to CUDA; now I am starting to plan the real production version. This immediately raises the issue that a certain number of global variables and, ideally, some small global arrays in constant memory would be very convenient. As has been discussed here several times, __device__ and __constant__ variables have the scope of their compilation unit and are not truly global. I can actually live with that; however, from hints in this forum and on Stack Overflow, as well as from the code of the OpenMM project (SimTK: OpenMM: Project Home), I have come up with a working hack.
The only annoying thing is that cuda-memcheck yells, and I would like to understand why. I attach the full source code, but let me outline the pertinent steps. Clearly, it is nonsensical to make all this effort for just an int, but the same works for a structure that, e.g., holds several global variables, or even pointers to device memory allocated with cudaMalloc and friends, effectively giving you dynamically allocated global arrays (this is heavily used in OpenMM). I have tested the principle outlined below in real working code, and so far I get correct results, but if I run the code under cuda-memcheck it gets killed immediately. I am working with version 3.2 of the toolkit (64-bit Linux), but a variant of the code below compiles and runs correctly even with version 2.3 of the toolkit.
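For illustration only, here is roughly what I mean by the structure variant. The names (GpuGlobals, globals_gpu, globals_init, scale_kernel) are invented for this post and are not in the attached code; HANDLE_ERROR is the usual macro from mydef, and the symbol handling mirrors the int example below:

/* sketch only -- invented names, same pattern as the int example below */
struct GpuGlobals {
    int    n;      /* a scalar "global variable" */
    float *data;   /* a dynamically allocated "global array" */
};

static __device__ GpuGlobals globals_gpu;   /* one copy per compilation unit */

void globals_init(int n, GpuGlobals **gp)
{
    GpuGlobals tmp;
    tmp.n = n;
    HANDLE_ERROR(cudaMalloc((void**)&tmp.data, n * sizeof(float)));
    /* copy the filled structure into the device symbol ... */
    HANDLE_ERROR(cudaMemcpyToSymbol(globals_gpu, &tmp, sizeof(GpuGlobals)));
    /* ... and export its address so other compilation units can use it */
    HANDLE_ERROR(cudaGetSymbolAddress((void**)gp, "globals_gpu"));
}

/* a kernel in another file then only needs the pointer */
__global__ void scale_kernel(GpuGlobals *g, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < g->n)
        g->data[idx] *= factor;
}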
I have three files (main.cu, this.cu, soso.cu); a fourth source file (mydef.cu) contains the usual error checking routines / macros and is irrelevant to the following. Of course, there are corresponding *.h files as well. The main routine (in main.cu) calls routines in this.cu and soso.cu. In this.cu I define
static __device__ int i_gpu;
However, I also want to access i_gpu in soso.cu. Hence I use cudaGetSymbolAddress to get the address of i_gpu in this.cu.
void this_init(int value, int **ip)
{
    /* initialize i_gpu and hand its device address back to the caller */
    HANDLE_ERROR(cudaMemcpyToSymbol(i_gpu, &value, sizeof(int)));
    HANDLE_ERROR(cudaGetSymbolAddress((void**)ip, "i_gpu"));
    printf("From CPU this : i_gpu address after = %p\n", *ip);
}
The pointer holding this address is passed via the main program to a routine in soso.cu, i.e.
int *ip = NULL;
this_init(i, &ip);
soso_work(ip);
and, voila, soso_work effectively knows i_gpu:
__global__ void soso_kernel(int *ip1)
{
    printf("soso: this addr. = %p\n", ip1);
    printf("soso: this value = %d\n", *ip1);
    *ip1 *= 2;
    printf("soso: this addr. = %p\n", ip1);
    printf("soso: this value = %d\n", *ip1);
}
void soso_work(int *ip1)
{
    soso_kernel<<<1,1>>>(ip1);
    checkCUDAError("soso_kernel:");
    HANDLE_ERROR(cudaThreadSynchronize()); // in case of printfs in kernel code
}
After this call, routines in this.cu see the modified value of i_gpu, as expected. All of this works analogously with constant instead of device (see the sketch after this paragraph), although one could argue that it's not polite to change something declared as constant. If I run the executable with cuda-memcheck, an error is triggered already at the first printf, i.e., even a read appears to be illegal.
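For completeness, the constant variant I mean looks roughly like this (sketch with invented names, not verbatim from the attachment):

static __constant__ int i_const_gpu;

void const_init(int value, int **ip)
{
    HANDLE_ERROR(cudaMemcpyToSymbol(i_const_gpu, &value, sizeof(int)));
    /* the returned address points into constant memory */
    HANDLE_ERROR(cudaGetSymbolAddress((void**)ip, "i_const_gpu"));
}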
I have the following questions: (i) Am I overlooking something, i.e., is there a small change that would make cuda-memcheck happy? (ii) Does the code hit a limitation of cuda-memcheck? (iii) Or am I doing something really dangerous that could break with a new version of the toolkit?
Thanks in advance for suggestions, hints and explanations,
Stefan
PS: Compile the full code with
nvcc -arch=sm_20 -g -G -o mytest main.cu this.cu soso.cu mydef.cpp
Obviously for compute capabilities < 2.0 one has to switch to cuPrintf or something like this, but the principle remains the same!
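For reference, with the cuPrintf utility from the SDK the soso.cu part would look roughly like this (untested sketch; the format-specifier support is more limited than with real device-side printf, so I only print the value here):

#include "cuPrintf.cu"   /* from the SDK */

__global__ void soso_kernel(int *ip1)
{
    cuPrintf("soso: this value = %d\n", *ip1);
    *ip1 *= 2;
    cuPrintf("soso: this value = %d\n", *ip1);
}

void soso_work(int *ip1)
{
    cudaPrintfInit();                  /* allocate the device-side print buffer */
    soso_kernel<<<1,1>>>(ip1);
    cudaPrintfDisplay(stdout, true);   /* copy the buffered output to stdout */
    cudaPrintfEnd();
}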
global_test.tar.gz (32.6 KB)