I have been working on a CUDA app to speed up some video processing routines. I recently decided to split one of the longer .cu files into multiple .cu files (it had two kernels in it). Both kernels use some constant memory as a transformation matrix, but I can’t seem to make the constant memory work across more than one .cu file.
This is what I have tried:
extern __device__ __constant__ float pix_transform; in the kernel .cu files
__device__ __constant__ float pix_transform; in the driver.cu file.
Strangely enough, when stepping through the program,
checkCUDAError(cudaMemcpyToSymbol("pix_transform", pixel_m, sizeof(pix_transform), 0, cudaMemcpyHostToDevice));
returns an invalid device symbol error.
However, when I remove the extern constant declaration from all the other files, it works fine. I’m kind of at a loss, because constant memory doesn’t seem to be particularly well documented; I only got it working in the first place through trial and error and by looking at other people’s code.
So my question is: is there a way to do this? To share constant memory between kernels that live in different files?
Thanks in advance
There is no linker for device code, so things like texture declarations, constant memory declarations and global memory symbols have file scope only. You can’t directly refer to a device symbol declared in one file from another. The usual workaround is to write “access” host stub functions which can be called from anywhere and which wrap things like texture binding and device memory symbol access. It isn’t always pretty, but it usually suffices.
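The stub pattern described above might look something like this (a sketch with hypothetical file and function names; the key point is that the `__constant__` symbol and everything that touches it live in one translation unit):

```cuda
// transform.cu -- the one file that owns the constant symbol
#include <cuda_runtime.h>

__constant__ float pix_transform[9];  // file-scope device symbol

// Host "access stub": other .cu/.cpp files call this through an
// ordinary C declaration, so they never see the symbol itself.
void set_pix_transform(const float *host_matrix)
{
    cudaMemcpyToSymbol(pix_transform, host_matrix, 9 * sizeof(float));
}

// Kernels that read pix_transform must also live in this file,
// so expose their launches through host wrappers too.
__global__ void warp_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pix_transform[0] * in[i];  // toy use of the matrix
}

void launch_warp(const float *d_in, float *d_out, int n)
{
    warp_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}
```

From any other file you then only declare `void set_pix_transform(const float *);` and `void launch_warp(const float *, float *, int);` — no device symbols ever cross the file boundary.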
CUDA doesn’t have a linker on the device side, so it can’t support extern. You could probably get by with having two pointers, one in each .cu file, and initializing them to the same allocation. Then again, for just 9 floats this does not really seem worth it.
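If the constant cache isn’t essential, the shared-allocation idea can be sketched like this (hypothetical names; the matrix becomes an ordinary global-memory buffer passed to each kernel as an argument, which works from any file):

```cuda
#include <cuda_runtime.h>

// file_a.cu -- a kernel in one file taking the matrix as a parameter
__global__ void kernel_a(const float *transform, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= transform[0];  // plain global-memory read (uncached on sm_1x)
}

// host code, in any file: one allocation shared by all kernels
void run(const float *host_matrix, float *d_out, int n)
{
    float *d_transform;
    cudaMalloc((void **)&d_transform, 9 * sizeof(float));
    cudaMemcpy(d_transform, host_matrix, 9 * sizeof(float),
               cudaMemcpyHostToDevice);
    kernel_a<<<(n + 255) / 256, 256>>>(d_transform, d_out, n);
    // a kernel_b in another file can take the same d_transform pointer
    cudaFree(d_transform);
}
```

The trade-off is that reads go through global memory rather than the constant cache, which is why the reply above questions whether it is worth it for 9 floats.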
Ahh okay. Didn’t know that, thank you!
I actually have a bunch of constants: around three 3x3/4x4 transformation matrices, two 2x1 constant matrices, and one 5x1 constant matrix. I’m working on a Quadro 4600, so I have to comply with compute 1.0, and I’m a bit cramped for shared mem. What do you think would be my best solution for a situation like this? I think I could potentially fit it all in the parameters of the kernel, but that seems kind of messy, and I might have to recalculate some more stuff per kernel.
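For what it’s worth, one way to keep the kernel-parameter route tidy (a sketch with a made-up struct layout) is to bundle all the small matrices into a single struct passed by value. On compute 1.x, kernel arguments are delivered through shared memory and are limited to 256 bytes total; the 3x3 layout below comes to 144 bytes, so it fits:

```cuda
#include <cuda_runtime.h>

// Hypothetical bundle of the constants described above.
struct TransformSet {
    float m[3][9];   // three 3x3 transformation matrices
    float v2[2][2];  // two 2x1 vectors
    float v5[5];     // one 5x1 vector
};  // 144 bytes: within the 256-byte sm_1x kernel-parameter limit

__global__ void process(TransformSet t, const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = t.m[0][0] * in[i] + t.v5[0];  // toy use of the constants
}

// host side: fill the struct once and pass it by value
// TransformSet host_t = { /* ... */ };
// process<<<grid, block>>>(host_t, d_in, d_out, n);
```

This avoids a long flat argument list, and since the parameters sit in shared memory on compute 1.0 it does eat into the 16 KB per block, but only by the size of the struct.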
Thanks once again