Pointer to CUDA memory: using allocated memory in different kernels


I am using two kernels, which are synchronized. The first one allocates the memory on the GPU:

extern "C" void X(…, float **a_d)
{
    size_t MEMsize = width * height * sizeof(float);
    unsigned int ArraySize = width * height;

    // allocate array on device
    cudaMalloc((void **) a_d, MEMsize);
}

This is called from the C code:
X(…, &a_d);

Doing so, I get the address of the device memory in a_d within the C code, and use it when calling the second kernel:

X2(…, a_d);

Within this second kernel I would like to access the data in GPU memory. It seems to work in emulation mode, but switching to CUDA mode, it doesn't work.

So is there something to take care of?
To avoid this situation I could just copy the data again before the second kernel, but that sounds inefficient.

any suggestions?

It doesn't seem, from your code snippets, that you are actually using two kernels - what you described is a host function to allocate memory, and then a kernel call. The sequence of operations looks OK, but - are you checking the cudaMalloc() result for errors (maybe the amount of memory you are requesting there is too big)? If that isn't the problem, could you post a more complete example?
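A minimal error-checking wrapper along these lines might look as follows (a common pattern; the macro name `CUDA_CHECK` is my own invention, not something from the poster's code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Wrap every cuda*() call so a failure is reported immediately,
// instead of silently corrupting later calls.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",         \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage, e.g. inside X():
//   CUDA_CHECK(cudaMalloc((void **) a_d, MEMsize));
//   CUDA_CHECK(cudaMemcpy(*a_d, inputBuffer, MEMsize, cudaMemcpyHostToDevice));
```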

OK, thanks for the fast reply.
I'll try to give you an overview:

c-code header (note: X takes float** so it can return the device pointer, X2 takes float*):
extern "C" void X(…, float** a_d);
extern "C" void X2(…, float* a_d);

X(…, &a_d);
X2(…, a_d);

cuda.cu code:

extern "C" void X2(…, float* a_d)
{
    // execute the kernel
    dim3 block(8, 8, 1);
    dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    X2_kernel<<< grid, block >>>(…, a_d);
}


extern "C" void X(…, float **a_d)
{
    size_t MEMsize = width * height * sizeof(float);
    unsigned int ArraySize = width * height;

    // allocate array on device
    cudaMalloc((void **) a_d, MEMsize);

    // copy data from host to device
    cudaMemcpy(*a_d, inputBuffer, MEMsize, cudaMemcpyHostToDevice);

    // execute kernel
    X_kernel<<< nBlocks, BLOCK_DIM >>>(…, *a_d, ArraySize);

    // copy result back to host
    cudaMemcpy(outputBuffer, *a_d, MEMsize, cudaMemcpyDeviceToHost);
}



__global__ void X_kernel(…, float *a, int const size)
{
    int Xidx = blockIdx.x * blockDim.x + threadIdx.x;

    if (Xidx < size)
        a[Xidx] = a[Xidx] * 0.5f; // for example
}


__global__ void X2_kernel(…, float* a_d)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    float test = a_d[0]; // ACCESSING memory on the GPU, allocated in X()
}


In emulation mode I can set breakpoints and watch the values of the memory pointers in X() and X2(), and see that they are the same.
But it's not working in CUDA mode.

I copy some audio data, and I think that part is working OK, because I get the correct sound output from the first kernel, X().
But the second one isn't working right. The audio data copied is 2 (channels) * 1024 (block length) * sizeof(float);
that shouldn't be too much.

Seems to me again that everything looks OK with regards to "a_d" pointer handling. I'd still check for error returns from all cuda*() calls, and if there are no problems there, then it's probably something within your second kernel code. So to debug it, it would still be good to have a minimal but fully working code sample.

Hmm - the first kernel is called by a VST plugin, and the second one from an OpenGL function, which is initialized by the plugin.

I can't give you a working example. :(

At the moment the whole system is completely frozen when executing in non-emulation mode, and I need to reset.

It's all about this line: float test = a_d[0]; // ACCESSING memory on the GPU, allocated in X()
It works when I don't try to access this.

The rest of the kernel is OK.

So debugging in "CUDA mode" is a problem, because I don't get into the _kernel.cu functions, and I can't see the value of the pointer a_d.

Thank you so much for helping me out here.

I guess you may already have tried this, but: after the sync following the first kernel's completion, can you add some host code to copy the contents of the device memory back to the host, and check whether everything is OK with these values? You could then do another cudaMalloc() and copy the values back to GPU memory - but into this freshly allocated segment - so that when the second kernel runs, you can judge whether the problem is corrupted memory or the kernel code itself. Also, if you're on Linux, or have access to a Linux machine, can you try the CUDA debugger?
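As a sketch of that check (hypothetical host code; it assumes `a_d`, `MEMsize`, and the wrappers from the earlier snippets, and each cuda*() return value should of course be checked):

```cuda
// After the first kernel has completed:
cudaThreadSynchronize();   // CUDA 2.x API; cudaDeviceSynchronize() on newer toolkits

// 1) copy the device buffer back and inspect it on the host
float *check = (float *) malloc(MEMsize);
cudaMemcpy(check, a_d, MEMsize, cudaMemcpyDeviceToHost);
// ... verify check[] holds the expected audio samples ...

// 2) upload the verified data into a freshly allocated buffer and run
//    X2 on that, to separate "corrupted memory" from "broken kernel"
float *a_d2 = NULL;
cudaMalloc((void **) &a_d2, MEMsize);
cudaMemcpy(a_d2, check, MEMsize, cudaMemcpyHostToDevice);
// X2(..., a_d2);

free(check);
```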

I forgot something.

Where should I call cudaFree()?

I tried to put it at the end of X2() in cuda.cu: cudaFree(a_d);

But I get a bunch of exceptions.

First-chance exception at 0x7c812afb in mulch.exe: Microsoft C++ exception: cudaError at memory location 0x0773fd2c…

I already copy back to the host - that's why I guess the first kernel is OK, because that's my audio stream. Please look at my second post; I had forgotten one line in cuda.cu's X() where I copy back to the host.
I will try to reallocate it in the second kernel.

I don't have access to a Linux machine.

I'd say calling cudaFree() after the X2() call completes (or at the end of X2(), although the whole practice of allocating memory in one routine and freeing it in another isn't particularly nice) is the right way to go - but do check that you don't have multiple cudaFree() calls around. Further - I don't have much Windows experience, but it does seem like something about memory corruption. If you skip your second kernel call and cudaFree() works OK in that case, then at least you'll know the second function/kernel is messing something up. Also, if you could get the cudaFree() error number in the case of that exception throw, that would be something to start with…
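For example (a hypothetical host-side sketch, after the X2() wrapper returns; names as in the earlier snippets):

```cuda
X2(..., a_d);                      // second kernel launch
cudaThreadSynchronize();           // wait for the kernel to finish

cudaError_t err = cudaFree(a_d);   // free exactly once, after the last use
if (err != cudaSuccess)
    fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
a_d = NULL;                        // guard against an accidental second cudaFree()
```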

Hmm, I don't know - now I'm doing a new cudaMemcpy in the second kernel's host function, and that works.
Using the pointer from the existing context doesn't work for me :(
(though it does work in emulation mode)

It seems that the pointer a_d gets corrupted.
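A possible explanation (my assumption, not confirmed in the thread): in CUDA 2.x, each host thread gets its own CUDA context, and a device pointer is only valid inside the context that allocated it. If the VST callback and the OpenGL callback run on different host threads, a_d from one context means nothing in the other, which would match "works in emulation mode, freezes in CUDA mode". With the driver API the context can be handed from one thread to another, roughly:

```cuda
#include <cuda.h>

// Thread A (VST): after allocating, detach the context so another thread can use it.
CUcontext ctx;
cuCtxPopCurrent(&ctx);    // release the context from this thread

// Thread B (OpenGL): attach the same context before touching a_d.
cuCtxPushCurrent(ctx);    // now a_d is valid on this thread
// ... launch X2_kernel ...
cuCtxPopCurrent(&ctx);    // release again when done
```

(Later CUDA versions made one context shared across all threads of a process, so this dance is only needed on old toolkits.)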