I am having some unusual memory allocation problems when passing device pointers to and from libraries. My goal is to have externally defined library functions use the device pointers my main application creates, so I can leave the data on the graphics card between calls into various libraries. I’ve included some simplified code that illustrates the problem:
This is the exported library function (host code in the CUDA DLL, not a device kernel):
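[codebox]#include <cuda.h>          // driver API: CUcontext, cuCtxAttach, cuCtxDetach
#include <cuda_runtime.h>  // runtime API: cudaMalloc, cudaMemcpy
#include <cutil_inline.h>  // SDK helper: cutilSafeCall

extern "C" __declspec(dllexport) float* Copy2D_float(float* h_idata, int size_x, int size_y, CUcontext *pctx)
{
    float* d_idata;
    CUresult res;

    // attach to the context handed over by the main app
    res = cuCtxAttach(pctx, 0);

    // size of the buffer in bytes
    const unsigned int mem_size = sizeof(float) * size_x * size_y;

    // allocate device memory and copy the host data into it
    cudaError_t eResult;
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
    eResult = cudaMemcpy( d_idata, h_idata, mem_size, cudaMemcpyHostToDevice);

    cudaThreadExit();

    // detach from the context and hand the device pointer back
    res = cuCtxDetach(*pctx);
    return d_idata;
}[/codebox]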
This function is simply trying to attach to the CUcontext provided by the main app, copy the host data to the device, detach from the CUcontext, and return the device pointer.
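For comparison, here is a minimal sketch of the same copy written only against the driver API, so that the allocation is made in whatever context is already current rather than through the runtime API. It assumes the main app's context is current on the thread that calls into the DLL, and the name Copy2D_float_driver is hypothetical. It is a variant to test against, not a confirmed fix.

[codebox]#include <cuda.h>
#include <stddef.h>

// Hypothetical driver-API-only variant (the name Copy2D_float_driver is an assumption).
// Assumes the caller's context is already current on the calling thread.
// Returns the device address, or 0 on failure.
extern "C" __declspec(dllexport) CUdeviceptr Copy2D_float_driver(const float* h_idata, int size_x, int size_y)
{
    CUdeviceptr d_idata = 0;
    const size_t mem_size = sizeof(float) * (size_t)size_x * (size_t)size_y;

    // Allocate in the current context, i.e. the one the main app created.
    if (cuMemAlloc(&d_idata, mem_size) != CUDA_SUCCESS)
        return 0;

    // Copy the host data into that allocation.
    if (cuMemcpyHtoD(d_idata, h_idata, mem_size) != CUDA_SUCCESS)
    {
        cuMemFree(d_idata);
        return 0;
    }

    // No context teardown here, so the allocation is left alive for the caller.
    return d_idata;
}[/codebox]

On the C# side, the return type of the matching DllImport declaration would need to have the same width as CUdeviceptr in the toolkit being used.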
The main program below is written in C# and uses the publicly available CUDA.NET library, which is a thin wrapper around the CUDA API.
[codebox]
[DllImport(@"External_Library.dll")] // This is just the C# way of calling exported "C" functions
public static extern IntPtr Copy2D_float(IntPtr h_idata, int size_x, int size_y, IntPtr Context);

static unsafe void Main(string[] args)
{
    CUDA cuda = new CUDA(0, true);
    CUdeviceptr d_idata, d_odata;
    CUcontext ctx = cuda.CreateContext(0);

    // allocate host memory
    int size_x = 1024;
    int size_y = 1024;
    float[] h_idata = new float[size_x * size_y];
    float[] h_odata = new float[size_x * size_y];

    // initialize the memory
    for (int i = 0; i < (size_x * size_y); ++i)
    {
        h_idata[i] = (float) i;
    }

    // allocate and copy data to the device (done inside the library)
    fixed (float* pH = h_idata)
    {
        IntPtr ptr = Copy2D_float((IntPtr)pH, size_x, size_y, (IntPtr)ctx.Pointer);
        d_idata.Pointer = (uint)ptr;
    }
    d_odata = cuda.Allocate<float>(new float[size_x * size_y]);

    // copy data back to the host
    cuda.CopyDeviceToHost<float>(d_idata, h_odata);

    cuda.Free(d_idata);
    cuda.Free(d_odata);
    cuda.DestroyContext(ctx);
}[/codebox]
In this example the returned d_idata pointer has the device address 0x03020000, but the main context (even though it was passed to the library call) doesn’t seem to know about this allocation, so d_odata is also allocated at 0x03020000. If I then copy d_odata back to the host, it contains the h_idata values I intended to copy to d_idata, so I know the copy is really happening. Next I switched the code around a bit so the allocation block looks like this:
[codebox]
// allocate device memory
d_odata = cuda.Allocate<float>(new float[size_x * size_y]);
fixed (float* pH = h_idata)
{
    IntPtr ptr = Copy2D_float((IntPtr)pH, size_x, size_y, (IntPtr)ctx.Pointer);
    d_idata.Pointer = (uint)ptr;
}
[/codebox]
Now d_odata is allocated first and again gets the device address 0x03020000, and d_idata then receives the address 0x03420000. The problem is that even though d_idata now has a different address than d_odata, the app crashes on cuda.CopyDeviceToHost<float>(d_idata, h_odata); when I try to copy d_idata back to the host to check that everything works. The next thing I tried is a real hack, but it seems to make things work for some reason:
[codebox]
// allocate device memory
d_odata = cuda.Allocate<float>(new float[size_x * size_y]);
fixed (float* pH = h_idata)
{
    IntPtr ptr = Copy2D_float((IntPtr)pH, size_x, size_y, (IntPtr)ctx.Pointer);
    d_idata.Pointer = (uint)ptr;
}
// junk allocation made by the main app after the library call
CUdeviceptr d_junk = cuda.Allocate<float>(new float[size_x * size_y]);
[/codebox]
So what happens here is that d_odata is allocated at 0x03020000 on the device by the main function and works fine. d_idata is allocated at 0x03420000 by the library. I then allocate a junk device pointer in the main program, which is assigned the address 0x03420000 (the same address as d_idata). Now when I run the app everything works fine. It seems the external library really is copying the data to the device, but the main app can’t use it until it has been “validated” by the main app allocating on top of it.
Just to reiterate, I’m not really trying to create an external library whose only job is to copy data for me, but this demo app has the same issues as my other attempts at calling CUDA functions from external libraries. Thanks.