CUDA memory allocation outside the processing loop

Hello Everybody,
I have a strange problem and I hope I have come to the right place to find some help.
For my project I am trying to use CUDA to do flat-field correction on an incoming sequence of images. In my current setup I have created a CUDA DLL in Visual Studio and I call this DLL from LabVIEW. LabVIEW acquires the images from the frame grabber and sends them to the GPU for processing via the CUDA DLL. Everything works, and the flat-field correction itself is fine. However, my current implementation is not efficient. For the correction I need two constant images, the flat field and the dark field. At the moment, every time the DLL is called I copy the flat field and the dark field to device memory for the calculation. Since these images are constant, is there a way to copy them to device memory once, outside the image acquisition loop, and free the device memory after the acquisition is done? In other words, I want to do a cudaMalloc and a cudaMemcpy of the two constant images outside the acquisition loop, use the device pointers inside the loop, and free them at the end.
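
Roughly, what I have in mind is sketched below. This is only a sketch under assumptions, not my actual code: the exported names ffcInit, ffcProcess and ffcCleanup are made up for illustration, the images are assumed to be 32-bit floats, and the correction is assumed to be the usual (raw - dark) * gain / (flat - dark). It also assumes the DLL stays loaded in the same process between calls, so that the static device pointers remain valid.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#ifdef _WIN32
#define FFC_API extern "C" __declspec(dllexport)
#else
#define FFC_API extern "C"
#endif

// Device pointers kept alive between DLL calls (hypothetical names).
static float *d_flat = nullptr, *d_dark = nullptr, *d_img = nullptr;
static size_t g_count = 0;  // number of pixels per image

// Assumed correction: out = (raw - dark) * gain / (flat - dark).
__global__ void ffcKernel(float *img, const float *flat, const float *dark,
                          float gain, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float denom = flat[i] - dark[i];
        img[i] = (denom != 0.0f) ? (img[i] - dark[i]) * gain / denom : 0.0f;
    }
}

// Call once after the acquisition loop: frees everything.
FFC_API void ffcCleanup(void)
{
    cudaFree(d_flat);
    cudaFree(d_dark);
    cudaFree(d_img);
    d_flat = d_dark = d_img = nullptr;
    g_count = 0;
}

// Call once before the acquisition loop: allocates the buffers and
// uploads the two constant images. Returns 0 on success.
FFC_API int ffcInit(const float *flat, const float *dark, int width, int height)
{
    ffcCleanup();
    size_t count = (size_t)width * (size_t)height;
    size_t bytes = count * sizeof(float);
    if (cudaMalloc((void **)&d_flat, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_dark, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_img, bytes) != cudaSuccess ||
        cudaMemcpy(d_flat, flat, bytes, cudaMemcpyHostToDevice) != cudaSuccess ||
        cudaMemcpy(d_dark, dark, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        ffcCleanup();
        return -1;
    }
    g_count = count;
    return 0;
}

// Call inside the loop: only the new frame crosses the bus, in place.
FFC_API int ffcProcess(float *image, float gain)
{
    if (d_img == nullptr) return -1;  // ffcInit was not called
    size_t bytes = g_count * sizeof(float);
    if (cudaMemcpy(d_img, image, bytes, cudaMemcpyHostToDevice) != cudaSuccess)
        return -2;
    const int threads = 256;
    const int blocks = (int)((g_count + threads - 1) / threads);
    ffcKernel<<<blocks, threads>>>(d_img, d_flat, d_dark, gain, g_count);
    if (cudaGetLastError() != cudaSuccess) return -3;
    if (cudaMemcpy(image, d_img, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        return -4;
    return 0;
}
```

From LabVIEW the idea would be to call ffcInit once before the loop, ffcProcess for every frame, and ffcCleanup once after the loop.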


Here’s an idea that might not be efficient or easy to code, but it’s the first that comes to mind:

When your DLL is loaded for the first time, have it spawn a new process (on your CPU) that allocates the memory and stores the images. Then you can use some form of interprocess communication, such as a socket, to request the pointers to the images. When you are all done, be sure to kill the process.

This is a roundabout method and, depending on your experience with OS programming, it may be too difficult.

I’ll think about it more. There’s probably an easier way.

Thank you for your response. I will read about it and see how it goes. I am an electrical engineer with very little, if any, experience in OS programming, but I will give it a shot…

I also had one more question. Each time the DLL is called in the loop, I allocate around 6 buffers on the CUDA device using cudaMalloc. As a test, I put just the 6 cudaMalloc calls and the corresponding cudaFree calls in the DLL, nothing else. This alone takes around 14-15 ms. Is this how long it is supposed to take to allocate memory on the device? I read on this forum that the initial cudaMalloc takes time because the device has to be initialized. So am I correct in assuming that the device gets initialized every time the DLL is called in the loop?
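
To check this, a timing test along the following lines might help. It is only a sketch: timeAllocRound and reportAllocTimings are names made up for illustration, and the buffer size passed in is whatever matches the real images. If the device were being initialized on every call, every round should be slow; if only the first round is slow, the remaining per-call cost is mostly the allocations themselves.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstddef>

// Times one round of n cudaMalloc calls followed by n cudaFree calls.
// Returns the elapsed wall-clock time in milliseconds.
static double timeAllocRound(size_t bytes, int n)
{
    void *p[16] = {nullptr};
    if (n > 16) n = 16;  // the fixed-size pointer array limits the count
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) cudaMalloc(&p[i], bytes);
    for (int i = 0; i < n; ++i) cudaFree(p[i]);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Prints the time of several consecutive rounds so the first round
// (which includes any one-time setup) can be compared with later ones.
static void reportAllocTimings(size_t bytes, int rounds)
{
    for (int r = 0; r < rounds; ++r)
        printf("round %d: %.3f ms\n", r, timeAllocRound(bytes, 6));
}
```

Calling reportAllocTimings(width * height * sizeof(float), 5) from a small test program, or once from the DLL, would show how the first round compares with the following ones.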



I am also new to CUDA and LabVIEW. Can you please help me by sending the .cu file that you use in Visual Studio to build the DLL for LabVIEW? I am really having trouble creating a DLL that can be imported into LabVIEW. Please help.