I've just started working with CUDA. I'm accessing it through CUDA.NET (http://www.gass-ltd.co.il/en/products/cuda.net/) and have started on a framework to manipulate images. However, I don't know what transfer times I should expect, so I can't tell whether my numbers are good or bad.
I am copying a byte[321,481] array from the host (Vista 64-bit) to the GT9800, which takes 26, 14 and 24 ms over three runs.
As far as I can see, this is:
//Number of bytes transferred:
bytes = 321*481
mean_ms = (26+14+24)/3
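Carrying that arithmetic through to an effective bandwidth figure, here is a minimal C++ sketch. The function names are made up for illustration, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Mean of three timings, in milliseconds.
double mean_of_three(double a, double b, double c) {
    return (a + b + c) / 3.0;
}

// Effective bandwidth in MB/s (1 MB = 1e6 bytes) for one transfer
// of `bytes` bytes that took `mean_ms` milliseconds on average.
double effective_mb_per_s(std::size_t bytes, double mean_ms) {
    assert(mean_ms > 0.0);
    return (static_cast<double>(bytes) / 1e6) / (mean_ms / 1e3);
}
```

With the numbers above, 321*481 = 154401 bytes and a mean of about 21.3 ms, so effective_mb_per_s(321 * 481, mean_of_three(26, 14, 24)) comes out at roughly 7.24 MB/s.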
Have you tried copying anything larger (just for testing purposes)? Since you’re only copying ~150K, I suspect that a large part of the ‘transfer time’ you measured is actually due to the interop overhead (GASS’s CUDA.NET library is basically just a wrapper for handling the interop with the CUDA driver methods).
If you try transferring something large (say, 100MB) and you’re still getting really low bandwidth like that, then you might have something to be worried about.
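One way to separate the fixed per-call overhead from the actual transfer rate is to time a small and a large copy and fit a straight line through the two points. The sketch below assumes the simple model time = overhead + bytes / rate, which is only an approximation. The names mean_copy_ms, CostModel and fit_cost_model are hypothetical, and `copy` stands for whatever call performs your host-to-device copy.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Mean wall-clock time in milliseconds of `runs` calls to copy(bytes).
// `copy` is a placeholder for the real host-to-device transfer call.
template <typename CopyFn>
double mean_copy_ms(CopyFn copy, std::size_t bytes, int runs) {
    assert(runs > 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) copy(bytes);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}

struct CostModel {
    double overhead_ms;   // fixed cost per call (interop, driver, launch)
    double bytes_per_ms;  // sustained transfer rate
};

// Two-point fit of the assumed linear model: time = overhead + bytes / rate.
CostModel fit_cost_model(double small_bytes, double small_ms,
                         double large_bytes, double large_ms) {
    assert(large_bytes > small_bytes && large_ms > small_ms);
    double rate = (large_bytes - small_bytes) / (large_ms - small_ms);
    return {small_ms - small_bytes / rate, rate};
}
```

If the fitted overhead_ms is close to the time of the small copy on its own, the small-transfer number is telling you about call overhead, not about bus bandwidth.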
The way I see it, there are four options for you here.
1. Copy a bunch of pictures at one time. In .NET, you can serialize the image to an RGBA byte array, so if you are pulling in a bunch of images for processing via CUDA, you would just allocate a very large byte array (again, like 100MB, or whatever batch size you want to use), copy all of the image data in there (sort of like a packed struct), and just “remember” (i.e. record in a list or array) what the array index for each image is. Then you can pass the image data array and the ‘array index’ array to the device, where you will be able to process multiple images in parallel.
2. Do some serious image processing on the image data you’re passing in (FFTs, other convolutions, reductions, etc.). If your processing kernel takes 500 ms to run on the device for one image (doubtful it would be this long, but you get the idea), then the interop overhead time isn’t going to make a big difference. If the overhead does still matter, write some kind of wrapper DLL that reads your images in from memory/disk, passes them to the kernel, and gets the results back, then just use a C# GUI to control that DLL via interop (since you won’t get the interop overhead when calling CUDA from natively compiled C++).
3. Use the streaming methods in CUDA. I honestly don’t know how much (if any) they will ‘hide’ the interop overhead, but it might be worth a shot if #1 and #2 don’t work for you.
4. Don’t use CUDA. If you’re only processing a single small image here and there, it’s probably not worth the time it would take to port to CUDA, since you won’t really be using any of the GPU’s power. You’ll also probably get more of a time penalty from the interop overhead per image than any speedup you would gain from moving the calculations out of pure C# on the CPU.
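To make the first option concrete, here is a rough host-side sketch in C++ of packing several images into one buffer plus an offset table. PackedBatch and pack_images are hypothetical names, and the images are assumed to already be flat byte arrays. The same layout works from C# with a byte[] and an int[]. The actual copy to the device is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct PackedBatch {
    std::vector<std::uint8_t> data;     // all image bytes, back to back
    std::vector<std::size_t>  offsets;  // offsets[i] = start of image i;
                                        // final entry = total byte count
};

// Concatenates the images and records where each one starts, so that a
// single large transfer can replace many small ones.
PackedBatch pack_images(const std::vector<std::vector<std::uint8_t>>& images) {
    PackedBatch batch;
    std::size_t total = 0;
    for (const auto& img : images) total += img.size();
    batch.data.reserve(total);
    batch.offsets.reserve(images.size() + 1);
    for (const auto& img : images) {
        batch.offsets.push_back(batch.data.size());
        batch.data.insert(batch.data.end(), img.begin(), img.end());
    }
    batch.offsets.push_back(batch.data.size());  // end sentinel
    return batch;
}
```

On the device, image i then occupies bytes offsets[i] up to (but not including) offsets[i + 1], so each block or thread group can find its own image from the offset array alone.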
I’m going to use 200 images (15 MB) together with Genetic Programming, so there will be a lot of calculations. I think your solution of copying all the images to the device is a good idea. So… now I’ve just got to figure out the separable convolution example :)