I want to try to port one of my algorithms to CUDA. I have never programmed with CUDA before and was wondering if someone could point me in the right direction on a few queries that I have:
I have to load some 3D volumes. I have access to a GTX 285 card, so memory should not be an issue. What is the best way to load 3D image data for processing?
I need to do the following:
Access blocks of 3D data from the images and do some processing.
I will not write anything into them; I will only read blocks of data from this memory.
I was going through the Dr. Dobb's CUDA tutorial and got very confused by all the different memory types. Should I load this data into constant, shared, or texture memory? Given that I will access the data frequently and only need read-only access, what do you recommend?
There is only 64 KB of constant memory, so I doubt your data will fit in there. It is rarely the fastest option anyway, unless every thread needs to access the same data at a given time.
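For what it's worth, here is a tiny sketch of what using constant memory looks like (the names here are made up for illustration):

[codebox]
// Illustrative only: constant memory works best when all threads of a
// warp read the same address in the same instruction (e.g. filter
// coefficients), not for bulk image data.
__constant__ float d_coeffs[256];   // lives in the 64 KB constant space

void uploadCoeffs(const float* h_coeffs)
{
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, 256 * sizeof(float));
}
[/codebox]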
You cannot load data directly into shared memory from system memory; it first has to go through global memory. In global memory, data can be arranged as a texture (1D, 2D, or 3D) or as a "flat" array. Read up on the benefits of texture memory in the programming guide. Without knowing more about your problem, I'd say it's a good candidate.
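If you go the texture route, the setup looks roughly like this with the texture-reference API used in the SDK samples of that era (a hedged sketch; the volume name and dimensions are my assumptions):

[codebox]
// Sketch: upload a float volume of size nx*ny*nz into a 3D cudaArray
// and bind it to a texture reference for read-only, cached access.
texture<float, 3, cudaReadModeElementType> volTex;  // file-scope texture reference

void uploadVolume(const float* h_vol, int nx, int ny, int nz)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);

    cudaArray* d_array;
    cudaMalloc3DArray(&d_array, &desc, extent);

    cudaMemcpy3DParms copy = {0};
    copy.srcPtr   = make_cudaPitchedPtr((void*)h_vol, nx * sizeof(float), nx, ny);
    copy.dstArray = d_array;
    copy.extent   = extent;
    copy.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    cudaBindTextureToArray(volTex, d_array, desc);
    // Inside a kernel, read with: float v = tex3D(volTex, x, y, z);
}
[/codebox]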
Shared memory could also work well, depending on the problem. There is only 16 KB of it per multiprocessor, but it can act as a really quick buffer close to the processors if you can find a good use for it. So if you can do your processing on "chunks" of the image in a parallel way, i.e. many threads need access to the same 16 KB of image data at the same time, shared memory should be used somehow.
Thanks for the reply, Alilleur. Really appreciate it.
My blocks need to be accessed by only one thread at a time. So, I do not really need multiple threads accessing the same data. My problem is as follows:
Load two 3D images into memory.
Start a series of loops that read a single block of memory from each image.
Do some statistical calculations.
What I am hoping is that I can have multiple threads on the GPU reading the blocks and processing them at the same time. Currently, my CPU loops look as follows:
[codebox]
for (...)
{
    for (...)
    {
        for (...)
        {
            // read the block from the first image
            for (...)
            {
                for (...)
                {
                    for (...)
                    {
                        // read the block from the second image
                    }
                }
            }
        }
    }
}
[/codebox]
So, I am hoping this could be made really parallel on the GPU, but I am struggling to figure out exactly how!
Install the SDK and look at the simpleTexture3D example.
Also, from your CPU code, it looks like you could benefit a lot from shared memory.
The inner triple loop goes over the whole image again (as far as I can tell), so you basically have an n-pair problem where every pixel has some correlation with every other pixel.
Shared memory can help you reduce the number of times you fetch each pixel from global memory. If you have a 128-thread block size (or whatever), then you only need to load a given pixel once from global memory and store it in shared memory to make it available to all 128 threads of that block.
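In code, that pattern looks something like this (a sketch with made-up names; it assumes n is a multiple of the block size, to keep it short):

[codebox]
#define BLOCK 128

// Each of the 128 threads fetches one pixel from global memory into
// shared memory; after the barrier, all 128 threads can read all 128
// pixels without touching global memory again.
__global__ void processChunk(const float* img, float* out, int n)
{
    __shared__ float tile[BLOCK];

    int i = blockIdx.x * BLOCK + threadIdx.x;
    tile[threadIdx.x] = img[i];    // one global read per pixel
    __syncthreads();               // the whole tile is now in shared memory

    float acc = 0.0f;
    for (int j = 0; j < BLOCK; ++j)   // placeholder processing
        acc += tile[j];
    out[i] = acc;
}
[/codebox]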
If you are reading the image "in order", that is, going through all the x's, then all the y's, then all the z's, your memory operations will most likely be coalesced (see the programming guide if this word is new to you), in which case you are better off not using textures.
So in essence, your outer triple loop will be handled by the hardware, i.e. you will launch one thread per pixel. Your inner triple loop will be done inside the kernel.
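As a rough skeleton (names and layout are my assumptions; since the grid on a GTX 285 is 2D, z is handled by a loop inside the kernel):

[codebox]
// One thread per (x, y) pixel; the z dimension and the inner triple
// loop are handled inside the kernel. Assumed row-major layout:
// index = (z * ny + y) * nx + x, so threads consecutive in x read
// consecutive addresses (coalesced).
__global__ void compareBlocks(const float* img1, const float* img2,
                              float* result, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    for (int z = 0; z < nz; ++z)
    {
        float v1 = img1[(z * ny + y) * nx + x];
        // ... inner triple loop over the neighbourhood in img2 goes
        // here, accumulating your statistic against v1 ...
        result[(z * ny + y) * nx + x] = v1;   // placeholder
    }
}

// Host side:
//   dim3 block(16, 16);
//   dim3 grid((nx + 15) / 16, (ny + 15) / 16);
//   compareBlocks<<<grid, block>>>(d_img1, d_img2, d_result, nx, ny, nz);
[/codebox]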
That's how I would do it at first, anyway.
However, didn't you say that shared memory is limited and that I cannot move data directly into shared memory? So the image initially needs to be read into global (texture) memory, right?
Also, what is happening in my loop is as follows:
In the outer loops, I load a selected chunk of data from one image.
In the inner loops, a chunk of data in the given neighborhood is loaded from the second image. What is happening is that a number of blocks of data are loaded from the second image and compared to the first block from the first image.
So, do you think I would benefit from using shared memory in this scenario as well?
Tough for me to say without knowing the problem completely, since it will depend on the size of the chunks and your definition of "neighbourhood".
I am assuming that the neighbourhood at pixel (i,j) is pretty similar to the neighbourhood at pixel (i+1,j), so you would only need to load an "enlarged" neighbourhood once from global to shared memory.
What I said is that you cannot move data directly from your system RAM to shared memory, i.e. you cannot do a cudaMemcpy to shared memory.
What you can do is copy from global memory to shared memory inside your kernel.
The goal here is to reduce the number of memory transactions going from the processors to global memory, and that is what we are using shared memory for. If you load that "super" neighbourhood only once from global memory to shared memory, a whole block of pixels can reuse it instead of each pixel fetching the overlapping data again.
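Something along these lines, shown in 2D to keep it short (TILE and RADIUS are placeholders for whatever your chunk and neighbourhood sizes turn out to be):

[codebox]
#define TILE   16
#define RADIUS 2

// A 16x16 thread block stages its tile plus a RADIUS-wide halo (the
// "super" neighbourhood) into shared memory once; every neighbourhood
// read after the barrier then hits shared memory, not global memory.
__global__ void neighbourhoodStats(const float* img2, float* out,
                                   int nx, int ny)
{
    __shared__ float cache[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Cooperatively load the enlarged tile, clamping at the borders.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE)
        {
            int gx = (int)(blockIdx.x * TILE) + dx - RADIUS;
            int gy = (int)(blockIdx.y * TILE) + dy - RADIUS;
            gx = min(max(gx, 0), nx - 1);
            gy = min(max(gy, 0), ny - 1);
            cache[dy][dx] = img2[gy * nx + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= nx || y >= ny) return;

    // All neighbourhood reads now come from shared memory.
    float acc = 0.0f;
    for (int j = -RADIUS; j <= RADIUS; ++j)
        for (int i = -RADIUS; i <= RADIUS; ++i)
            acc += cache[threadIdx.y + RADIUS + j][threadIdx.x + RADIUS + i];
    out[y * nx + x] = acc;
}
[/codebox]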
And yes, shared memory is limited to 16 KB (see the appendix of the programming guide). So, as I've said, it will depend on your definition of neighbourhood.
What I would suggest, since this is most likely your first CUDA implementation, is to start by forgetting everything I've said about shared memory. Do it all from global memory first, just to get things working. Then, once you are getting the results you're looking for, move the redundant memory transactions to shared memory. It will be a lot easier to debug your very first CUDA application if you leave shared memory out of it at the start.