I want to try to port one of my algorithms to CUDA. I have never programmed with CUDA before and was wondering if someone could point me in the right direction on a few queries that I have:
I have to load some 3D volumes. I have access to a GTX 285 card, so memory should not be an issue. What is the best way to load 3D image data for processing?
I need to do the following:
Access blocks of 3D data from the images and do some processing.
I will not write anything into them; I will only read blocks of data from this memory.
I was going through the Dr. Dobb's CUDA tutorial and got very confused by all the different memory types. Should I load this data into constant, shared, or texture memory? Given that I will access the data frequently and only need read-only access, what do you recommend?
There is only 64 KB of constant memory, so I doubt your data will fit in there. It is rarely the fastest option anyway, unless every thread needs to access the same data at a given time.
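For what it's worth, here is a tiny sketch of what using constant memory looks like (the names here are made up for illustration):

[codebox]
// Illustrative only: constant memory works best when all threads of a
// warp read the same address in the same instruction (e.g. filter
// coefficients), not for bulk image data.
__constant__ float d_coeffs[256];   // lives in the 64 KB constant space

void uploadCoeffs(const float* h_coeffs)
{
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, 256 * sizeof(float));
}
[/codebox]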
You cannot load data directly into shared memory from system memory; it first has to go through global memory. In global memory, data can be arranged as a texture (1D, 2D, or 3D) or as a "flat" array. Read up on the benefits of texture memory in the programming guide. Without knowing more about your problem, I'd say it's a good candidate.
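If you go the texture route, the setup looks roughly like this with the texture-reference API used in the SDK samples of that era (a hedged sketch; the volume name and dimensions are my assumptions):

[codebox]
// Sketch: upload a float volume of size nx*ny*nz into a 3D cudaArray
// and bind it to a texture reference for read-only, cached access.
texture<float, 3, cudaReadModeElementType> volTex;  // file-scope texture reference

void uploadVolume(const float* h_vol, int nx, int ny, int nz)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);

    cudaArray* d_array;
    cudaMalloc3DArray(&d_array, &desc, extent);

    cudaMemcpy3DParms copy = {0};
    copy.srcPtr   = make_cudaPitchedPtr((void*)h_vol, nx * sizeof(float), nx, ny);
    copy.dstArray = d_array;
    copy.extent   = extent;
    copy.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    cudaBindTextureToArray(volTex, d_array, desc);
    // Inside a kernel, read with: float v = tex3D(volTex, x, y, z);
}
[/codebox]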
Shared memory could also work well, depending on the problem. There is only 16 KB of it per multiprocessor, but it can act as a really quick buffer close to the processors if you can find a good use for it. So if you can do your processing on "chunks" of the image in a parallel way, i.e. many threads need access to the same 16 KB of image data at the same time, shared memory should be used somehow.
Thanks for the reply, Alilleur. Really appreciate it.
My blocks need to be accessed by only one thread at a time. So, I do not really need multiple threads accessing the same data. My problem is as follows:
Load two 3D images into memory.
Start a series of loops that read a single block of memory from each image.
Do some statistical calculations.
What I am hoping is that I can have multiple threads on the GPU reading the blocks and processing them at the same time. Currently, my CPU loops look as follows:
[codebox]
for (...)
{
    for (...)
    {
        for (...)
        {
            // read the block from the first image
            for (...)
            {
                for (...)
                {
                    for (...)
                    {
                        // read the block from the second image
                    }
                }
            }
        }
    }
}
[/codebox]
So, I am hoping this could be made really parallel on the GPU, but I am struggling to figure out exactly how!
Install the SDK and look at the simpleTexture3D example.
Also, from your CPU code, it looks like you could benefit a lot from shared memory.
The inner triple loop goes over the whole image again (as far as I can tell), so you basically have an n-pair problem where every pixel has some correlation with every other pixel.
Shared memory can help you reduce the number of times you fetch each pixel from global memory. If you have a 128-thread block size (or whatever), then you only need to load a given pixel once from global memory and store it in shared memory to make it available to all 128 threads of that block.
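In code, that pattern looks something like this (a sketch with made-up names; it assumes n is a multiple of the block size, to keep it short):

[codebox]
#define BLOCK 128

// Each of the 128 threads fetches one pixel from global memory into
// shared memory; after the barrier, all 128 threads can read all 128
// pixels without touching global memory again.
__global__ void processChunk(const float* img, float* out, int n)
{
    __shared__ float tile[BLOCK];

    int i = blockIdx.x * BLOCK + threadIdx.x;
    tile[threadIdx.x] = img[i];    // one global read per pixel
    __syncthreads();               // the whole tile is now in shared memory

    float acc = 0.0f;
    for (int j = 0; j < BLOCK; ++j)   // placeholder processing
        acc += tile[j];
    out[i] = acc;
}
[/codebox]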
If you are reading the image "in order", that is, going through all the x's, then all the y's, then all the z's, your memory operations will most likely be coalesced (see the programming guide if this word is new to you), in which case you are better off not using textures.
So in essence, your outer triple loop will be handled by the hardware, i.e. you will launch one thread per pixel. Your inner triple loop will be done inside the kernel.
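As a rough skeleton (names and layout are my assumptions; since the grid on a GTX 285 is 2D, z is handled by a loop inside the kernel):

[codebox]
// One thread per (x, y) pixel; the z dimension and the inner triple
// loop are handled inside the kernel. Assumed row-major layout:
// index = (z * ny + y) * nx + x, so threads consecutive in x read
// consecutive addresses (coalesced).
__global__ void compareBlocks(const float* img1, const float* img2,
                              float* result, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    for (int z = 0; z < nz; ++z)
    {
        float v1 = img1[(z * ny + y) * nx + x];
        // ... inner triple loop over the neighbourhood in img2 goes
        // here, accumulating your statistic against v1 ...
        result[(z * ny + y) * nx + x] = v1;   // placeholder
    }
}

// Host side:
//   dim3 block(16, 16);
//   dim3 grid((nx + 15) / 16, (ny + 15) / 16);
//   compareBlocks<<<grid, block>>>(d_img1, d_img2, d_result, nx, ny, nz);
[/codebox]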
That's how I would do it at first, anyway.
However, didn't you say that shared memory is limited and that I cannot move data directly into shared memory? So the image initially needs to be read into global (texture) memory, right?
Also, what is happening in my loop is as follows:
In the outer loops, I load a selected chunk of data from one image.
In the inner loops, a chunk of data in the given neighborhood is loaded from the second image. What is happening is that a number of blocks of data are loaded from the second image and compared to the first block from the first image.
So, do you think I would benefit from using shared memory in this scenario as well?
Tough for me to say without knowing the problem completely, since it will depend on the size of the chunks and your definition of "neighbourhood".
I am assuming that the neighbourhood at pixel (i,j) is pretty similar to the neighbourhood at pixel (i+1,j), so you would only need to load an "enlarged" neighbourhood once from global to shared memory.
What I said is that you cannot move data directly from your system RAM to shared memory, i.e. you cannot do a cudaMemcpy to shared memory.
What you can do is copy from global memory to shared memory inside your kernel.
The goal here is to reduce the number of memory transactions going from the processors to global memory, and that is what we are using shared memory for. If you load that "super" neighbourhood only once from global memory to shared memory, a whole block of pixels can reuse it instead of each pixel fetching the overlapping data again.
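Something along these lines, shown in 2D to keep it short (TILE and RADIUS are placeholders for whatever your chunk and neighbourhood sizes turn out to be):

[codebox]
#define TILE   16
#define RADIUS 2

// A 16x16 thread block stages its tile plus a RADIUS-wide halo (the
// "super" neighbourhood) into shared memory once; every neighbourhood
// read after the barrier then hits shared memory, not global memory.
__global__ void neighbourhoodStats(const float* img2, float* out,
                                   int nx, int ny)
{
    __shared__ float cache[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Cooperatively load the enlarged tile, clamping at the borders.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE)
        {
            int gx = (int)(blockIdx.x * TILE) + dx - RADIUS;
            int gy = (int)(blockIdx.y * TILE) + dy - RADIUS;
            gx = min(max(gx, 0), nx - 1);
            gy = min(max(gy, 0), ny - 1);
            cache[dy][dx] = img2[gy * nx + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= nx || y >= ny) return;

    // All neighbourhood reads now come from shared memory.
    float acc = 0.0f;
    for (int j = -RADIUS; j <= RADIUS; ++j)
        for (int i = -RADIUS; i <= RADIUS; ++i)
            acc += cache[threadIdx.y + RADIUS + j][threadIdx.x + RADIUS + i];
    out[y * nx + x] = acc;
}
[/codebox]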
And yes, shared memory is limited to 16 KB (see the appendix of the programming guide). So, as I've said, it will depend on your definition of neighbourhood.
What I would suggest, since this is most likely your first CUDA implementation, is to start by forgetting everything I've said about shared memory. Do it all from global memory first, just to get things working. Then, once you are getting the results you're looking for, move the redundant memory transactions to shared memory. It will be a lot easier to debug your very first CUDA application if you leave shared memory out of it at the start.