Huge data structures

francy300485 · January 30, 2012, 6:28pm

Hello everybody,
this is my problem:

I have a 3D image stored in global memory as a 1D array, and I would like to find the spherical neighborhood of each voxel to compute some statistics inside it. I would like to assign one thread to each voxel, but if the image and the sphere radius are big I cannot use a single thread grid to do this, since I’m limited by the maximum grid dimensions. I tried splitting the operation across several kernel launches, but that is really time consuming.
Any ideas?
Thanks in advance.

short · January 30, 2012, 6:33pm

You could loop inside your kernel, so that each block does more work!

RezaRob3 · January 30, 2012, 9:27pm

Yes, as ‘short’ said, get each thread to process more than one voxel. You should only need enough threads to keep your compute units busy and hide latency. On Fermi, each compute unit (multiprocessor) can have at most 1536 resident threads.
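One common way to let each thread handle several voxels is a grid-stride loop, where the grid size is chosen from the hardware rather than from the image size. The following is only a sketch: `processVoxel`, `perVoxelKernel`, `launchPerVoxel`, the block size of 256, and the factor of 4 blocks per multiprocessor are all hypothetical choices, not values from this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-voxel operation; a placeholder for the real statistic.
__device__ float processVoxel(const float* image, size_t idx) {
    return 2.0f * image[idx];
}

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid until the whole image is covered.
__global__ void perVoxelKernel(const float* image, float* out, size_t numVoxels) {
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < numVoxels; i += stride) {
        out[i] = processVoxel(image, i);
    }
}

// Host launcher: the grid is sized from the multiprocessor count, so it stays
// within the grid limits no matter how large the image is.
cudaError_t launchPerVoxel(const float* dImage, float* dOut, size_t numVoxels) {
    int device = 0;
    int numSMs = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;

    const int threadsPerBlock = 256;  // assumed value, tune per device
    const int blocksPerSM = 4;        // assumed value, tune per kernel
    perVoxelKernel<<<numSMs * blocksPerSM, threadsPerBlock>>>(dImage, dOut, numVoxels);
    return cudaGetLastError();
}
```

Because the loop bound is `numVoxels`, the same launch configuration works for any image size; only the number of iterations per thread changes.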

When you say “time consuming” do you mean programming time or GPU runtime, just out of curiosity?

francy300485 · January 31, 2012, 6:04pm

Hi RezaRob, Hi Short, thank you for your answers.

Actually, that is what I did: I use a loop to visit the neighborhood, meaning that I work in parallel over the image domain but compute the statistics inside each neighborhood serially. It works well, but I would like to work completely in parallel. (solution 1)

Following your suggestions, I also tried to work in parallel within each neighborhood, splitting the image domain and using a for loop over the sub-images, instead of splitting the domain and launching many kernels, but this solution is still too slow compared with solution 1.

Anyway, I will try to think of a better association between threads and workload… :-) If you have any ideas, I would appreciate them ;-).
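For reference, "solution 1" as described (parallel over voxels, serial inside each neighborhood) might look roughly like the sketch below. The statistic (a plain mean), the names `sphereMean` and `sphereMeanKernel`, and the memory layout `idx = x + nx * (y + ny * z)` are assumptions for illustration; the thread does not say which statistics or layout are actually used.

```cuda
#include <cuda_runtime.h>

// Mean intensity inside a sphere of `radius` voxels centered on (x, y, z).
// Assumed layout: idx = x + nx * (y + ny * z). Neighbors outside the image
// are skipped. The center must lie inside the image, so count is at least 1.
__host__ __device__ float sphereMean(const float* image, int nx, int ny, int nz,
                                     int x, int y, int z, int radius) {
    float sum = 0.0f;
    int count = 0;
    const int r2 = radius * radius;
    for (int dz = -radius; dz <= radius; ++dz) {
        const int zz = z + dz;
        if (zz < 0 || zz >= nz) continue;
        for (int dy = -radius; dy <= radius; ++dy) {
            const int yy = y + dy;
            if (yy < 0 || yy >= ny) continue;
            for (int dx = -radius; dx <= radius; ++dx) {
                const int xx = x + dx;
                if (xx < 0 || xx >= nx) continue;
                if (dx * dx + dy * dy + dz * dz > r2) continue;  // outside the sphere
                sum += image[(size_t)xx + (size_t)nx * ((size_t)yy + (size_t)ny * zz)];
                ++count;
            }
        }
    }
    return sum / count;
}

// Threads are spread over the voxels; each thread walks its neighborhoods serially.
__global__ void sphereMeanKernel(const float* image, float* out,
                                 int nx, int ny, int nz, int radius) {
    const size_t numVoxels = (size_t)nx * ny * nz;
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < numVoxels; i += stride) {
        const int x = (int)(i % nx);
        const int y = (int)((i / nx) % ny);
        const int z = (int)(i / ((size_t)nx * ny));
        out[i] = sphereMean(image, nx, ny, nz, x, y, z, radius);
    }
}
```

The work per thread grows roughly with the cube of the radius, which is probably why large spheres make this approach slow.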

@Reza: by “time consuming” I mean GPU execution time. Just a question: how can I have 1536 threads if the maximum number of threads per block is 1024 on a Fermi architecture with compute capability 2.0?

RezaRob3 · January 31, 2012, 6:46pm

Use more blocks.
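The arithmetic behind this can be sketched as follows. The two limits are the compute capability 2.x values (1536 resident threads and 8 resident blocks per multiprocessor); the function name `residentThreadsPerSM` is made up for this example, and the estimate deliberately ignores register and shared-memory limits, which can lower the result further.

```cuda
// Per-multiprocessor limits for compute capability 2.x (Fermi).
const int kMaxThreadsPerSM = 1536;
const int kMaxBlocksPerSM = 8;

// Upper bound on resident threads per multiprocessor for a given block size,
// considering only the thread and block limits above.
__host__ int residentThreadsPerSM(int threadsPerBlock) {
    int blocks = kMaxThreadsPerSM / threadsPerBlock;  // whole blocks only
    if (blocks > kMaxBlocksPerSM) blocks = kMaxBlocksPerSM;
    return blocks * threadsPerBlock;
}
```

With 1024 threads per block only one block fits, so 1024 threads are resident; with 512 or 768 threads per block, three or two blocks fit and all 1536 slots are used. Very small blocks hit the block limit instead: 128 threads per block gives 8 × 128 = 1024. Newer CUDA toolkits can report the real figure for a specific kernel through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.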

francy300485 · January 31, 2012, 6:53pm

RezaRob3 wrote:

Use more blocks: 2 to 8 will do. However, if you are using atomics (atomicAdd() etc.) within each block, in such a way that several active warps (within one block) have to wait for one warp to release some (atomically) locked variable, then more blocks may be better. The different blocks will be independent and will access different “copies/instances” of the atomic variable in (let’s say) shared memory. That way, you can have 8 active warps, in 8 different active blocks, all doing some atomic operation, while all the other warps within those blocks might be waiting for some lock to be released. That is much better for latency hiding than just two active (lock-owning) warps within two active blocks. It all depends on what exactly you’re doing, and I don’t really know the details.

The benefit of using ONLY one (active) block (1024 threads) is that the block can own the entire shared memory. More active blocks means you must split the shared memory between them.

You might want to read the programming guide and some of the samples more thoroughly to better understand what’s going on.

Thanks. Yes, now I understand what you meant. Actually, I’m using more than one block, of course, and I’m doing some reductions inside the kernel, splitting the shared memory between blocks.

RezaRob3 · January 31, 2012, 11:49pm

What you quoted I have already edited out, because I realized it was BS! However, there are technical reasons why smaller block sizes are sometimes necessary or convenient: things like __syncthreads() operate on entire blocks of threads, so block size matters.
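To illustrate why block size matters for `__syncthreads()` and shared memory, here is a minimal sketch of a per-block sum reduction. It is not the reduction used in this thread, which is not shown; the kernel name `blockSum` is hypothetical, and the code assumes `blockDim.x` is a power of two.

```cuda
#include <cuda_runtime.h>

// Each block accumulates a strided slice of `in`, reduces it in shared memory,
// and writes one partial sum to partial[blockIdx.x].
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void blockSum(const float* in, float* partial, size_t n) {
    extern __shared__ float scratch[];

    float acc = 0.0f;
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        acc += in[i];
    }
    scratch[threadIdx.x] = acc;
    __syncthreads();  // every thread of the block must have written its value

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            scratch[threadIdx.x] += scratch[threadIdx.x + s];
        }
        __syncthreads();  // barrier spans the whole block, active or not
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = scratch[0];
}
```

A launch such as `blockSum<<<blocks, threads, threads * sizeof(float)>>>(dIn, dPartial, n)` gives each block its own scratch buffer. Every barrier waits for all threads of the block, and each resident block needs its own share of shared memory, so larger blocks mean costlier barriers and fewer blocks per multiprocessor.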

cmaster.matso · February 3, 2012, 8:09am

francy300485: Have you tried using 3D textures, for better data coherence?

francy300485 · February 7, 2012, 10:33am

Hi, sorry for the delay.

Actually, I haven’t tried that. Could you explain what advantages I would get, please? Thank you

tera · February 7, 2012, 12:28pm

A 3D texture can give you a better cache hit rate, because more of the other points in the cache lines fetched from global memory will fall into your area of interest.
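A rough sketch of reading a volume through a 3D texture is shown below. It uses the texture object API, which needs compute capability 3.0 or later; on the Fermi hardware discussed in this thread the older texture reference API would have been required instead. The names `makeVolumeTexture` and `readThroughTexture` are hypothetical, and the host image is assumed to be tightly packed floats with x varying fastest.

```cuda
#include <cuda_runtime.h>

// Copies a host volume into a 3D CUDA array and wraps it in a texture object.
// On success the caller owns *arrayOut and *texOut and must release them with
// cudaDestroyTextureObject and cudaFreeArray.
cudaError_t makeVolumeTexture(const float* hImage, int nx, int ny, int nz,
                              cudaArray_t* arrayOut, cudaTextureObject_t* texOut) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);  // in elements for arrays
    cudaError_t err = cudaMalloc3DArray(arrayOut, &desc, extent);
    if (err != cudaSuccess) return err;

    cudaMemcpy3DParms copy = {};
    copy.srcPtr = make_cudaPitchedPtr((void*)hImage, nx * sizeof(float), nx, ny);
    copy.dstArray = *arrayOut;
    copy.extent = extent;
    copy.kind = cudaMemcpyHostToDevice;
    err = cudaMemcpy3D(&copy);
    if (err != cudaSuccess) { cudaFreeArray(*arrayOut); return err; }

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arrayOut;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.addressMode[2] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;      // return exact voxel values
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;                  // address in voxel units

    err = cudaCreateTextureObject(texOut, &res, &td, nullptr);
    if (err != cudaSuccess) cudaFreeArray(*arrayOut);
    return err;
}

// One thread per voxel on a 3D grid; fetches the voxel through the texture.
__global__ void readThroughTexture(cudaTextureObject_t tex, float* out,
                                   int nx, int ny, int nz) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;
    // With unnormalized coordinates, +0.5f addresses the center of the texel.
    out[(size_t)x + (size_t)nx * ((size_t)y + (size_t)ny * z)] =
        tex3D<float>(tex, x + 0.5f, y + 0.5f, z + 0.5f);
}
```

In a neighborhood kernel, the `tex3D` fetch would replace the global-memory read of each neighbor. Note that clamp addressing returns the edge voxel for out-of-range coordinates, so any boundary handling still has to be done explicitly if neighbors outside the image should be skipped.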
