Huge data structures

Hello everybody,
this is my problem:

I have a 3D image stored in global memory as a 1D array, and I would like to find a spherical neighborhood of each voxel to compute some statistics inside it. I would like to assign one thread to each voxel, but if the image and the sphere radius are big I cannot do this with a single thread grid, since I’m limited by the maximum grid dimensions. I tried to split this operation into several kernels, but it is really time consuming.
Any ideas?
Thanks in advance.

You could loop over the voxels inside your kernel, so that each block does more work!

Yes, as ‘short’ said, get each thread to process more than one voxel. You only need enough threads to keep your compute units busy and hide latency. On Fermi, each compute unit (multiprocessor) can hold at most 1536 resident threads.
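A minimal sketch of the "each thread handles several voxels" idea, written as plain C++ that emulates the index arithmetic on the CPU so it can be checked without a GPU. The helper name `voxels_for_thread` and its parameters are hypothetical. In a real kernel, the thread's global index would come from `blockIdx.x * blockDim.x + threadIdx.x` and the stride from `gridDim.x * blockDim.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: lists the voxel indices one thread would process in a
// grid-stride loop. global_thread_id stands in for
// blockIdx.x * blockDim.x + threadIdx.x, and total_threads for
// gridDim.x * blockDim.x.
std::vector<std::size_t> voxels_for_thread(std::size_t global_thread_id,
                                           std::size_t total_threads,
                                           std::size_t num_voxels) {
    assert(total_threads > 0 && global_thread_id < total_threads);
    std::vector<std::size_t> voxels;
    // Start at the thread's own index and jump by the size of the whole grid,
    // so a fixed-size grid can cover an image of any size.
    for (std::size_t v = global_thread_id; v < num_voxels; v += total_threads) {
        voxels.push_back(v);
    }
    return voxels;
}
```

With this mapping the grid size no longer has to match the image size: the number of threads can be chosen to fill the multiprocessors, and the loop absorbs whatever is left over.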

When you say “time consuming” do you mean programming time or GPU runtime, just out of curiosity?

Hi RezaRob, Hi Short, thank you for your answers.

Actually, that is what I did: I use a loop to find the neighborhood, meaning that I work in parallel over the domain of the image but compute the statistics inside each neighborhood serially. It works well, but I would like to work completely in parallel. (solution 1)
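For readers following along, solution 1 can be sketched roughly as below, in plain C++ rather than CUDA so it can be checked on the CPU. Everything here is an assumption made for illustration: the names `Offset`, `sphere_offsets` and `neighborhood_mean`, the x-fastest linear layout, the choice of the mean as the statistic, and the decision to skip neighbors that fall outside the image. In a kernel, the body of `neighborhood_mean` would be what a single thread runs for its voxel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: an integer displacement from the center voxel.
struct Offset { int dx, dy, dz; };

// Precompute every offset with dx^2 + dy^2 + dz^2 <= radius^2. This is done
// once and shared by all voxels.
std::vector<Offset> sphere_offsets(int radius) {
    assert(radius >= 0);
    std::vector<Offset> offsets;
    for (int dz = -radius; dz <= radius; ++dz)
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx)
                if (dx * dx + dy * dy + dz * dz <= radius * radius)
                    offsets.push_back({dx, dy, dz});
    return offsets;
}

// Serial loop over one voxel's neighborhood (the per-thread work in
// solution 1). Assumes the image is stored x-fastest:
// index = (z * ny + y) * nx + x.
double neighborhood_mean(const std::vector<float>& image,
                         int nx, int ny, int nz,
                         int x, int y, int z,
                         const std::vector<Offset>& offsets) {
    double sum = 0.0;
    std::size_t count = 0;
    for (const Offset& o : offsets) {
        const int px = x + o.dx, py = y + o.dy, pz = z + o.dz;
        // Skip neighbors that fall outside the image.
        if (px < 0 || px >= nx || py < 0 || py >= ny || pz < 0 || pz >= nz)
            continue;
        const std::size_t idx =
            (static_cast<std::size_t>(pz) * ny + py) * nx + px;
        sum += image[idx];
        ++count;
    }
    return count ? sum / static_cast<double>(count) : 0.0;
}
```

Precomputing the offsets keeps the sphere test out of the inner loop, which matters more as the radius grows.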

Following your suggestions, I also tried to work in parallel on each neighborhood, splitting the image domain and using a for loop over the sub-images instead of launching many kernels, but this solution is still too slow compared to solution 1.

Anyway, I will try to think of a better association between threads and workload… :-) If you have any ideas, I would appreciate them ;-).

@Reza By “time consuming” I mean GPU execution time. Just a question: how can I have 1536 threads if the maximum number of threads per block is 1024 on a Fermi architecture with compute capability 2.0?

Use more blocks.
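To spell out the arithmetic behind that answer, here is a small C++ sketch. The 1536 and 1024 limits are the ones quoted in this thread. The limit of 8 resident blocks per multiprocessor is the figure the CUDA documentation gives for compute capability 2.x, added here as an assumption. The function name `resident_threads_per_sm` is hypothetical, and the estimate deliberately ignores register and shared memory usage and warp-granularity rounding, any of which can lower the real number.

```cpp
#include <algorithm>
#include <cassert>

// Limits assumed for compute capability 2.0 (Fermi).
constexpr int kMaxThreadsPerSM = 1536;     // resident threads per multiprocessor
constexpr int kMaxBlocksPerSM = 8;         // resident blocks per multiprocessor
constexpr int kMaxThreadsPerBlock = 1024;  // threads in a single block

// Upper bound on resident threads per multiprocessor for a given block size.
// Ignores register and shared memory pressure.
int resident_threads_per_sm(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kMaxThreadsPerBlock);
    // Only whole blocks can be resident, and at most kMaxBlocksPerSM of them.
    const int blocks =
        std::min(kMaxThreadsPerSM / threads_per_block, kMaxBlocksPerSM);
    return blocks * threads_per_block;
}
```

Under these assumptions a 1024-thread block leaves a third of the multiprocessor's thread slots empty, while block sizes such as 512 or 768 can reach all 1536.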

Thanks. Yes, now I understand what you meant. Actually I’m using more than one block, of course, and I’m doing some reductions inside the kernel, splitting the shared memory between blocks.

I have already edited out what you quoted, because I realized it was BS! However, there are technical reasons why smaller block sizes are sometimes necessary or convenient: things like __syncthreads() operate on entire blocks of threads, so block size matters.

francy300485: Have you tried using 3D textures, for better data coherence?

Hi, sorry for the delay.

Actually, I haven’t tried that. Could you explain what advantages I would gain, please? Thank you

A 3D texture can give you a better cache hit rate, because more of the other points in the cache lines fetched from global memory will fall inside your area of interest.
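One way to see the problem with the plain 1D layout is to count how many distinct cache lines a spherical neighborhood touches. The C++ sketch below does that for an x-fastest linear array. It is only a model: the function name `cache_lines_touched` is hypothetical, the 128-byte line size and 4-byte elements used in the example are assumptions, and it says nothing about how a texture is actually laid out in memory, which is hardware-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct cache lines covered by the spherical neighborhood of
// voxel (cx, cy, cz) when the image is a linear, x-fastest array that starts
// on a cache-line boundary. Neighbors outside the image are skipped.
std::size_t cache_lines_touched(int nx, int ny, int nz,
                                int cx, int cy, int cz, int radius,
                                std::size_t element_bytes,
                                std::size_t line_bytes) {
    assert(radius >= 0 && element_bytes > 0 && line_bytes > 0);
    std::set<std::size_t> lines;
    for (int dz = -radius; dz <= radius; ++dz)
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx) {
                if (dx * dx + dy * dy + dz * dz > radius * radius) continue;
                const int x = cx + dx, y = cy + dy, z = cz + dz;
                if (x < 0 || x >= nx || y < 0 || y >= ny || z < 0 || z >= nz)
                    continue;
                const std::size_t idx =
                    (static_cast<std::size_t>(z) * ny + y) * nx + x;
                // Byte address divided by line size gives the line number.
                lines.insert(idx * element_bytes / line_bytes);
            }
    return lines.size();
}
```

For a 64x64x64 image of 4-byte values and radius 1, the 7 voxels around (10, 10, 10) already span 5 separate lines, because neighbors in y and z sit a whole row or a whole slice away in memory. Each of those lines carries 32 values, most of which lie outside the sphere.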