I have a 3D image stored in global memory as a 1D array, and I would like to find a spherical neighborhood of each voxel to compute some statistics inside it. I would like to associate one thread with each voxel, but if the image and the sphere radius are big I cannot do this with a single thread grid, since I’m limited by the maximum grid dimensions. I tried to split this operation across different kernels, but it is really time consuming.
Thanks in advance.
Yes, as ‘short’ said, get each thread to process more than one voxel. You only need enough threads to keep your compute units busy and hide latency. On Fermi, each compute unit (multiprocessor) can have at most 1536 resident threads.
When you say “time consuming” do you mean programming time or GPU runtime, just out of curiosity?
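The "more than one voxel per thread" idea can be sketched with a grid-stride loop, so that the grid size no longer depends on the image size. This is only a minimal sketch under assumptions not stated in the thread: the image is laid out with x varying fastest, the statistic is the mean, voxels outside the image are skipped, and the names `sphereMean` and `sphereMeanHost` and the 64 x 256 launch configuration are made up for illustration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread walks over several voxels, so a fixed-size
// grid covers an image of any size. Assumed layout: index = (z*ny + y)*nx + x.
__global__ void sphereMean(const float* img, float* out,
                           int nx, int ny, int nz, int r)
{
    const long long total  = (long long)nx * ny * nz;
    const long long stride = (long long)blockDim.x * gridDim.x;
    for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += stride) {
        const int x = static_cast<int>(i % nx);
        const int y = static_cast<int>((i / nx) % ny);
        const int z = static_cast<int>(i / ((long long)nx * ny));
        float sum = 0.0f;
        int count = 0;
        // Serial scan of the bounding cube, keeping only points inside the sphere.
        for (int dz = -r; dz <= r; ++dz)
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    if (dx * dx + dy * dy + dz * dz > r * r) continue;
                    const int xx = x + dx, yy = y + dy, zz = z + dz;
                    if (xx < 0 || xx >= nx || yy < 0 || yy >= ny ||
                        zz < 0 || zz >= nz) continue;
                    sum += img[((long long)zz * ny + yy) * nx + xx];
                    ++count;
                }
        out[i] = sum / count;  // count >= 1: the center voxel is always inside
    }
}

// Hypothetical host wrapper: copies the image to the device, runs the kernel
// with a fixed grid, and copies the per-voxel means back.
std::vector<float> sphereMeanHost(const std::vector<float>& img,
                                  int nx, int ny, int nz, int r)
{
    const size_t bytes = img.size() * sizeof(float);
    float* dImg = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dImg, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dImg, img.data(), bytes, cudaMemcpyHostToDevice);
    sphereMean<<<64, 256>>>(dImg, dOut, nx, ny, nz, r);  // fixed grid, any image size
    std::vector<float> out(img.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dOut);
    return out;
}
```

Calling `sphereMeanHost(img, nx, ny, nz, r)` from host code returns one mean per voxel. The number of blocks here is a guess; a multiple of the multiprocessor count is a common starting point for tuning.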
Actually, I did so, using a loop to find the neighborhood: I work in parallel over the image domain, but I compute my statistics inside each neighborhood serially. It works well, but I would like to work completely in parallel. (solution 1)
Following your suggestions, I also tried to work in parallel on each neighborhood, splitting the image domain and using a for loop over each sub-image instead of splitting the domain and launching many kernels, but this solution is still too slow compared with solution 1.
Anyway, I will try to think of a better association between threads and workload… :-) If you have any ideas, I would appreciate them ;-).
@Reza By time consuming I mean GPU execution time. Just a question: how can I have 1536 threads if the maximum number of threads per block is 1024 on a Fermi architecture with compute capability 2.0?
What you quoted I have already edited out, because I realized it was BS! However, there are technical reasons why smaller block sizes are sometimes necessary or simply convenient: things like __syncthreads() operate on an entire block of threads, so block size matters.
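On the 1536 versus 1024 question: these are two different limits. 1024 is the maximum number of threads in one block, while 1536 is the maximum number of threads resident on one multiprocessor at a time, possibly spread over several blocks (for example, three blocks of 512 threads). Both values can be read at runtime from `cudaDeviceProp`. The sketch below assumes nothing beyond the CUDA runtime API; the helper names `blocksByThreadLimit` and `printThreadLimits` are made up for illustration, and the estimate deliberately ignores the register, shared-memory, and resident-block limits, which can lower the real number.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Upper bound on resident blocks per multiprocessor from the thread limit
// alone. Registers, shared memory, and the per-SM block limit are ignored.
int blocksByThreadLimit(int maxThreadsPerSM, int blockSize)
{
    return maxThreadsPerSM / blockSize;
}

// Prints the per-block and per-multiprocessor thread limits of a device.
void printThreadLimits(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        std::printf("could not query device %d\n", device);
        return;
    }
    std::printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
    std::printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("multiprocessors:                %d\n", prop.multiProcessorCount);
    std::printf("512-thread blocks per SM (thread limit only): %d\n",
                blocksByThreadLimit(prop.maxThreadsPerMultiProcessor, 512));
}
```

Calling `printThreadLimits(0)` reports the limits for the first device. With the compute capability 2.0 limits discussed here, a 1024-thread block leaves a third of the multiprocessor's thread slots unused, which is one more reason smaller blocks can be preferable.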