Fast reading of some array

Julek · December 16, 2009, 1:49pm

I have one block and ‘n’ threads.
I have a float array ‘arr’ allocated on the GPU; its size is k*n (k is known before the kernel is launched).
The t-th thread needs to access the elements with indices from t*k to (t+1)*k-1 (e.g. when k=4, the second thread, t=1, accesses elements 4, 5, 6, 7).
How can I do this fast? This solution is slow (uncoalesced):

for(int a=0;a<k;++a)
{
// Use arr[threadIdx.x * k + a]
}

I tried one solution: I read blockDim.x elements into shared memory (coalesced read), then each thread checks whether it can use any of the loaded elements. Then I read the next blockDim.x elements, and repeat until all k*n elements have been read.
The problem is that I get many divergent threads, and this solution is only about 2 times as fast as the first one.

Is there any better solution?

LSChien · December 16, 2009, 2:18pm

There is no single way to parallelize a problem, and I think your setup is not a good one.

Why not go fine-grained (each thread deals with one element)?
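
One possible reading of the fine-grained suggestion, as a rough sketch only: if the work on each element is independent (an assumption, since the thread does not say what is done with the values), one thread per element makes consecutive threads touch consecutive addresses, so the accesses coalesce without reordering the array. The names `scale_elements` and `launch_scale`, the scaling operation, and the block size of 256 are placeholders chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical fine-grained kernel: one thread per array element.
// Thread i touches arr[i], so neighbouring threads access neighbouring
// addresses and the global-memory accesses coalesce.
__global__ void scale_elements(float *arr, int total, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total)            // guard the partially filled last block
        arr[i] *= factor;     // stand-in for the real per-element work
}

// Host helper (illustrative): covers all n*k elements with 256-thread blocks.
void launch_scale(float *d_arr, int n, int k, float factor)
{
    int total = n * k;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    scale_elements<<<blocks, threads>>>(d_arr, total, factor);
}
```

This only applies when the k elements of a group do not have to be processed together by the same thread.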

Julek · December 16, 2009, 6:35pm

I don’t understand what you mean…

I can’t change the order of the values in the array.

kbam · December 17, 2009, 7:18am

What is the typical size of arr, or the range of sizes that you want to work on?

The GPU works best if you have multiple blocks, so can you divide the problem up into multiple smaller blocks?
As a guideline, NVIDIA says that for best performance the GPU prefers 8k threads or more (for example, 32 blocks of 256 threads).
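
A minimal sketch of how the multi-block idea might be combined with a coalesced read, under several assumptions: each thread still owns k consecutive elements, a block's whole chunk of blockDim.x * k floats fits in shared memory (which limits k), and summing the elements stands in for the real work. The names `sum_groups` and `launch_sum_groups`, the output array `out`, and the block size of 128 are hypothetical. Each block copies its contiguous chunk into shared memory with consecutive threads reading consecutive addresses, then every thread reads its own k values from shared memory with no per-element "is this mine?" test.

```cuda
#include <cuda_runtime.h>

// Each thread processes its own k consecutive elements (here: sums them).
// Dynamic shared memory must hold blockDim.x * (k | 1) floats.
__global__ void sum_groups(const float *arr, float *out, int n, int k)
{
    extern __shared__ float tile[];
    int stride = k | 1;   // odd row stride keeps the per-thread reads below
                          // free of shared-memory bank conflicts
    int first_group = blockIdx.x * blockDim.x;            // first group of this block
    int groups = min((int)blockDim.x, n - first_group);   // last block may be partial
    int count = groups * k;                                // elements owned by this block
    const float *src = arr + (size_t)first_group * k;

    // Coalesced load: consecutive threads read consecutive global addresses.
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        tile[(i / k) * stride + (i % k)] = src[i];
    __syncthreads();

    if (threadIdx.x < groups) {
        float s = 0.0f;
        for (int a = 0; a < k; ++a)
            s += tile[threadIdx.x * stride + a];   // same data as arr[t * k + a]
        out[first_group + threadIdx.x] = s;
    }
}

// Host helper (illustrative): n groups of k floats, 128 threads per block.
void launch_sum_groups(const float *d_arr, float *d_out, int n, int k)
{
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    size_t shmem = (size_t)threads * (k | 1) * sizeof(float);
    sum_groups<<<blocks, threads, shmem>>>(d_arr, d_out, n, k);
}
```

If k is too large for the chunk to fit in shared memory, the same loading loop could be applied to a slice of each group at a time; that variant is not shown here.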
