I searched the forums for a sum function.
Many people said to look at the reduction example.
So I am reading it now.
But I don't understand page 4 of the reduction document.
Page 4 says:
Problem : Global Synchronization
CUDA has no global synchronization. Why?
Expensive to build in hardware for GPUs with high processor count
Would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency
I don't understand this part.
My program needs many blocks.
I set the number of blocks to M, with up to 512 threads per block.
Until yesterday I believed that launching the maximum number of blocks at once was the most efficient approach!
But that was my mistake, according to the statement above ("Would force programmer to run fewer blocks ...").
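To see what the quoted limit (# multiprocessors * # resident blocks / multiprocessor) comes to on a given card, here is a minimal sketch that asks the CUDA runtime for both numbers. It assumes a toolkit that provides cudaOccupancyMaxActiveBlocksPerMultiprocessor; dummy_kernel and max_resident_blocks are hypothetical names made up for this illustration, not something from the reduction document.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: it exists only so the occupancy query
// has a real kernel to inspect.
__global__ void dummy_kernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Returns how many blocks of the given size can be resident on device 0
// at the same time, or -1 if a CUDA call fails (for example, no GPU).
int max_resident_blocks(int threads_per_block) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, dummy_kernel, threads_per_block, 0) != cudaSuccess)
        return -1;

    // Print the limits the device itself reports.
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions  : %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions   : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("multiprocessors       : %d\n", prop.multiProcessorCount);
    printf("resident blocks per SM: %d\n", blocks_per_sm);

    return prop.multiProcessorCount * blocks_per_sm;
}
```

Calling max_resident_blocks(512) would report the bound for 512-thread blocks. As far as I can tell, this number only matters if blocks have to wait for each other; a grid with more blocks than this is still valid.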
My card is a GeForce 9800 GT:
a block has up to 512 threads;
a grid can be up to 3-D: 512 * 512 * 64.
I need a 2-dimensional grid (512 * 512),
which has 512 * 512 = 262144 blocks.
How many of those blocks can run at one time?
And how do I code that?
Are the blocks split up automatically, or do I have to do it manually?
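As far as I understand the CUDA execution model, the whole grid is requested in one launch and the hardware assigns blocks to multiprocessors as they become free, so nothing is split by hand. A minimal sketch of such a launch, assuming float data and device pointers that are already allocated; scale_kernel and launch_scale are hypothetical names for this illustration:

```cuda
#include <cuda_runtime.h>

// Each thread scales one element. The linear block index is built from the
// 2-D grid coordinates, then combined with the thread index.
__global__ void scale_kernel(const float *arr, float *result, float alpha, int n) {
    int block = blockIdx.x + gridDim.x * blockIdx.y;
    int i = block * blockDim.x + threadIdx.x;
    if (i < n) result[i] = arr[i] * alpha;
}

// Launches blocks_x * blocks_y blocks in a single call and waits for them.
// d_arr and d_result are device pointers holding blocks_x * blocks_y * threads floats.
cudaError_t launch_scale(const float *d_arr, float *d_result, float alpha,
                         int blocks_x, int blocks_y, int threads) {
    dim3 grid(blocks_x, blocks_y);
    int n = blocks_x * blocks_y * threads;
    scale_kernel<<<grid, threads>>>(d_arr, d_result, alpha, n);
    cudaError_t err = cudaGetLastError();      // reports an invalid launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();            // waits until every block has finished
}
```

For the grid in this post the call would be launch_scale(d_arr, d_result, alpha, 512, 512, 512).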
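My code, simplified:
int idx = threadIdx.x;                        // thread index within the block
int j = blockIdx.x + gridDim.x * blockIdx.y;  // linear block index in the 2-D grid
int i = j * blockDim.x + idx;                 // global element index, so blocks do not overlap
result[i] = arr[i] * ALPHA;
sums[j] += result[i];                         // all threads of block j update sums[j] without synchronization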
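The sums[j] += line has every thread of a block updating the same element at once, so updates can be lost. Below is a minimal sketch of a per-block sum using shared memory and __syncthreads(), along the lines of the simplest version in the reduction example. The float element type, the value of ALPHA, the fixed block size THREADS, and the names scale_and_sum and run_scale_and_sum are all assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>

#define THREADS 512     // assumed block size; must be a power of two for the loop below
#define ALPHA   2.0f    // assumed value; the post does not say what ALPHA is

// Each block scales its THREADS elements and writes their sum to sums[j].
// Must be launched with exactly THREADS threads per block.
__global__ void scale_and_sum(const float *arr, float *result, float *sums) {
    __shared__ float partial[THREADS];            // one slot per thread in the block

    int idx = threadIdx.x;
    int j = blockIdx.x + gridDim.x * blockIdx.y;  // linear block index
    int i = j * blockDim.x + idx;                 // global element index

    result[i] = arr[i] * ALPHA;
    partial[idx] = result[i];
    __syncthreads();                              // all partial[] values are written

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (idx < stride) partial[idx] += partial[idx + stride];
        __syncthreads();
    }

    if (idx == 0) sums[j] = partial[0];           // one write per block, no race
}

// Host helper: launches a blocks_x * blocks_y grid and waits for completion.
// All pointers are device pointers sized for the whole grid.
cudaError_t run_scale_and_sum(const float *d_arr, float *d_result, float *d_sums,
                              int blocks_x, int blocks_y) {
    dim3 grid(blocks_x, blocks_y);
    scale_and_sum<<<grid, THREADS>>>(d_arr, d_result, d_sums);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

Only block-level synchronization is needed here, because each block writes its own sums[j]. Adding the per-block sums into one grand total would need a second kernel launch or a loop on the host.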
Thanks for reading.