Optimum performance: Blocks/Threads/Dimensions

On a 9800 GTX+.

My code is:

for (unsigned int i = 0; i < 256; i++)
{
    kernel_proc <<< 16, 16 >>> (RESULT, i);
}

__global__ void kernel_proc(unsigned int *RESULT, unsigned int i)
{
    // each of the 256 threads covers a 0x10000-wide slice of the range
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s = i * 0x1000000 + idx * 0x10000;
    unsigned int e = i * 0x1000000 + idx * 0x10000 + 0x10000;
    for (unsigned int k = s; k < e; k++) mycompute(k);
}

The total work is 0x100000000 iterations.

When I change the number of blocks/threads so that (blocks_num * threads_num == 256),
I always get the same computation time.

Any way to reduce computation time?

You’re running only 256 threads, right? That’s far too few for a GPU.
The number of threads per block is determined by the register/shared memory usage of your kernel and should be chosen to maximize occupancy.
The number of blocks should be chosen to provide fair utilization of GPU resources. Exact figures depend on the kernel, but I personally would not spawn a grid smaller than, say, 256 blocks of 16 threads (if the kernel is complex and takes a long time to run).
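For example (a rough, untested sketch: I’ve kept your total of 0x100000000 iterations fixed by shrinking the per-thread slice from 0x10000 to 0x1000, which is my assumption to make the totals match):

// 256 launches x (256 blocks x 16 threads) x 0x1000 per thread = 0x100000000
__global__ void kernel_proc(unsigned int *RESULT, unsigned int i)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;   // 0..4095
    unsigned int s = i * 0x1000000 + idx * 0x1000;
    for (unsigned int k = s; k < s + 0x1000; k++) mycompute(k);
}

// host loop stays the same, just with a bigger grid:
for (unsigned int i = 0; i < 256; i++)
{
    kernel_proc <<< 256, 16 >>> (RESULT, i);
}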

Thanks for the answer.

My kernel uses 13 registers (as the profiler shows) and 28 bytes of shared memory.

Occupancy is 67%.

Increasing the number of blocks doesn’t help: 256x16 gives the same computation time as 16x16.

I need this total number of iterations:

16 blocks x 16 threads x 256 (i) x 0x10000 (per cycle) = 0x100000000 iterations, the maximum value I need.

For 13 registers per kernel I would say a block size of 256 threads is the best choice; the rough occupancy arithmetic is below.
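(A back-of-the-envelope estimate, assuming the 8192 registers and 768 resident threads per multiprocessor of a compute capability 1.1 chip like the 9800 GTX+:

13 registers x 256 threads = 3328 registers per block
2 blocks per SM = 6656 registers, which fits in 8192 -> 512 resident threads
512 / 768 = 67% occupancy, which matches your profiler figure
a 3rd block would need 9984 registers, which does not fit

Exact numbers depend on register allocation granularity, so treat this as an estimate.)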

If you need to do mycompute(k) for each k in the range 0…0xFFFFFFFF, then I’d suggest you do something like this:

#define BLOCK_SIZE  256
#define GRID_SIZE   1024

__global__ void mycomputekernel( unsigned int offset )
{
    unsigned int id = offset + blockIdx.x * BLOCK_SIZE + threadIdx.x;
    mycompute( id );
}

// -- Host code --
unsigned long long offset = 0;
while( offset < 0x100000000ULL )
{
    mycomputekernel<<< GRID_SIZE, BLOCK_SIZE >>>( (unsigned int) offset );
    cudaThreadSynchronize();
    offset += GRID_SIZE * BLOCK_SIZE;
}

This removes the loop inside your kernel and makes the code cleaner and more readable, IMO. You should also adjust GRID_SIZE to something reasonable for your kernel (so that each launch runs neither too quickly nor too slowly; something in the 50-500 ms range is reasonable); a quick way to measure that is sketched below.
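To check where a launch lands in that range, CUDA events give you the kernel time directly (a small sketch wrapped around the launch above):

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
mycomputekernel<<< GRID_SIZE, BLOCK_SIZE >>>( (unsigned int) offset );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );   // aim for roughly 50-500 ms

cudaEventDestroy( start );
cudaEventDestroy( stop );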

Thanks for the help, I tried your version of the code.

Execution time for 0x100000000 iterations was 180 seconds. The old version took 207 seconds, but froze the PC much more. Thanks again.
Occupancy is 83% according to the profiler, with blocks = 16 and threads = 320 (I benchmarked all the variants).

Now, in some mycompute(idx) calls I get correct results which I have to send to the host.
I do the checking of the result inside mycomputekernel(), but how can I inform the host about a found value?

Do I need to copy the found value to a device_result variable? That seems very slow…

I made an array for results in shared memory and store the found values into it. Is that fast?

You can have a variable in device memory which will act as a flag: if mycompute() has something to report to the host, you just set this flag to 1.
From the host you only check this flag, and if it is set, then you download the whole value.
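Something like this, as a minimal sketch. It assumes mycompute() returns nonzero when it finds a result (your actual signature may differ), and found_flag / found_value are names I made up for illustration. Note that a __device__ variable lives in global memory, so unlike shared memory it survives after the kernel finishes:

__device__ unsigned int found_flag  = 0;   // 1 = a result is waiting
__device__ unsigned int found_value = 0;   // the result itself

__global__ void mycomputekernel( unsigned int offset )
{
    unsigned int id = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if( mycompute( id ) )        // hypothetical: nonzero means "found"
    {
        found_value = id;        // assumes hits are rare, so races don't matter
        found_flag  = 1;
    }
}

// -- Host code, after each launch --
unsigned int flag = 0;
cudaMemcpyFromSymbol( &flag, found_flag, sizeof(flag) );
if( flag )
{
    unsigned int value;
    cudaMemcpyFromSymbol( &value, found_value, sizeof(value) );
    // handle value here, then clear the flag with cudaMemcpyToSymbol()
}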