Optimum performance: Blocks/Threads/Dimensions

On a 9800 GTX+.

My code is:

for (unsigned int i = 0; i < 256; i++)
{
    kernel_proc <<< 16, 16 >>> (RESULT, i);
}

__global__ void kernel_proc(unsigned int *RESULT, unsigned int i)
{
    // each of the 256 threads covers a 0x10000-wide slice of the range
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s = i * 0x1000000 + idx * 0x10000;
    unsigned int e = i * 0x1000000 + idx * 0x10000 + 0x10000;
    for (unsigned int k = s; k < e; k++) mycompute(k);
}

The total work is 0x100000000 iterations.

When I change the number of blocks/threads so that (blocks_num * threads_num == 256),
I always get the same computation time.

Any way to reduce computation time?

You’re running only 256 threads, right? That’s far too few for a GPU.
The number of threads per block is determined by the register/shared memory usage of your kernel and should be chosen to maximize occupancy.
The number of blocks should be chosen to provide fair utilization of GPU resources. Exact figures depend on the kernel, but I personally would not spawn a grid smaller than, say, 256 blocks of 16 threads (if the kernel is complex and takes a long time to run).
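For example (a rough, untested sketch: I’ve kept your total of 0x100000000 iterations fixed by shrinking the per-thread slice from 0x10000 to 0x1000, which is my assumption to make the totals match):

// 256 launches x (256 blocks x 16 threads) x 0x1000 per thread = 0x100000000
__global__ void kernel_proc(unsigned int *RESULT, unsigned int i)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;   // 0..4095
    unsigned int s = i * 0x1000000 + idx * 0x1000;
    for (unsigned int k = s; k < s + 0x1000; k++) mycompute(k);
}

// host loop stays the same, just with a bigger grid:
for (unsigned int i = 0; i < 256; i++)
{
    kernel_proc <<< 256, 16 >>> (RESULT, i);
}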

Thanks for the answer.

My kernel uses 13 registers (as the profiler shows) and 28 bytes of shared memory.

Occupancy is 67%.

Increasing the number of blocks doesn’t help: 256x16 gives the same computation time as 16x16.

I need this total number of iterations:

16 blocks x 16 threads x 256 (i) x 0x10000 (per cycle) = 0x100000000 iterations, the maximum value I need.

For 13 registers per kernel I would say a block size of 256 threads is the best choice; the rough occupancy arithmetic is below.
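(A back-of-the-envelope estimate, assuming the 8192 registers and 768 resident threads per multiprocessor of a compute capability 1.1 chip like the 9800 GTX+:

13 registers x 256 threads = 3328 registers per block
2 blocks per SM = 6656 registers, which fits in 8192 -> 512 resident threads
512 / 768 = 67% occupancy, which matches your profiler figure
a 3rd block would need 9984 registers, which does not fit

Exact numbers depend on register allocation granularity, so treat this as an estimate.)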

If you need to do mycompute(k) for each k in the range 0…0xFFFFFFFF, then I’d suggest you do something like this:

#define BLOCK_SIZE  256
#define GRID_SIZE   1024

__global__ void mycomputekernel( unsigned int offset )
{
    unsigned int id = offset + blockIdx.x * BLOCK_SIZE + threadIdx.x;
    mycompute( id );
}

// -- Host code --
unsigned long long offset = 0;
while( offset < 0x100000000ULL )
{
    mycomputekernel<<< GRID_SIZE, BLOCK_SIZE >>>( (unsigned int) offset );
    cudaThreadSynchronize();
    offset += GRID_SIZE * BLOCK_SIZE;
}

This removes the loop inside your kernel and makes the code cleaner and more readable, IMO. You should also adjust GRID_SIZE to something reasonable for your kernel (so that each launch runs neither too quickly nor too slowly; something in the 50-500 ms range is reasonable); a quick way to measure that is sketched below.
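To check where a launch lands in that range, CUDA events give you the kernel time directly (a small sketch wrapped around the launch above):

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
mycomputekernel<<< GRID_SIZE, BLOCK_SIZE >>>( (unsigned int) offset );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );   // aim for roughly 50-500 ms

cudaEventDestroy( start );
cudaEventDestroy( stop );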

Thanks for the help, I tried your version of the code.

Execution time for 0x100000000 iterations was 180 seconds. The old version took 207 seconds, but froze the PC much more. Thanks again.
Occupancy is 83% according to the profiler, with blocks = 16 and threads = 320 (I benchmarked all the variants).

Now, in some mycompute(idx) calls I get correct results which I have to send to the host.
I do the checking of the result inside mycomputekernel(), but how can I inform the host about a found value?

Do I need to copy the found value to a device_result variable? That seems very slow…

I made an array for results in shared memory and store the found values into it. Is that fast?

You can have a variable in device memory which will act as a flag: if mycompute() has something to report to the host, you just set this flag to 1.
From the host you only check this flag, and if it is set, then you download the whole value.
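Something like this, as a minimal sketch. It assumes mycompute() returns nonzero when it finds a result (your actual signature may differ), and found_flag / found_value are names I made up for illustration. Note that a __device__ variable lives in global memory, so unlike shared memory it survives after the kernel finishes:

__device__ unsigned int found_flag  = 0;   // 1 = a result is waiting
__device__ unsigned int found_value = 0;   // the result itself

__global__ void mycomputekernel( unsigned int offset )
{
    unsigned int id = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if( mycompute( id ) )        // hypothetical: nonzero means "found"
    {
        found_value = id;        // assumes hits are rare, so races don't matter
        found_flag  = 1;
    }
}

// -- Host code, after each launch --
unsigned int flag = 0;
cudaMemcpyFromSymbol( &flag, found_flag, sizeof(flag) );
if( flag )
{
    unsigned int value;
    cudaMemcpyFromSymbol( &value, found_value, sizeof(value) );
    // handle value here, then clear the flag with cudaMemcpyToSymbol()
}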