You’re running only 256 threads, right? That’s far too few to keep a GPU busy.
The number of threads per block is determined by the register and shared memory usage of your kernel, and should be chosen to maximize occupancy.
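As a side note, newer CUDA toolkits (6.5+) can suggest an occupancy-maximizing block size for you. A minimal sketch; `mycomputeKernel` here is just a placeholder standing in for your actual kernel:

```cuda
#include <cstdio>

// Placeholder kernel standing in for the poster's actual mycomputeKernel.
__global__ void mycomputeKernel() { }

int main()
{
    // The occupancy API inspects the kernel's register/shared memory usage
    // and suggests a block size that maximizes theoretical occupancy.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       mycomputeKernel, 0, 0);
    printf("suggested block size: %d (min grid size for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

Benchmarking a few candidate sizes around the suggested one, as you did, is still the most reliable approach.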
The number of blocks should be chosen to provide fair utilization of GPU resources. Exact figures depend on the kernel, but I personally would not spawn a grid smaller than, say, 256 blocks of 16 threads (if the kernel is complex and takes a long time to run).
This removes the loop inside your kernel and makes the code cleaner and more readable, IMO. You should also adjust GRID_SIZE to something reasonable for your kernel (so that it runs neither too quickly nor too slowly; something in the 50-500 ms range is reasonable).
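A minimal sketch of the loop-free layout, assuming hypothetical names (`mycompute`, `GRID_SIZE`, `BLOCK_SIZE`, the `base` parameter) since the poster's actual code isn't shown:

```cuda
#define GRID_SIZE  256   // placeholder values; tune for your kernel
#define BLOCK_SIZE 16

__global__ void mycomputeKernel(unsigned long long base)
{
    // Each thread handles exactly one index; no loop inside the kernel.
    unsigned long long idx = base
                           + (unsigned long long)blockIdx.x * blockDim.x
                           + threadIdx.x;
    mycompute(idx);   // the poster's per-index work, assumed __device__
}

// Host side: cover the full iteration range in grid-sized chunks, so
// each launch stays short and the display driver is not starved.
void runAll(unsigned long long total)
{
    unsigned long long chunk = (unsigned long long)GRID_SIZE * BLOCK_SIZE;
    for (unsigned long long base = 0; base < total; base += chunk)
        mycomputeKernel<<<GRID_SIZE, BLOCK_SIZE>>>(base);
    cudaDeviceSynchronize();
}
```

Short launches like this are also why the PC should freeze less: the driver gets control back between kernels.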
Execution time for 0x100000000 iterations was 180 seconds. The old version took 207 seconds, but froze the PC much more. Thanks again.
Occupancy is 83% according to the profiler, with blocks = 16 and threads = 320 (I benchmarked all the variants).
Now, for some mycompute(idx) I get correct results which I have to send to the host.
I added a check of the result inside mycomputeKernel(), but how can I inform the host about a found value?
Do I need to copy every found value to device_result? That seems very slow…
I made an array for results in shared memory and store the found values into it; is that fast?
You can have a variable in device memory which acts as a flag: if mycompute() has something to report to the host, you just set this flag to 1.
From the host you only check this flag, and if it is set, you download the whole result.
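A sketch of the flag idea, with hypothetical names (`found_flag`, `found_value`, `isResultWeWant`) since the actual result type and check aren't shown. Note that shared memory won't work for this: it lives only for a block's lifetime and isn't visible to the host, so the flag and result must be in global device memory:

```cuda
__device__ int found_flag = 0;                 // cheap to poll from the host
__device__ unsigned long long found_value;     // downloaded only when flag is set

__global__ void mycomputeKernel(unsigned long long base)
{
    unsigned long long idx = base
                           + (unsigned long long)blockIdx.x * blockDim.x
                           + threadIdx.x;
    unsigned long long r = mycompute(idx);     // poster's per-index work
    if (isResultWeWant(r)) {                   // hypothetical check
        found_value = r;    // last writer wins; use atomics if multiple hits matter
        found_flag  = 1;
    }
}

// Host side: poll the single int flag after each launch; only when it is
// set do we pay for transferring the full result.
void checkForResult()
{
    int flag = 0;
    cudaMemcpyFromSymbol(&flag, found_flag, sizeof(flag));
    if (flag) {
        unsigned long long value;
        cudaMemcpyFromSymbol(&value, found_value, sizeof(value));
        // ... use value, then reset the flag for subsequent launches
        int zero = 0;
        cudaMemcpyToSymbol(found_flag, &zero, sizeof(zero));
    }
}
```

The flag copy is a single 4-byte transfer, so polling it between launches costs almost nothing compared with copying results unconditionally.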