My CUDA kernel runs a single loop with a fixed number of iterations (or, in another variant, walks over an array).
If I increase the number of kernel runs by a factor of 4, my CUDA program keeps the same performance.
But if I instead increase the number of passes through this loop (or array) inside each kernel by a factor of 4, it becomes about 1.5 times slower.
[ number of CUDA kernel runs: 310000 (1240000 with the x4 multiplier) ]
Why is that?
[ CUDA device with compute capability 1.1 (sm_11) ]
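
To make the two cases concrete, here is a minimal sketch of the kind of comparison described above. It is not the actual program: `loopKernel`, `timeRun`, the loop body, the block size of 256, and the iteration count of 64 are all hypothetical, and it assumes that "kernel runs" means the number of threads, each executing the loop once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread runs one loop with a fixed iteration count.
__global__ void loopKernel(float *out, int n, int iterations)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < iterations; ++k)
        acc += 1.0f;              // placeholder for the real per-iteration work
    out[i] = acc;
}

// Launch the kernel once and return the elapsed time in milliseconds.
static float timeRun(int n, int iterations)
{
    float *d_out = NULL;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n

    cudaEventRecord(start, 0);
    loopKernel<<<blocks, threads>>>(d_out, n, iterations);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    const int runs  = 310000;   // from the question
    const int iters = 64;       // assumed; the question does not give this value

    timeRun(runs, iters);       // warm-up launch, result ignored

    printf("baseline:       %f ms\n", timeRun(runs, iters));
    printf("4x kernel runs: %f ms\n", timeRun(4 * runs, iters));
    printf("4x iterations:  %f ms\n", timeRun(runs, 4 * iters));
    return 0;
}
```

Timing the three launches separately like this should show which of the two x4 changes actually costs more on the device in question.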