Strange timing behavior explanation needed

I am working on a raycaster and got a strang timing behavior -
maybe somebody can help me to understand the received results…
( I attached the speed-table as gif)

The time to render one frame (512x512) seems to be dependent on the ray-cast distance,
which makes sense, but independent from the number of raycasted columns…

However, raycasting twice as many columns means also twice
the computational effort…

The current implementation raycasts one column by using one thread, however it
seems to be no difference for casting 100 or 4000 columns … ???
Seems there to be an error in my implementation or is it because of the cards design ?

hope somebody understand why this might happen…



in general 20 ms is not a reasonable time to measure. I would suggest you wrap your program into a for loop doing the same thing over and over again until you come at least above 1sec of wall clock time.

The following could be wrong if I didn’t understand your scheduling!

If you run with 100 columns and each thread treats one column, you actually have 28 threads or processors left on the device which are doing nothing. As all threads wait for the memory with their access it could be possible that the additional threads/columns hide within that latency. Perhaps you can enlarge your problem to more columns?

How do you schedule the blocks?

Thank you for the quick reply.
Yes, I think the distribution of the threads isn’t be optimal indeed - my card is a GTS, so I only have 96 instead of 128 processors…

However, even there is a fraction left because of 96 cpu’s, the should be a
difference between runnning two/three blocks for 256 columns or ten for 2048
in my opinion.

I just tried to see what happens if I run the kernel multiple times:
1x : 44ms
2x : 88ms
Seems that this is linear.

The grid to call the kernel looks as follows:

dim3 threads( 128,1,1 );
dim3 grid( columns / threads.x,1,1 );

cudaRender<<< grid, threads, 16300 >>>(render_gpu);

global void cudaRender(Render* render)
int x = (blockIdx.x * blockDim.x + threadIdx.x);

Do I have to use the y- and z-part of the grid to obtain full performance?
I cant increase the threads to more than 128 however (register limit)

I found it is getting linear from 8192 columns:
4096 col : 44ms
8192 col : 44ms
16384 col : 88ms
32768 col : 144 ms
However, I can’t explain it.

I just made some more detailed performance analysis using the profiler - and there is quite a difference between 2048 columns (session2) and 4096 columns (session3).
The gpu-time is same - however, there is a major difference in the cta_launched field.

Is it possible to take influence to this somehow? It seems to be the reason for my question.

The results of the profiler are attached below.


Today I continued to figure out what could have caused this strange behavior and got the following result:

(128 threads)
raycast 4096 columns : 44ms
raycast 4096 columns but run the code with x*2 calls like if(x&1)render(x/2): 50ms

(64 threads)
raycast 2048 columns : 44ms
(128 threads)
raycast 2048 columns : 44ms
raycast 2048 columns but run the code with x*2 calls like if(x&1)render(x/2): 38ms

-> The code sometimes gets faster as more the GPU has to do - its 4096 instead of 2048 calls and only half the processors are used as I return for every second thread.

I think I should stop my optimizing/profiling here before I get crazy ;-)


Can you try the 512 to 2048 ?

I’m not an expert on what CTA might be, I think it is correlated to the number of blocks you invoke.

As I don’t know your access pattern to the memory, the strided access could cause a lot of trouble. However besides everything, as your profiler output shows a lot of incoherent accesses and local loads, I would try to optimize them away first.

At the end, while you have uncoalesced memory access more threads could fill the latency and actually perform as fast as less threads.