Strange performance when varying the block size


I’m running a performance test of a fluid algorithm on a GeForce MX150, varying the block size from 32 to 1024 in powers of two while keeping the total number of threads fixed.
The results show surprisingly better performance for smaller block sizes, with either CUDA or OpenCL.

What can be the reason for this?

Thanks in advance!

It’s impossible to say without more details. My first thought is register pressure. Try profiling your kernel with Nsight Compute. It provides insight into the bottlenecks you might have, along with tips for addressing them.

According to Nsight Compute the bottleneck is visible in the scheduler statistics:

The scheduler statistics suggest that I should increase the number of eligible warps by reducing the time the active warps are stalled.

Is there a simple way to reduce the stall time of active warps?

The warp statistics say:
[Warning] On average each warp of this kernel spends 185.8 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about 75.4% of the total average of 246.3 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality or by changing the cache configuration, and consider moving frequently used data to shared memory.
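To make the "memory access patterns" part of that hint a bit more concrete, here is a rough host-side C++ sketch (plain C++, not CUDA, and not taken from the fluid code in question). It counts how many 32-byte sectors one warp touches for a single load instruction when thread t reads element t * stride. The function name `sectors_per_warp_load` is made up for illustration, and the model assumes a sector-aligned base address and 32-byte sectors as reported by Nsight Compute.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct 32-byte sectors touched when the 32 threads of one warp
// each load one element of `elem_bytes` bytes, with thread t reading element
// index t * stride_elems.
// ASSUMPTIONS (illustrative model only): the base address is sector-aligned,
// elements do not straddle a sector boundary, and sectors are 32 bytes.
int sectors_per_warp_load(std::size_t stride_elems, std::size_t elem_bytes) {
    assert(elem_bytes > 0 && elem_bytes <= 32);
    const std::size_t kSectorBytes = 32;
    const int kWarpSize = 32;

    std::set<std::size_t> sectors;
    for (int t = 0; t < kWarpSize; ++t) {
        // Byte offset of the element read by thread t, relative to the base.
        std::size_t byte_addr = t * stride_elems * elem_bytes;
        sectors.insert(byte_addr / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}
```

With 4-byte floats, `sectors_per_warp_load(1, 4)` gives 4 sectors for the whole warp, while a stride of 8 elements or more gives 32, one per thread. The fewer sectors a warp needs per load, the less L1TEX traffic sits behind each stall, so a kernel whose neighbouring threads read neighbouring addresses is usually the first thing to check.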

This is general advice, and it is hard to know what exactly should be done.
Besides that, I do not see a relation between my initial question about block size and warp optimization.
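One possible link between the two: block size limits how many warps can be resident on a multiprocessor, and resident warps are what the scheduler switches between to cover these stalls. The sketch below is a back-of-the-envelope model in plain C++, not a measurement. It assumes every warp issues one instruction per "cycles between issuing two instructions" (246.3 in the warning above) and that a scheduler issues at most one instruction per cycle. The function name `issue_rate_per_scheduler` is hypothetical.

```cpp
#include <cassert>

// Estimated instructions issued per cycle by one warp scheduler.
// ASSUMPTIONS (simplified model): each of the `active_warps` assigned to the
// scheduler issues one instruction every `cycles_between_issues` cycles, and
// the scheduler can issue at most one instruction per cycle.
double issue_rate_per_scheduler(double active_warps,
                                double cycles_between_issues) {
    assert(active_warps >= 0.0 && cycles_between_issues > 0.0);
    double rate = active_warps / cycles_between_issues;
    return rate < 1.0 ? rate : 1.0;  // cap at one issue per cycle
}
```

Assuming 16 warps per scheduler (64 resident warps spread over 4 schedulers, the maximum for compute capability 6.1, which should be checked for the GPU at hand), `issue_rate_per_scheduler(16, 246.3)` comes out at roughly 0.065 instructions per cycle, and with only 8 warps it halves. If the 185.8 stall cycles disappeared, leaving 60.5 cycles between issues, 16 warps would reach about 0.26. Under this model both levers matter: more resident warps and shorter memory stalls.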

“Objection, Your Honor! Calls for speculation!”

Given that no details were supplied about the application, it is nonsensical to expect specific advice.

Finer granularity can lead to higher utilization of machine resources, which in turn provides better performance. A good starting point when designing a kernel is an initial block size that is a multiple of 32 and between 128 and 256 threads; then adjust this up or down based on the specific requirements of the algorithm and profiler feedback.
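As a rough illustration of the granularity argument, the following plain C++ sketch estimates how many threads can be resident on one multiprocessor for a given block size and register count per thread. The per-SM limits are the ones documented for compute capability 6.1 (the generation the MX150 belongs to) and should be checked for any other GPU. The names `SmLimits` and `resident_threads_per_sm` are made up here, and the model ignores shared memory and register allocation granularity, so it gives an upper bound rather than what the profiler would report.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. ASSUMPTION: values for compute capability 6.1; look up the
// corresponding numbers for your own GPU.
struct SmLimits {
    int max_threads = 2048;     // resident threads per SM
    int max_blocks = 32;        // resident blocks per SM
    int max_registers = 65536;  // 32-bit registers per SM
};

// Simplified model: number of threads that can be resident on one SM.
// Ignores shared memory and register allocation granularity, so the real
// figure can only be equal or lower.
int resident_threads_per_sm(int block_size, int regs_per_thread,
                            const SmLimits& sm = SmLimits{}) {
    assert(block_size > 0 && regs_per_thread > 0);
    int by_threads = sm.max_threads / block_size;
    int by_regs = sm.max_registers / (block_size * regs_per_thread);
    // Whole blocks only: the tightest limit decides how many blocks fit.
    int blocks = std::min({by_threads, by_regs, sm.max_blocks});
    return blocks * block_size;
}
```

For a hypothetical kernel using 40 registers per thread, this gives 1024 resident threads for a block size of 32 (limited by the 32-block cap), 1536 for a block size of 256, and 1024 again for a block size of 1024 (only one such block fits in the register file). Large blocks are placed in coarse units, so a single block that does not fit leaves a large share of the multiprocessor idle.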

Generally speaking, extreme thread block sizes (very small or very large) are rarely the best choice from a performance perspective.

What I meant in my previous post was not that your advice was generic, but that the output of the Nsight Compute analysis was. Nsight Compute only says something general about memory optimization, without pointing to a specific location in the code.

Does anyone know what “scoreboard dependency on a L1TEX (local, global, surface, texture) operation” means?
The “stall long scoreboard” attribute has a value of 120.