I learned that the execution order of CUDA blocks is undefined. But I read a post about optimizing L2 cache usage when computing GEMM.
I wonder how this can be achieved if the execution order of CUDA blocks is undefined?
Even though there is no guarantee, one can perhaps assume that blocks start in roughly ascending index order. Better cache efficiency only improves performance; if the actual order turns out to be different, the program still gives the correct result.
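To make this concrete, here is a minimal sketch of one common way to exploit a roughly ascending order: remap the linear block index so that consecutive blocks work on neighbouring output tiles and can reuse the same input tiles from L2. I don't know whether the post you read does exactly this; the helper `grouped_tile`, the kernel `mark_tiles`, and the sizes are made up for illustration. The kernel only counts visits, so the check shows that the remapping covers every tile exactly once regardless of the order in which blocks actually run.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct Tile { int m; int n; };

// Hypothetical helper: map a linear block ID to an output tile (m, n).
// Tiles are visited in bands of `group_m` tile rows; inside a band the
// IDs walk down a column first, then move on to the next column.
__host__ __device__ inline Tile grouped_tile(int pid, int tiles_m, int tiles_n, int group_m) {
    int per_group = group_m * tiles_n;           // tiles in one full band
    int first_m   = (pid / per_group) * group_m; // first tile row of this band
    int remaining = tiles_m - first_m;
    int rows      = remaining < group_m ? remaining : group_m; // last band may be shorter
    int local     = pid % per_group;             // position inside the band
    return Tile{ first_m + local % rows, local / rows };
}

__global__ void mark_tiles(int* visits, int tiles_m, int tiles_n, int group_m) {
    Tile t = grouped_tile(blockIdx.x, tiles_m, tiles_n, group_m);
    // A real GEMM kernel would compute output tile (t.m, t.n) here.
    if (threadIdx.x == 0) atomicAdd(&visits[t.m * tiles_n + t.n], 1);
}

int main() {
    const int tiles_m = 5, tiles_n = 4, group_m = 2; // assumed example sizes
    const int total = tiles_m * tiles_n;

    int* d_visits = nullptr;
    cudaMalloc(&d_visits, total * sizeof(int));
    cudaMemset(d_visits, 0, total * sizeof(int));

    mark_tiles<<<total, 32>>>(d_visits, tiles_m, tiles_n, group_m);

    std::vector<int> visits(total);
    cudaMemcpy(visits.data(), d_visits, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_visits);

    // Correctness does not depend on execution order: each tile is hit once.
    bool ok = true;
    for (int v : visits) ok = ok && (v == 1);
    printf("%s\n", ok ? "every tile visited exactly once" : "mapping is broken");
    return ok ? 0 : 1;
}
```

If the hardware happens to schedule blocks in a different order, only the cache hit rate changes, not the result.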
Another possibility is to take the block position not from the block index but to determine it at runtime, e.g. by incrementing a global atomic counter once per block, so that the numbers are assigned in the order in which the blocks actually start running.
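A minimal sketch of the atomic-counter idea, assuming a one-dimensional grid: the names `next_ticket` and `ticketed_kernel` are hypothetical. One thread per block draws a ticket with `atomicAdd` and shares it with the rest of the block through shared memory; the block then uses the ticket where it would otherwise use `blockIdx.x`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical global counter; must be reset to 0 before every launch.
__device__ unsigned int next_ticket = 0;

__global__ void ticketed_kernel(unsigned int* ticket_of_block) {
    __shared__ unsigned int ticket;
    if (threadIdx.x == 0) {
        // Tickets are handed out in the order blocks reach this line.
        ticket = atomicAdd(&next_ticket, 1u);
    }
    __syncthreads(); // make the ticket visible to all threads of the block

    // A real kernel would derive its tile position from `ticket` here
    // instead of from blockIdx.x. This sketch only records it.
    if (threadIdx.x == 0) ticket_of_block[blockIdx.x] = ticket;
}

int main() {
    const int blocks = 64, threads = 128; // assumed example sizes

    unsigned int zero = 0;
    cudaMemcpyToSymbol(next_ticket, &zero, sizeof(zero));

    unsigned int* d_tickets = nullptr;
    cudaMalloc(&d_tickets, blocks * sizeof(unsigned int));

    ticketed_kernel<<<blocks, threads>>>(d_tickets);

    std::vector<unsigned int> tickets(blocks);
    cudaMemcpy(tickets.data(), d_tickets, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_tickets);

    // Every ticket in [0, blocks) must have been handed out exactly once.
    std::vector<int> seen(blocks, 0);
    bool ok = true;
    for (unsigned int t : tickets) {
        if (t >= static_cast<unsigned int>(blocks)) ok = false;
        else seen[t]++;
    }
    for (int s : seen) ok = ok && (s == 1);
    printf("%s\n", ok ? "tickets form a permutation of the block IDs"
                      : "ticket assignment is broken");
    return ok ? 0 : 1;
}
```

The cost is one atomic operation and one `__syncthreads()` per block, which is usually small compared to the work of a GEMM tile.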
Thank you! I think I understand it.