warp scheduling

Hi, I have a kernel with just one block of two warps.
One warp does some computation with no global memory access. The other does nothing but spin in a while loop.
According to the programming guide, the two warps are switched on a multiprocessor (MP) with zero context-switch overhead. This means that the second, spinning warp should compete with warp 1 for the MP's cycles. If I add a dummy global memory read to the while loop of warp 2, that warp should get fewer chances to run on the MP, because most of the time it is waiting for global memory.
But this is not the case. I use clock() to measure the total execution time of the kernel, and I get exactly the same cycle count with or without the global memory read in warp 2's loop.
Why?

Maybe the nvcc compiler removed some operations that have no effect.

Definitely the compiler optimized away all of the dead code. If a computed value is not written to global memory, the compiler will aggressively remove all code needed to calculate that value, even entire loops.
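To make the point above concrete, here is a minimal sketch of two kernels that run the same loop, where only one stores its result. The kernel names, loop body, and sizes are made up for illustration. It uses cudaDeviceSynchronize(), the current replacement for cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the result is never stored, so the compiler is free
// to remove the whole loop as dead code.
__global__ void dead_loop(int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += i * 0.5f;
    // acc is dropped here: no observable side effect.
}

// Same loop, but the result is written to global memory, so the value
// has to be produced.
__global__ void live_loop(int n, float *out)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += i * 0.5f;
    out[threadIdx.x] = acc;
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));

    dead_loop<<<1, 32>>>(1000);
    live_loop<<<1, 32>>>(1000, d_out);

    // Wait for both kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Compiling this with nvcc -ptx and comparing the bodies of the two kernels should show whether the loop survived in each case.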

Thanks!

But I tried inserting a global memory read and write, like this:

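[codebox]
// File scope: dummy buffers in global memory, volatile so accesses are not cached in registers.
__device__ volatile int dummy_data[32];
__device__ volatile int dummy_res[32];

// Inside the kernel (warp 2): spin until the flag reaches v,
// doing one global read and one global write per iteration.
int dummy;
while (*flag != v) {
    dummy = dummy_data[threadIdx.x % warpSize];
    dummy_res[threadIdx.x % warpSize] = dummy;
}
[/codebox]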

But the timing shows no difference with or without the global memory access in the loop. Do you have any idea why? Will nvcc still delete this code in the loop?

No, it won’t delete your global memory reads and writes. If you want to verify, you can compile with the -ptx option and examine the assembly code.

What is the purpose of your *flag dereference, though? Unless that pointer is declared volatile, the compiler is very likely to just read in flag once and execute the while loop based on the *flag value stored in a register. It is really hard to help without seeing your whole code.

Other things to think about: Which clock() are you calling?

The one on the GPU? Then you won’t learn anything interesting. The clock is per multiprocessor. See the clock example in the SDK for full documentation.

Or are you calling clock() on the CPU? It has a ridiculously poor precision. If your kernel executes in a short time, you might be at the limit of that resolution. Or you may not be calling cudaThreadSynchronize() to measure the actual kernel run time. Are you checking for errors after the kernel call? Maybe it is not even launching.
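As a rough sketch of the host-side checks mentioned above, the following times a kernel with CUDA events instead of the CPU clock() and checks for errors after the launch. The kernel placeholder_kernel is a made-up stand-in, not the code from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void placeholder_kernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 64 * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events on the default stream around the launch.
    cudaEventRecord(start);
    placeholder_kernel<<<1, 64>>>(d_out);
    cudaEventRecord(stop);

    // A launch failure shows up here; an execution failure shows up at the sync.
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaEventSynchronize(stop);
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
        printf("launch: %s, sync: %s\n",
               cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Event timing is measured on the GPU, so it does not depend on the resolution of the host clock, and the synchronize call ensures the kernel has actually finished before the time is read.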

Yeah, I do declare *flag as volatile, as in __shared__ volatile int *flag, to synchronize the two warps. Until the first warp has finished its run, the second warp does not leave the while loop. And I use clock() in the kernel (executed on the GPU) to see the end-to-end execution cycles of the block.

My understanding is that the end-to-end execution cycles of the block should differ depending on whether the second warp has the global memory access (as in the code above) or not. Without the global memory access, the second warp is always ready for its next instruction fetch, even though it never leaves the while loop. With the global memory access in the loop, the second warp should be ready less often, since it is waiting out the global memory latency.

But I do not see this difference when using clock() to measure the cycles of the whole block. Is anything wrong with my code or my reasoning?
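For reference, the experiment described in this thread could be set up roughly as follows. This is only a sketch under assumptions: the kernel name two_warps, the arithmetic in warp 0, and the loop count are invented, the flag is a plain shared variable rather than a pointer, and clock64() (available on compute capability 2.0 and later) is used in place of clock() to avoid wrap-around. Whether the two printed cycle counts differ depends on the hardware and its scheduler; the code makes no claim either way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ volatile int dummy_data[32];
__device__ volatile int dummy_res[32];

// One block, two warps. Warp 0 computes; warp 1 spins until warp 0 sets the flag.
__global__ void two_warps(bool with_global_access, float *out, long long *cycles)
{
    __shared__ volatile int flag;
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;

    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    if (warp == 0) {
        long long t0 = clock64();
        float acc = 1.0f;
        for (int i = 0; i < 100000; ++i)   // arbitrary compute-only work
            acc = acc * 1.0001f + 0.5f;
        out[lane] = acc;                   // store so the loop is not dead code
        long long t1 = clock64();
        if (lane == 0) {
            *cycles = t1 - t0;             // cycles spent by warp 0
            flag = 1;                      // release warp 1
        }
    } else {
        while (flag == 0) {
            if (with_global_access) {
                int dummy = dummy_data[lane];  // global read
                dummy_res[lane] = dummy;       // global write
            }
        }
    }
}

int main()
{
    float *d_out = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    for (int variant = 0; variant < 2; ++variant) {
        two_warps<<<1, 64>>>(variant == 1, d_out, d_cycles);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            printf("error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        long long h_cycles = 0;
        cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
        printf("global access in spin loop: %s, warp 0 cycles: %lld\n",
               variant == 1 ? "yes" : "no", h_cycles);
    }

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

Timing only warp 0 between its own two clock64() reads keeps the measurement on a single multiprocessor and isolates the quantity the question is about: how much the spinning warp slows the computing warp.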