warp scheduling

fji · August 6, 2009, 6:34am

Hi, I have a kernel with just one block of two warps.
One warp does some computation with no global memory access. The other does nothing but spin in a while loop.
According to the programming guide, the two warps are switched on the multiprocessor (MP) with zero context-switch overhead. This means the second, spinning warp should compete with warp 1 for the MP’s cycles. If I add a dummy global memory read to the while loop of warp 2, that warp should get fewer chances to run on the MP, because most of the time it is waiting on global memory.
But this is not the case. I use clock() to measure the total execution time of the kernel, and I get exactly the same cycle count with or without the global memory read in warp 2’s loop.
Why?

iceberg · August 6, 2009, 6:42am

Maybe the nvcc compiler removed some operations that have no effect.

MisterAnderson42 · August 6, 2009, 1:15pm

Definitely the compiler optimized away all of the dead code. If a computed value is not written to global memory, the compiler will aggressively remove all code needed to calculate that value, even entire loops.
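
A minimal sketch of this effect, assuming two hypothetical kernels (the names discarded and kept and the loop bodies are illustrative, not taken from the original code). The first never stores its result, so the compiler is free to remove the loop; the second writes the result to global memory, which keeps the computation alive.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: x is never stored anywhere, so the whole loop is
// dead code and the compiler may remove it.
__global__ void discarded(int n)
{
    float x = 0.0f;
    for (int i = 0; i < n; ++i)
        x = x * 1.0001f + 1.0f;
}

// Hypothetical kernel: the store to global memory makes x observable,
// so the loop that computes it has to be kept.
__global__ void kept(float *out, int n)
{
    float x = 0.0f;
    for (int i = 0; i < n; ++i)
        x = x * 1.0001f + 1.0f;
    out[threadIdx.x] = x;
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));

    discarded<<<1, 32>>>(1000);
    kept<<<1, 32>>>(d_out, 1000);

    // Wait for both kernels and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Compiling this file with nvcc -ptx and comparing the two kernel bodies is one way to see whether the loop survived in each case.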

fji · August 7, 2009, 2:11am

Thanks!

But I tried to insert some global mem read and write like this:

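[codebox]
// File scope: dummy buffers in global memory.
__device__ volatile int dummy_data[32];
__device__ volatile int dummy_res[32];

// Inside the kernel, in the second warp:
int dummy;
while (*flag != v) {
    dummy = dummy_data[threadIdx.x % warpSize];
    dummy_res[threadIdx.x % warpSize] = dummy;
}
[/codebox]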

But the timing is the same with or without the global memory accesses in the loop. Do you have any idea why? Will nvcc still delete this code from the loop?

MisterAnderson42 · August 7, 2009, 11:02am

No, it won’t delete your global memory reads and writes. If you want to verify, you can compile with the -ptx option and examine the assembly code.

What is the purpose of your *flag dereference, though? Unless that pointer is declared volatile, the compiler is very likely to just read in flag once and execute the while loop based on the *flag value stored in a register. It is really hard to help without seeing your whole code.

Other things to think about: Which clock() are you calling?

The one on the GPU? Then you won’t learn anything interesting. The clock is per multiprocessor. See the clock example in the SDK for full documentation.

Or are you calling clock() on the CPU? It has a ridiculously poor precision. If your kernel executes in a short time, you might be at the limit of that resolution. Or you may not be calling cudaThreadSynchronize() to measure the actual kernel run time. Are you checking for errors after the kernel call? Maybe it is not even launching.
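
A minimal host-side sketch of the checks above. It swaps the CPU clock() for CUDA events, which is one common way around the resolution problem, and it waits with cudaEventSynchronize() rather than cudaThreadSynchronize(). The kernel named work is a hypothetical stand-in for the kernel being measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one being measured.
__global__ void work(float *out)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < 100000; ++i)
        x = x * 1.0001f + 0.5f;
    out[threadIdx.x] = x;  // store keeps the loop from being optimized away
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, 64 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work<<<1, 64>>>(d_out);
    cudaEventRecord(stop);

    // Launch errors (for example a bad configuration) are reported here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Block until the kernel has finished; execution errors are reported here.
    err = cudaEventSynchronize(stop);
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```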

fji · August 7, 2009, 5:51pm

Yeah, I do declare flag as volatile, as __shared__ volatile int *flag, to synchronize the two warps: the second warp does not leave the while loop until the first warp has finished. And I call clock() inside the kernel (on the GPU) to measure the end-to-end execution cycles of the block.

My understanding is that the end-to-end cycle count of the block should differ depending on whether the second warp accesses global memory (as in the code above) or not. Without the global memory access, the second warp is always ready for its next instruction fetch, even though it never leaves the while loop. With the global memory access in the loop, the second warp should be ready less often, since it spends its time waiting out the global memory latency.

But I do not see this difference when I use clock() to measure the cycles of the whole block. Is anything wrong with my code or my reasoning?
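
A self-contained sketch of the experiment described above, so the two variants can be compared side by side. Everything not shown earlier in the thread is an assumption: the kernel name two_warps, the template flag kTouchGlobal, the arithmetic loop in warp 0, and the use of a plain shared variable for flag instead of a pointer. Warp 0 does one global store at the end only so that its loop is not removed as dead code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ volatile int dummy_data[32];
__device__ volatile int dummy_res[32];

// Launch with one block of 64 threads, i.e. two warps.
template <bool kTouchGlobal>
__global__ void two_warps(float *out, unsigned int *cycles)
{
    __shared__ volatile int flag;
    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    unsigned int start = (unsigned int)clock();

    if (threadIdx.x < warpSize) {
        // Warp 0: arithmetic only, then signal completion.
        float x = (float)threadIdx.x;
        for (int i = 0; i < 100000; ++i)
            x = x * 1.0001f + 0.5f;
        out[threadIdx.x] = x;  // keeps the loop from being optimized away
        if (threadIdx.x == 0) flag = 1;
    } else {
        // Warp 1: spin until warp 0 is done, optionally touching global memory.
        while (flag != 1) {
            if (kTouchGlobal) {
                int lane = threadIdx.x % warpSize;
                dummy_res[lane] = dummy_data[lane];
            }
        }
    }

    __syncthreads();
    // Unsigned subtraction stays correct if the 32-bit counter wraps once.
    if (threadIdx.x == 0) *cycles = (unsigned int)clock() - start;
}

int main()
{
    float *d_out = nullptr;
    unsigned int *d_cycles = nullptr;
    unsigned int h_cycles = 0;
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, sizeof(unsigned int));

    two_warps<false><<<1, 64>>>(d_out, d_cycles);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    printf("spin only:          %u cycles\n", h_cycles);

    two_warps<true><<<1, 64>>>(d_out, d_cycles);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    printf("spin + global mem:  %u cycles\n", h_cycles);

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```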
