I followed the documentation, walk-through examples bitreverse;
After hitting the breakpoint, I switch to warp3;
Then type the continue command, and when it stops again, it automatically switches to warp0;
Is thisinconsistent with the description of the document?
Why do I only see the same thread hit the breakpoint every time; other warp/threads that hit the breakpoint are not displayed?
Even if I switch to warp 3; it doesn’t mean that the warp will hit the breakpoint, right? Switching focus is useless, only displayed the warp which hits the breakpoint first and ignore the warp/thread that hits the breakpoint later;
When warp0 hits the breakpoint, some warps are before the breakpoint and some are after the breakpoint,
Is my understanding correct?
My question is will each warp stop when it hits a breakpoint? Like the thread of the cpu; or switch to a certain warp, only this warp can hit the breakpoint;
When a warp hits a breakpoint and stops, do all warps hit the breakpoint?
Thank you for the logs, now I can provide must better explanation. First of all, please note that execution is happening on the warp level (so all threads of the warp execute the same instruction/hit the breakpoint).
When running the bitreverse sample warps don’t diverge (so all warps are executing the same instructions), **which is not always the case **.
When warp 0 hits the first breakpoint, all warps hit it at the same time (note the PC addresses of the active warps). In such case debugger only reports single BP hit (for an arbitrary warp).
My question is will each warp stop when it hits a breakpoint? Like the thread of the cpu; or switch to a certain warp, only this warp can hit the breakpoint;
When BP is hit by a warp (or multiple warps if multiple warps arrive to it at the same time) all warps on GPU are stopped (docs link: CUDA-GDB)
When a breakpoint is set, it forces all resident GPU threads to stop at this location when it reaches the corresponding PC.
Focus doesn’t affect GPU side execution/breakpoints. It’s only used to specify current warp for debugger operations.
When a warp hits a breakpoint and stops, do all warps hit the breakpoint?
All warps, which are executing the same instructions as the warp, which hit the breakpoint. E.g. if due to some reason there is a divergence and warps 0-3 diverge from warps 4-7 the following might happen (one possible scenario):
Warps 0-3 hit the breakpoint 1 - debugger reports breakpoint for (e.g.) warp 0
Warps 4-7 hit the breakpoint 1 - debugger reports breakpoint for (e.g.) warp 4
Warps 0-3 hit the breakpoint 2 - debugger reports breakpoint for (e.g.) warp 0
Warps 4-7 hit the breakpoint 2 - debugger reports breakpoint for (e.g.) warp 4
Warps 0-3 hit the breakpoint 3 - debugger reports breakpoint for (e.g.) warp 0
Warps 4-7 hit the breakpoint 3 - debugger reports breakpoint for (e.g.) warp 4
That is to say, in any scenario, if there is no divergence , if the GDB reports that a certain warp hits a breakpoint,does it mean that all warps/threads have hit this BP;
Unlike CPU threads, each thread (GPU warps) hits the same breakpointonce;
I type the “where” command and see the result is that all warps run to the same line of code;
Yes, if there is no divergence (so all warps are executing the same instruction), all warps would hit the BP at the same time, but the debugger would only report it once.
It can be confirmed by checking the warp status (e.g. cuda info warps) to check that all warps have the same PC (program counter).
If you want to observe different behavior you can also introduce divergence manually by advancing a warp via step command (note that when you issue step command it only applies to the current warp):
Start as usual
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "a.out" hit Breakpoint 1, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x7fffc1200000) at bitreverse.cu:14
14 array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
(cuda-gdb) info cuda warps
Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel BlockIdx First Active ThreadIdx
Device 0 SM 0
* 0 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (0,0,0)
1 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (32,0,0)
2 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (64,0,0)
3 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (96,0,0)
4 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (128,0,0)
5 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (160,0,0)
6 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (192,0,0)
7 0xffffffff 0x00000000 0x00000000000002e0 0 (0,0,0) (224,0,0)
(cuda-gdb)
When warps hit the first breakpoint, advance the warp 0 to the 3rd breakpoint (note that PC has not changed for other warps):