Warp switching does anybody understands the mechanism

serge · March 24, 2008, 7:36am

As can be understood from CUDA decsription, when several warps are executed on one multiprocessor together, they are usually “switched” one by one. I mean: while one warp is waiting for global memory read, instruction of another warp can be executed in the same time. They call it “hiding of global memory latency”. The similar thing s hiding “read-after-write dependencies latency”.

I hope the previous statements are right. Please correct me if I do mistake.

My question is: does anybody know more details about this mechanism?
i.e.
What does time spent on switching depends from?
Does it matters if warps are from one block or from different blocks?
Something else? I would appreciate any related information.

Thanks in advance.

Sarnath · March 24, 2008, 8:02am

To answer your question above:

I usually use 32 threads per block and multiple blocks per multiprocessor and still get excellent performance. So, my personal opinion is that it does NOT matter which block owns a warp. With a WARP size as my block size, I just dont required __syncthreads() at all. The performance usually rocks because with 32 threads (1 WARP) – there is NO chance of race conditions at all. It makes my program logic simple too.

Ideally – you should query the warp size from the CUDA driver and then use it as your block size. Your portability goes for a toss if you hardcode 32. Thanks for Mark Harris for this tip.

I have no idea on the time spent on switching. In my understanding – it is just resetting the “blockIdx.x,.y,.x and threadIdx.x,.y.,.z” to the values determined by the WARP scheduler and then execute the instruction from the instruction pointer of that WARP. So, WARP scheduling must take very very minimal time.

Best Regards,

Sarnath

serge · March 24, 2008, 8:48am

Well, the “context switching” should take some time?

Oh, I forgot that we have independent registers and shared mem for each warp, so there is no context at all, except “instruction pointer”.

serge · March 24, 2008, 8:57am

About this. When you use 64 threads per block instead 32 it increases occupancy in 2 times (in case it is limited only by reg allocation, not by shared mem). Therefore max numbers of warps which can be ran simultaneously on MP is increased by 2 times. This can increase your performance in 2 times if there are global memory latency or read-after-write latency in your kernel.

(More details in my last post in this topic http://forums.nvidia.com/index.php?showtopic=49707&st=20).

Sarnath · March 24, 2008, 9:17am

Occupany increase does NOT necessarily mean increase in performance…

I work for financial algorithms. By using 32 threads – I eliminate double-buffering and thus reduce shared memory. I generally make sure that I run 6 blocks per multiprocessor. Thats good enough for me.

Increasing to 64 threads would mean – doubling shared memory for my kind of applications – coz, i need to handle race conditions – I use double-buffering to eliminate races (input buffer, result buffer – two buffers).

So, its totally upto the application. If you application is such that each thread computes independently then what you say is right.

BUT

Note that merely running multiple number of threads per multi-processor does NOT mean you are going to get performance. Because the underlying hardware can do only one-warp parallely. The momment you hide all your latencies completley then every extra warp you run on your MP contributes to nothing…

For my apps, I found that 6 WARPs was good enuf to hide global memory as well as register-hazard latencies… So, I stuck with 32 threads… It all depends on what suits your app.

serge · March 24, 2008, 10:10am

For sure. That is that I meant. Sorry for misunderstanding.

Surely.

However, according to my personal experience, what concerns to:

The moment when you reach the warp per MP limit (occupancy*24 per MP) becomes earlier/

Well, thank you for answers :)

MisterAnderson42 · March 24, 2008, 1:04pm

Right. The hardware switches from one warp to another on every single clock cycle with no overhead.

About the whole occupancy argument, the answer can be summed up: Your Milage May Vary_.

There are well documented cases (Saranth) where a block size of 32 is optimal. There are other well documented cases (my own included) where 64, 128, 192, 256, 512, and others in between have been optimal block sizes.

The ABSOLUTELY ONLY WAY you can know what is optimal for your algorithm is to BENCHMARK IT. I cannot stress this enough as parts of my code increase their running time by 100% if you get too far from the optimal block size. Don’t assume that “because someone said so on the forums” that any particular size is the correct one.

serge · March 24, 2008, 2:48pm

Well, probably I’m looking the wrong way.

I am just trying to understand behaviour one of my kernels (see image attached):

IMU almost constant timing is caused by latency hiding. And “jumps” occur when 24*occupancy limit is reached.

I am confused with the difference between the first “step” and the others.
First step seems to represent a linear dependce of working time from warps per MP. (Which seems to be logical: when we increase warps per MP, we win time hiding latency, but the whole time should increase a bit (even if the latency can hide almost whole time of additional warp)).
And the other steps seem to be constant. However, I expected them to have the same linear nature as the first one has.

Sarnath · March 25, 2008, 7:40am

Your Processing Time seems to increase with Increasing the number of warps…

Does it mean that your input-size is also increasing? Can you give details about shared mem usage per block and number of threads per block, Number of registers per thread?

What is your hardware? GTX?

This will help us understand the graph better.

Sarnath · March 25, 2008, 7:41am

Oops. I added reply instead of editing. Ignore this dummy post.

serge · March 25, 2008, 10:40am

Does it mean that your input-size is also increasing?

Yes. Proportionally. I’ve forgot to mention it.

Each thread makes the constant amount of work.

Can you give details …

lmem = 0

	smem = 36

	reg = 28

Number of thread per block is 32. => occupancy is 0.17 and first jump occurs after warps per MP is more than 0.17*24 = 4.

I am using GeForce 8800 GTX, Intel Core 2 Duo 2.33

Sarnath · March 25, 2008, 12:24pm

Note that the number of active blocks in a multi-processor is 8.

And, your smem,thread configuration allows 8 blocks to occupy the MP simultaneously.

So 8/24 warps is your occupancy = 0.3333. I dont understand why you say it is 0.17. Am I missing sthg here?

Since you have 16MPs, 816 = 128 blocks can simultaneously parallely run.
Lets say for your input 128 blocks take “k” seconds
then
1282 blocks would take nearly “2*k” seconds. Because you are limited by a maximum parallel occupation of 128 blocks at any given time.

So, you will have jump for every 128 blocks on your system.
This is equivalent to 8 blocks Per MP == 8 Warps per MP.

That is why you see the JUMP after every 8 Warps per MP.

Also note that:
The dots in your graph does NOT correspond to the X-Axis points well. They look to be taken for “floating point” number of warps per MP – which is obviously wrong… Re-graph it. :-)

serge · March 25, 2008, 4:39pm

And, your smem,thread configuration allows 8 blocks to occupy the MP

simultaneously.

Why? Type given values into CUDA occupancy caclulator and you get 0.17.* And profiler after kernel runs gives the same value.

That is why you see the JUMP after every 8 Warps per MP

After 4 :)

Re-graph it. :-)

Graph is right. I probably should have given better description. See below.

I tested configurations with 8, 16, 24 32, 40, 48, 56, 64 etc blocks. (You remember that in my situation block == warp).

Of course having 24 blocks is stupid from the viewpoint of performance since we have 1 block per MP for “first” 8 MP and 2 blocks per MP for “other” 8 MP. On this graph this situation corresponds to 24 blocks /16 MP = 24 warps / 16 MP = 1.5. :) Of course I uderstand that is not correct to say that we have 1.5 warps per MP. But underlying meaning is obvious - we have 24 warps per 16 MP. And since CUDA allows to test such configuration - I tested it.

*Probably you have not CUDA occ. calc. right now - so I’ve attached the image.

Sarnath · March 26, 2008, 8:46am

INteresting thing on CUDA occupancy…

2832 = 896 – This was the registers that I was expecting for a block
However the XLS sheet shows 1792 == 2896 regs per block. Hence the occupancy of 0.17…

Now, I have read something on this somewhere… – about the 2x factor.

Can some1 kindly explain this pleasssse???

serge · March 26, 2008, 9:54am

Now, I have read something on this somewhere… – about the
2x factor.
I’ve mentioned it here:
http://forums.nvidia.com/index.php?showtopic=49707&st=20

BTW, I said in this topic that 64 blocks per warp is better than 32 just because of this x2 factor.

Can some1 kindly explain this pleasssse???
Well its up to nVidia-people.

Sarnath · March 26, 2008, 10:37am

Aaah… I have been caught unaware… i have been running 32 threads per block and I was just assuming that 7 or 8 blocks run all the time… Ooops… If I dont have 6 blocks – register latency starts hitting me anyway…

Can some1 from nvidia tell sthg about this?

btw, Is there a way of reducing “register” count… ?? Like a compiler option that would optimize for number of registers…

BonsaiScott · March 28, 2008, 1:40am

Pg 16 of nvcc_1.1.pdf, compiler option

-maxrregcount nnn

Topic		Replies	Views
Warp Size Question CUDA Programming and Performance	21	14271	June 18, 2010
Waiting for global memory access. CUDA Programming and Performance	32	56635	January 31, 2008
A question about warps and threadblock CUDA Programming and Performance	5	938	April 14, 2013
Whats a WARP for? CUDA Programming and Performance	8	6633	June 21, 2007
Deep understanding how block is actually processed in MP CUDA Programming and Performance	28	30467	December 15, 2010
Scalability evaluation CUDA Programming and Performance	12	4130	May 30, 2009
How to understand the "hide latency" CUDA Programming and Performance	13	4292	August 8, 2024
Global Memoy latencies and NVIDIA cards Latency CUDA Programming and Performance	15	8973	January 11, 2008
Block size and occupancy CUDA Programming and Performance	12	656	January 2, 2025
Question regarding warp efficiency... CUDA Programming and Performance	9	15225	March 13, 2007

Warp switching does anybody understands the mechanism

Related topics