Warp switching does anybody understands the mechanism

As can be understood from CUDA decsription, when several warps are executed on one multiprocessor together, they are usually “switched” one by one. I mean: while one warp is waiting for global memory read, instruction of another warp can be executed in the same time. They call it “hiding of global memory latency”. The similar thing s hiding “read-after-write dependencies latency”.

I hope the previous statements are right. Please correct me if I do mistake.

My question is: does anybody know more details about this mechanism?
i.e.
What does time spent on switching depends from?
Does it matters if warps are from one block or from different blocks?
Something else? I would appreciate any related information.

Thanks in advance.

To answer your question above:

I usually use 32 threads per block and multiple blocks per multiprocessor and still get excellent performance. So, my personal opinion is that it does NOT matter which block owns a warp. With a WARP size as my block size, I just dont required __syncthreads() at all. The performance usually rocks because with 32 threads (1 WARP) – there is NO chance of race conditions at all. It makes my program logic simple too.

Ideally – you should query the warp size from the CUDA driver and then use it as your block size. Your portability goes for a toss if you hardcode 32. Thanks for Mark Harris for this tip.

I have no idea on the time spent on switching. In my understanding – it is just resetting the “blockIdx.x,.y,.x and threadIdx.x,.y.,.z” to the values determined by the WARP scheduler and then execute the instruction from the instruction pointer of that WARP. So, WARP scheduling must take very very minimal time.

Best Regards,

Sarnath

Well, the “context switching” should take some time?

Oh, I forgot that we have independent registers and shared mem for each warp, so there is no context at all, except “instruction pointer”.

About this. When you use 64 threads per block instead 32 it increases occupancy in 2 times (in case it is limited only by reg allocation, not by shared mem). Therefore max numbers of warps which can be ran simultaneously on MP is increased by 2 times. This can increase your performance in 2 times if there are global memory latency or read-after-write latency in your kernel.

(More details in my last post in this topic http://forums.nvidia.com/index.php?showtopic=49707&st=20).

Occupany increase does NOT necessarily mean increase in performance…

I work for financial algorithms. By using 32 threads – I eliminate double-buffering and thus reduce shared memory. I generally make sure that I run 6 blocks per multiprocessor. Thats good enough for me.

Increasing to 64 threads would mean – doubling shared memory for my kind of applications – coz, i need to handle race conditions – I use double-buffering to eliminate races (input buffer, result buffer – two buffers).

So, its totally upto the application. If you application is such that each thread computes independently then what you say is right.

BUT

Note that merely running multiple number of threads per multi-processor does NOT mean you are going to get performance. Because the underlying hardware can do only one-warp parallely. The momment you hide all your latencies completley then every extra warp you run on your MP contributes to nothing…

For my apps, I found that 6 WARPs was good enuf to hide global memory as well as register-hazard latencies… So, I stuck with 32 threads… It all depends on what suits your app.

For sure. That is that I meant. Sorry for misunderstanding.

Surely.

However, according to my personal experience, what concerns to:

The moment when you reach the warp per MP limit (occupancy*24 per MP) becomes earlier/

Well, thank you for answers :)

Right. The hardware switches from one warp to another on every single clock cycle with no overhead.

About the whole occupancy argument, the answer can be summed up: Your Milage May Vary_.

There are well documented cases (Saranth) where a block size of 32 is optimal. There are other well documented cases (my own included) where 64, 128, 192, 256, 512, and others in between have been optimal block sizes.

The ABSOLUTELY ONLY WAY you can know what is optimal for your algorithm is to BENCHMARK IT. I cannot stress this enough as parts of my code increase their running time by 100% if you get too far from the optimal block size. Don’t assume that “because someone said so on the forums” that any particular size is the correct one.

Well, probably I’m looking the wrong way.

I am just trying to understand behaviour one of my kernels (see image attached):

IMU almost constant timing is caused by latency hiding. And “jumps” occur when 24*occupancy limit is reached.

I am confused with the difference between the first “step” and the others.
First step seems to represent a linear dependce of working time from warps per MP. (Which seems to be logical: when we increase warps per MP, we win time hiding latency, but the whole time should increase a bit (even if the latency can hide almost whole time of additional warp)).
And the other steps seem to be constant. However, I expected them to have the same linear nature as the first one has.
11.JPG

Your Processing Time seems to increase with Increasing the number of warps…

Does it mean that your input-size is also increasing? Can you give details about shared mem usage per block and number of threads per block, Number of registers per thread?

What is your hardware? GTX?

This will help us understand the graph better.

Oops. I added reply instead of editing. Ignore this dummy post.

Does it mean that your input-size is also increasing?

Yes. Proportionally. I’ve forgot to mention it.

Each thread makes the constant amount of work.

Can you give details …

lmem = 0

	smem = 36

	reg = 28

Number of thread per block is 32. => occupancy is 0.17 and first jump occurs after warps per MP is more than 0.17*24 = 4.

I am using GeForce 8800 GTX, Intel Core 2 Duo 2.33

Note that the number of active blocks in a multi-processor is 8.

And, your smem,thread configuration allows 8 blocks to occupy the MP simultaneously.

So 8/24 warps is your occupancy = 0.3333. I dont understand why you say it is 0.17. Am I missing sthg here?

Since you have 16MPs, 816 = 128 blocks can simultaneously parallely run.
Lets say for your input 128 blocks take “k” seconds
then
128
2 blocks would take nearly “2*k” seconds. Because you are limited by a maximum parallel occupation of 128 blocks at any given time.

So, you will have jump for every 128 blocks on your system.
This is equivalent to 8 blocks Per MP == 8 Warps per MP.

That is why you see the JUMP after every 8 Warps per MP.

Also note that:
The dots in your graph does NOT correspond to the X-Axis points well. They look to be taken for “floating point” number of warps per MP – which is obviously wrong… Re-graph it. :-)

And, your smem,thread configuration allows 8 blocks to occupy the MP

simultaneously.

Why? Type given values into CUDA occupancy caclulator and you get 0.17.* And profiler after kernel runs gives the same value.

That is why you see the JUMP after every 8 Warps per MP

After 4 :)

Re-graph it. :-)

Graph is right. I probably should have given better description. See below.

I tested configurations with 8, 16, 24 32, 40, 48, 56, 64 etc blocks. (You remember that in my situation block == warp).

Of course having 24 blocks is stupid from the viewpoint of performance since we have 1 block per MP for “first” 8 MP and 2 blocks per MP for “other” 8 MP. On this graph this situation corresponds to 24 blocks /16 MP = 24 warps / 16 MP = 1.5. :) Of course I uderstand that is not correct to say that we have 1.5 warps per MP. But underlying meaning is obvious - we have 24 warps per 16 MP. And since CUDA allows to test such configuration - I tested it.

*Probably you have not CUDA occ. calc. right now - so I’ve attached the image.
occcalc.JPG

INteresting thing on CUDA occupancy…

2832 = 896 – This was the registers that I was expecting for a block
However the XLS sheet shows 1792 == 2
896 regs per block. Hence the occupancy of 0.17…

Now, I have read something on this somewhere… – about the 2x factor.

Can some1 kindly explain this pleasssse???

Now, I have read something on this somewhere… – about the
2x factor.
I’ve mentioned it here:
http://forums.nvidia.com/index.php?showtopic=49707&st=20

BTW, I said in this topic that 64 blocks per warp is better than 32 just because of this x2 factor.

Can some1 kindly explain this pleasssse???
Well its up to nVidia-people.

Aaah… I have been caught unaware… i have been running 32 threads per block and I was just assuming that 7 or 8 blocks run all the time… Ooops… If I dont have 6 blocks – register latency starts hitting me anyway…

Can some1 from nvidia tell sthg about this?

btw, Is there a way of reducing “register” count… ?? Like a compiler option that would optimize for number of registers…

Pg 16 of nvcc_1.1.pdf, compiler option

-maxrregcount nnn