Test Multi Threading Spinning

Though ultimately this seems unlikely, because as the kernel gets run many times the number of malfunctioning threads left spinning on the hardware would increase rapidly. Still, it was an interesting/funny idea :)

Hmm, now that I inspect version 0.02 with the thread dimensions (32,1,1):

It seems it’s not block 9 but block 1 or so that is hanging… this is kinda strange, going to investigate further.

Ok, another lesson learned. I have seen this mentioned before, but now I saw it in action myself:

When the test program is run at full speed it can of course never complete. When “break all” is chosen to start inspecting what’s going on, the following might show up:

CUDA simply assumes that it can divide up the work any way it wants…

So there is no guarantee that blocks 0 to 10 will all be executing… when it runs at full speed it seems to select block 0, block 3, block 4 and so forth.

So for some reason it skips block 1 and block 2; it almost seems like a little start-up lag issue.

So this explains what was happening and why the index was stuck at 32: block 1 was simply missing and couldn’t continue the work.
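
To make this concrete, here is a minimal sketch of the kind of pattern that hangs (a hypothetical example, not the actual version 0.02 kernel): each block spins on a flag that the previous block is supposed to set, so if the scheduler never makes block 1 resident, every block after it spins forever.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the failing pattern (not the actual v0.02 kernel):
// thread 0 of block b spins until block b-1 has raised its flag.
// Blocks are not guaranteed to be resident at the same time, so if the
// scheduler never runs block b-1, block b spins forever and the kernel hangs.
__global__ void spin_on_previous_block(volatile int *flags)
{
    if (threadIdx.x == 0)
    {
        int b = blockIdx.x;
        if (b > 0)
        {
            while (flags[b - 1] == 0)   // may never become 1 -> deadlock
                ;
        }
        flags[b] = 1;                   // signal the next block
    }
}

int main()
{
    const int blocks = 10;              // assumed block count for the example
    int *flags;
    cudaMalloc(&flags, blocks * sizeof(int));
    cudaMemset(flags, 0, blocks * sizeof(int));

    spin_on_previous_block<<<blocks, 32>>>(flags);  // 32 threads per block, as in the test
    cudaDeviceSynchronize();                        // may never return

    printf("done (in practice this line may never be reached)\n");
    cudaFree(flags);
    return 0;
}
```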

So another important lesson learned:

Each debugging session can be different with these kinds of problems… if a breakpoint is set immediately at the start, then the blocks selected seem to be a bit different.

So each debugging session can produce a different result/hang situation, which might make it a bit more difficult to debug…

As I mentioned before:

I don’t quite understand how this constraint could be violated:

“Maximum number of resident warps per multiprocessor = 48 for compute 2.0”

If a “half-warp” is also considered a “resident warp” then perhaps a “half-warp” example could violate this constraint:

48 * 16 threads = 768 threads.

So a thread block of 1024 threads, using 49 half-warps or more, might be capable of creating a deadlock somehow.

I can imagine that this would require each half-warp to keep spinning somehow and never go to the next bunch of threads waiting for execution.

You can’t do this at all. Scheduling dependencies between blocks will fail.

My system just mysteriously froze/the screen went black; right before that there was disk drive activity, probably the Microsoft virus scanner. I suspect the deadlocking of the GPU had something to do with it.

So all this experimenting is not without risk to hardware. Fortunately my motherboard and graphics card are kinda cheap so that’s not a huge issue, but the rest of the system was quite expensive, so I do hope everything else stays ok.

Anyway.

I think I now also understand better why the random access memory test performed so badly.

The random memory access test does 1 memory access per thread.

So let’s assume 32 threads are executed in parallel; this means 32 memory accesses per clock cycle.

The multi processor only has room for 1024 threads. Because the first 32 threads stall immediately it switches to the next warp.

So 1024 / 32 = 32.

This means after 32 clock cycles all thread contexts have been used up… and all 1024 threads are now stalled waiting for memory.

The memory latency is said to be about 600 clock cycles.

So for 600 - 32 = 568 clock cycles CUDA is waiting and doing nothing :(

If the thread resources were higher, for example, then it would be:

1536 / 32 = 48 clock cycles… 600 - 48 = 552, still a lot of waiting time.

This even assumes the worst case scenario; in reality it probably executes 48 threads in parallel.

So real numbers are probably:

1024 / 48 ≈ 21 clock cycles.

After 21 clock cycles all threads are stalled and waiting for memory :(
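
Here is a little back-of-the-envelope sketch of that calculation (assuming, as above, a 600-cycle memory latency, one memory request per thread, and the 32-wide/48-wide issue rates used in the text):

```
#include <cstdio>

// Back-of-the-envelope estimate: how many cycles does it take to issue one
// memory request for every resident thread, and how long does the
// multiprocessor then sit idle waiting for the first request to come back?
// Assumed numbers: 600-cycle memory latency, one request per thread.
int main()
{
    const int memory_latency = 600;     // assumed latency in clock cycles

    struct Case { const char *name; int resident_threads; int issue_width; };
    const Case cases[] = {
        { "1024 threads, 32-wide issue", 1024, 32 },
        { "1536 threads, 32-wide issue", 1536, 32 },
        { "1024 threads, 48-wide issue", 1024, 48 },
    };

    for (const Case &c : cases)
    {
        int cycles_to_issue = c.resident_threads / c.issue_width;  // all contexts used up
        int idle_cycles     = memory_latency - cycles_to_issue;    // nothing left to run
        printf("%s: busy %d cycles, then idle %d cycles\n",
               c.name, cycles_to_issue, idle_cycles);
    }
    return 0;
}
```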

So an interesting question for hardware developers would be:

“How many thread contexts/resources does CUDA need to completely hide memory latency?”

Let’s leave branches and other slight instruction overhead out of the equation.

Assuming CUDA issues 48 memory requests per clock cycle, it’s a pretty easy formula:

CUDA cores * memory latency = number of thread contexts needed.

So in this case:

48 * 600 clock cycles = 28800 thread contexts.

So CUDA should have at least 28800 thread resources per multi processor to completely hide memory latency.

This would be the best case/extreme case.

In reality perhaps some clock cycles per memory request are spent on branching or increasing an index or so…

Still having it maxed out would be nice.

Now let’s compare best case to current situation:

28800 / 1024 ≈ 28 clock cycles.

CUDA assumes that each thread will spend 28 clock cycles on overhead.

For my RAM test this is probably not the case… the overhead is perhaps 3 clock cycles or so… maybe even less…

So at least to me CUDA seems “thread context/resource” starved, at least for random memory access.

This seems to be the bottleneck for now; once this bottleneck is lifted in the future, perhaps only then would the DRAM 32-byte memory transaction size become a limit.

But for now, CUDA seems thread-resource starved :(
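
The same back-of-the-envelope reasoning, turned around: assuming 48 memory requests issued per clock and 600 cycles of latency (both numbers taken from above), how many thread contexts would be needed, and how much per-thread overhead does the current 1024-thread limit implicitly assume? A quick sketch:

```
#include <cstdio>

// How many resident thread contexts would be needed to keep issuing memory
// requests for a full memory-latency period, and how many cycles of
// per-thread "overhead" does the 1024-thread limit implicitly assume?
// Assumed numbers: 48 requests issued per cycle, 600-cycle latency.
int main()
{
    const int requests_per_cycle = 48;
    const int memory_latency     = 600;   // clock cycles
    const int resident_threads   = 1024;  // what the test actually had resident

    int contexts_needed  = requests_per_cycle * memory_latency;   // 28800
    int implied_overhead = contexts_needed / resident_threads;    // ~28 cycles per thread

    printf("thread contexts needed to hide latency: %d\n", contexts_needed);
    printf("implied overhead per thread at %d resident threads: %d cycles\n",
           resident_threads, implied_overhead);
    return 0;
}
```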

Hmm, now I am not so sure anymore. By changing the threads per block from 1024 to 256, according to the occupancy calculator, this should max out the number of threads being used on the multi processor, which would be 1536 instead of just 1024.

This should have given a higher number of memory transactions per second, but it didn’t… so perhaps the bottleneck is somewhere else…

I am also unsure why 256 threads per block would give 100% occupancy for compute 2.1?!?

Ok, this is a bit wacky but here goes; there are apparently further constraints, as follows:

Maximum number of resident warps per multiprocessor = 48 for compute 2.0

^ This is the number of warps (each warp being a group of 32 threads, so a total of 48 x 32 = 1536 threads).

However each multi processor can only have 8 resident blocks; since warps are responsible for executing the blocks, the warps need to be distributed over the blocks, so this gives:
(Maximum number of resident blocks per multiprocessor = 8 for compute 2.0)

So this gives following formula:

MaxResidentWarps / MaxResidentBlocks = MaxResidentWarpsPerBlock.

So plugging in the numbers gives:

48 / 8 = 6 resident warps per block.

Since each warp has 32 threads this gives:

6x32 = 192 resident threads per block.

Since there are 8 blocks this gives: 8 x 192 resident threads = 1536 threads.
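
Just to double-check the arithmetic, the same calculation as a tiny snippet (nothing beyond the numbers already used above):

```
#include <cstdio>

// Double-check of the formula above for compute 2.0:
// MaxResidentWarps / MaxResidentBlocks = MaxResidentWarpsPerBlock,
// then scaled back up to threads per block and threads per multiprocessor.
int main()
{
    const int max_resident_warps  = 48;
    const int max_resident_blocks = 8;
    const int warp_size           = 32;

    int warps_per_block   = max_resident_warps / max_resident_blocks;   // 6
    int threads_per_block = warps_per_block * warp_size;                // 192
    int threads_per_mp    = max_resident_blocks * threads_per_block;    // 1536

    printf("%d warps/block, %d threads/block, %d threads per multiprocessor\n",
           warps_per_block, threads_per_block, threads_per_mp);
    return 0;
}
```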

So 256 threads per block probably wasn’t optimal. Maybe the calculator was wrong, or maybe it used some extra threads available, or maybe I made a mistake in the formulas above. When I first did some calculations with the calculator, 256 seemed to make sense, but now it doesn’t make that much sense to me anymore…

I am going to give 192 a try and see what happens; so far the occupancy calculator still says: 100%.

Well, these constraints allow multiple optimal solutions, at least when it comes to occupancy.

So far 128, 192 and 256 all give 100% occupancy, though 192 seems to perform slightly worse than the rest.

128 is probably not optimal; Google didn’t refresh the results, I think… 128 threads per block would give too many blocks: 12.

So it’s either: 1536 / 256 = 6 blocks each of (48/6) = 8 warps = 8 * 32 = 256 threads again.

or

1536 / 192 = 8 blocks each of (48/8) = 6 warps = 6 * 32 = 192 threads again.

The complete list of optimal occupancy for thread block size is:

192, 256, 384, 512, 768

This is pretty easy to try out:

1536/8 = 192
1536/7 = bad
1536/6 = 256
1536/5 = bad
1536/4 = 384
1536/3 = 512
1536/2 = 768

Number of threads cannot exceed 1024 so /1 falls off.
Number of blocks cannot exceed 8 so /9 and above falls off.

Some divisions lead to fractions so those fall off.

Which leaves the 5 solutions above.

However, the warps must also be distributed across the blocks, so further calculations could be done to see if they’re nicely distributed, just to make sure each block completes in the same time. This is probably not a requirement, but it’s interesting anyway:

48 / 8 = 6
48 / 6 = 8
48 / 4 = 12
48 / 3 = 16
48 / 2 = 24

So surprisingly even 3 blocks produces a nice warp distribution! =D

1536 (maximum resident threads per multi processor)
8 (maximum resident blocks per multi processor)
48 (maximum resident warps per multi processor)
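
Those three limits are easy to brute-force; here is a small sketch that reproduces the list of block sizes above (it ignores registers and shared memory, which also limit occupancy in practice):

```
#include <cstdio>

// Brute-force the block sizes that give 100% occupancy on compute 2.0,
// using only the three limits listed above (registers and shared memory
// are ignored here, although they also constrain occupancy in practice).
int main()
{
    const int max_resident_threads  = 1536;  // per multiprocessor
    const int max_resident_blocks   = 8;     // per multiprocessor
    const int max_resident_warps    = 48;    // per multiprocessor
    const int max_threads_per_block = 1024;
    const int warp_size             = 32;

    for (int blocks = 1; blocks <= max_resident_blocks; ++blocks)
    {
        if (max_resident_threads % blocks != 0)
            continue;                            // fractional threads per block
        int threads_per_block = max_resident_threads / blocks;
        if (threads_per_block > max_threads_per_block)
            continue;                            // exceeds the 1024-thread block limit
        if (threads_per_block % warp_size != 0)
            continue;                            // not a whole number of warps

        int warps_per_block = max_resident_warps / blocks;
        printf("%d blocks x %d threads (%d warps per block)\n",
               blocks, threads_per_block, warps_per_block);
    }
    return 0;
}
```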