Weird thing on my GTX 460

Below is the kernel I run. The macro expansion repeats the operation f[0]++ one thousand times inside the loop body, and the for loop executes 10,000 times, so each thread performs 10,000,000 dependent additions.

#define org f[0]++;
#define _10(o) o o o o o o o o o o

__global__ void test1(float *dev)
{
	float f[3];

	for(int j=0; j<3; j++)
		f[j] = 0;

	for(int i = 0; i<10000; i++)   //10000 = CYCLES / 1000
	{_10(_10(_10(org)))}           //10*10*10 = 1000 copies of f[0]++ per iteration

	*dev = f[0];
}

//my main function has something like this:

	cudaEventCreate(&start);   //event creation shown here for completeness
	cudaEventCreate(&stop);
	cudaEventRecord(start);
	test1<<<6, 16*32>>>(f);
	cudaEventRecord(stop);
	cudaThreadSynchronize();   //make sure the stop event has actually been recorded
	float ms;
	cudaEventElapsedTime(&ms, start, stop);
	printf("clocks per operation: %f\n", 1560000*ms/10000000.0f);
	//1560000 is the number of shader clock cycles per millisecond (1.56 GHz); ms is in milliseconds
	//10000000 is the number of times the addition is repeated in each thread

The launch configuration of <<<6, 16*32>>> gives exactly the output I'm looking for: around 16.07, which is the expected number of cycles an addition takes to complete here. But the moment I change it to <<<7, 16*32>>>, the output becomes 32.1+, suggesting that not all blocks are running in parallel. But, hey, doesn't the GTX 460 have 7 MPs? Shouldn't all seven blocks be running at the same time, each occupying one MP? I was expecting the output to rise to 32 only when I changed the configuration to <<<8, 16*32>>>… now the result really confuses me. Does anyone know why it behaves like this?

Do you happen to have the GTX 460 SE with only 6 MPs?

The Visual Profiler shows 7 MPs and 1024 MB of global memory, so it can't be the SE. Also, the 460 I use is a secondary graphics card attached to my notebook. The OS probably doesn't use the secondary card, does it (the primary is a discrete ATI X1600)? I have also disabled the graphics driver reset that kicks in when execution takes too long.

The Visual Profiler could be wrong about the number of MPs. What is the PCI ID of the card (you can see it through the Device Manager in Windows)? And how did you manage to stick a 460 into a notebook? It's a desktop card. Maybe what you have is a 460M?
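For what it's worth, the runtime API reports the MP count and clock independently of the profiler. A minimal sketch (just a standard cudaGetDeviceProperties query, assuming the 460 is device 0):

	#include <cstdio>
	#include <cuda_runtime.h>

	int main()
	{
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, 0);   //device 0; change the index if the 460 is not the first device
		printf("name: %s\n", prop.name);
		printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
		printf("clockRate: %d kHz\n", prop.clockRate);   //reported clock rate in kHz
		printf("compute capability: %d.%d\n", prop.major, prop.minor);
		return 0;
	}

If multiProcessorCount comes back as 7 here too, at least the profiler and the runtime agree.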

Also, one thing that you seem to be taking for granted, but that I find strange, is that you're seeing 32 additions per MP per clock. The 460 is a compute capability 2.1 chip; you should be getting 48, not 32.

In fact, I tried that code on my 560 (compute capability 2.1). I see the right number of MPs (8) and the right clock (1.7 GHz), but I too see only 32 additions per MP per clock.

EDIT: I think I'm starting to understand the part about 32 vs 48 ops per clock. It's happening because the instructions in that kernel can only be executed serially (addition #X has to complete before addition #X+1 can be issued). There are two warp schedulers per SM, and each tries to issue two instructions every two clocks, which could mean 64 ops/clock if there were enough cores to do that. But serial code means each scheduler can only issue 1 instruction per two clocks.

Try to increase the number of threads per block to 768 or 1024. Do you still see the same scaling? 512 may not be enough to hide all register read-after-write dependencies.

My card uses 5x as much power as my laptop. Please, it's not an M card. I connect it to my laptop using an ExpressCard slot and a PCIe adapter. The PCI ID is 10de:0e22.

Using 16 warps means each warp scheduler deals with 8 warps, taking 2*8 = 16 cycles to finish issuing a single instruction across all of its warps. Since the arithmetic latency of the addition is about 16 cycles, there is perfect overlap and no stalling in this case. Anything below 16 warps, in my code, will cause incomplete overlap… And no matter how many warps I use, the second dispatch unit in each warp scheduler is always idle, because it needs ILP to do anything and I have no ILP here.
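To illustrate that last point (purely hypothetical, not code from this thread): a variant of the same kernel with two independent accumulators would give each scheduler two independent adds in flight, so the second dispatch unit would have something to issue:

	//Hypothetical ILP variant of test1: two independent dependency chains per thread.
	#define org2 f[0]++; f[1]++;
	#define _10(o) o o o o o o o o o o

	__global__ void test2(float *dev)
	{
		float f[3];
		for(int j=0; j<3; j++)
			f[j] = 0;
		for(int i = 0; i<10000; i++)
		{_10(_10(_10(org2)))}        //1000 copies of the pair per iteration
		*dev = f[0] + f[1];          //keep both chains live so the compiler doesn't drop one
	}

If the second dispatch units kick in, the measured cycles per addition should drop; note that the op count in the host-side formula would also have to double (20,000,000 per thread).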

EDIT: increasing the number of warps in this case will just make each block take longer to finish. If there are fewer than 7 blocks, the output from my code should be ceiling(n/2)*2, where n is the number of warps per block and n > 16 (e.g., 24 warps → ceiling(24/2)*2 = 24).

But this behaviour is understood. What I don't know is why my kernel only gets to run on 6 MPs. It's a secondary card with no other load!

This is becoming weird. You could record the SM IDs to see whether 6 or 7 different SMs are used.
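One way to do that (a sketch, either as a tiny standalone kernel like this or added into test1 itself) is to read the %smid special register via inline PTX and have one thread per block record where the block ran; the kernel and array names here are just for illustration:

	__device__ __forceinline__ unsigned int smid()
	{
		unsigned int id;
		asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
		return id;
	}

	//Each block writes the SM it ran on; tally the results on the host afterwards.
	__global__ void record_smid(unsigned int *block_to_sm)
	{
		if(threadIdx.x == 0)
			block_to_sm[blockIdx.x] = smid();
	}

Launching it with the same <<<blocks, threads>>> configuration as test1 and multiplying by the block size gives the threads-per-SM count.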

Where did you get this 16? The relevant number here is the write-to-read dependency latency, which, according to the CUDA Best Practices Guide, is 24.

If you check again… you'll probably notice it's written as "16 - 24" cycles… and if not, there are plenty of other documents that support my claim here…

This document says that for GF100 and GF104 it's roughly 18 cycles…

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

I remember reading in some other official document that arithmetic latency is 16-24 clocks, unless I'm having a memory failure.

I tried the method. Result is like this:

SMID	No. of threads
0	512
1	1024
2	512
3	512
4	0
5	512
6	512
7	0	//this one does not exist, so being 0 is correct

And it's not random. I tried many runs, and it's always SMs 1 and 4 that come out wrong (1 gets the double load, 4 gets nothing).

When I increase the number of blocks to 14, things become completely normal: MPs 0 to 6 each execute 1024 threads.

Does this indicate a flaw in Fermi's block dispatching engine? This behaviour effectively prevents me from using 7 blocks to increase the amount of shared memory per block and registers per thread.

Also, MP 1 seems to take priority over MP 0 for block execution. When I run a single block, it always gets executed on MP 1 instead of MP 0.

… did a bit more testing to see in what order the MPs are assigned blocks; here's the result:

1 3 5 0 6 2 1	//the first seven blocks
4 3 0 5 2 6 4	//the next seven blocks
1 0 3 2 5 4 6	//the last seven blocks

The above means that when only one block is launched, MP 1 is used. When 2 blocks are launched, MP 3 joins the task; when 3 blocks are launched, MP 5 joins; and so on.

I agree that is not optimal, although I've just realized that it's not easy to implement a completely fair block scheduler. On compute 1.x devices (where blocks are scheduled strictly round-robin, and new waves of blocks only start when all previous waves have finished) I used to run my own block scheduler, as my blocks have largely varying runtimes. And I realize that my implementation would not be fair under these conditions either.

I doubt, though, that this prevents you from using the full amount of shared memory and registers. On the contrary, increased resource usage might even out the scheduling: you could try allocating so much shared memory that only one block can run per SM at any time (you don't even need to add a dummy smem variable, just pass the amount of dynamic shared memory as the third argument between the <<< >>>). I'd expect that to balance the SM use.
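A sketch of that suggestion (the 32 KB figure is an assumption; anything over half of the 48 KB of shared memory an SM has in the default Fermi 48 KB shared / 16 KB L1 split will do):

	//Reserve 32 KB of dynamic shared memory per block. The kernel doesn't have to
	//use it; the reservation alone means two blocks can no longer fit on one SM.
	size_t smemPerBlock = 32 * 1024;
	test1<<<7, 16*32, smemPerBlock>>>(f);

With at most one resident block per SM, seven blocks should be forced onto seven different SMs, which would make it easy to check whether the uneven distribution persists.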