Instruction timings: more info than in the Guide

Is there a G80 assembler manual that gives more detail on instruction timings than is available in the Guide?

Specifically, integer div and mod are “particularly costly”, but what is the cost?

The Guide says to avoid div and mod, but if the workaround is going to cost 4 instructions, should one do this or not? Probably not, since future hardware implementations will probably fix integer div and mod. It would be good to know what the current cost is.
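For what it’s worth, the usual strength reduction only applies when the divisor is a power of two; a minimal sketch (function names are my own):

__device__ unsigned int mod64(unsigned int x) { return x & 63u; }  // x % 64 as a single bitwise op
__device__ unsigned int div64(unsigned int x) { return x >> 6; }   // x / 64 as a single shift

A divisor like 624 gets no such shortcut, which is why the actual cost of the hardware mod matters.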

We could switch the code on a conditional if there were a standard for architecture defines in the compilation environment; per another of my threads, this has not been standardised yet.

Thanks,
Eric

You could introduce the define yourself, compile all the versions, store them in CUDA’s code repository, and select them at runtime. If you include the abstract machine target “compute_10”, you can make it compile at runtime. You would still need to inspect the PCI IDs to see what card is present, as I have not found a CUDA runtime API call for that (something like cgGLSetOptimalOptions would be cool to have). See the nvcc manual for details.
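A sketch of the define part (MY_ARCH_G80 is a name I made up; you would pass it yourself with nvcc -DMY_ARCH_G80, since nvcc does not define one for you):

__global__ void mulKernel(int* out, int x)
{
#ifdef MY_ARCH_G80
    out[0] = __mul24(x, 789);   // G80 path: the 24-bit multiply runs at full speed
#else
    out[0] = x * 789;           // generic path: assume a fast 32-bit multiply
#endif
}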

Peter

Hmmm… no answer, so I guess NVIDIA doesn’t want an official figure appearing in a competitive product matrix. So I asked someone to run this program:

#include <stdio.h>
#include <cutil.h>

#define RESULTS 3

__global__ void
testKernel(int* result)
{
    __shared__ int      t1, t2, t3, t4;
    __shared__ int      x, y;

    __syncthreads();
    t1 = clock();
    __syncthreads();
    x = t1 + 534;                       // timed op 1: add
    __syncthreads();
    t2 = clock();
    __syncthreads();
    y = x % 624;                        // timed op 2: mod
    __syncthreads();
    t3 = clock();
    __syncthreads();
    y *= 789;                           // timed op 3: mul32
    __syncthreads();
    t4 = clock();
    __syncthreads();

    result[2] = y;                      // Result is used
    result[0] = t3 - t2 - (t2 - t1);    // Mod over add
    result[1] = t4 - t3 - (t2 - t1);    // 32mul over add
}

int
main(int argc, char** argv)
{
    int         result[RESULTS];
    int*        d_data;

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, sizeof(int) * RESULTS));

    dim3  grid;                         // defaults to one block
    dim3  threads;                      // defaults to one thread
    testKernel<<<grid, threads>>>(d_data);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
    printf("%d %d\n", result[0], result[1]);

    threads.x = 32;                     // repeat with one full warp
    testKernel<<<grid, threads>>>(d_data);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
    printf("%d %d\n", result[0], result[1]);

    CUDA_SAFE_CALL(cudaFree(d_data));
    return 0;
}

with the following results:

Now there appear to be a few deductions we can make from these results:

  1. clock() measures in new GPU clocks (an outstanding question in http://forums.nvidia.com/index.php?showtopic=35341), unless there is an even faster clock running around the GPU!

  2. Mod really is expensive; at first blush it looks like nearly 60 times a normal instruction.

  3. Having unused warps actually costs you at the hardware scheduler level: 1 clock for each unused warp for each instruction cycle. It actually looks like all instruction cycles are 5 clocks if the warp is running and 1 clock if it is not.

Any other takes on these results?

Eric

ed: I am told by others that they are getting quite varying results for mod; no doubt it is an iterative algorithm that stops when it has the answer. It is dependent upon its inputs, and one of them here is the clock, which wraps through 0 about every 4 seconds @ 1GHz.
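One way to factor that variation out would be to time a fixed operand instead of a clock()-derived one; an untested sketch against the kernel above:

    x = 1234567;        // fixed operand, not derived from clock()
    __syncthreads();
    t2 = clock();
    __syncthreads();
    y = x % 624;        // latency now depends on a constant input only
    __syncthreads();
    t3 = clock();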

ed: The numbers above were generated by pre-release tools and are in new GPU clocks; it looks like 0.8 numbers will come out at half these.

I notice the line y *= 789 may have a write-after-read dependency that was not in the other code segments, making the segments not exactly comparable. Perhaps it should be changed to z = y * 789 and re-run. Any takers?
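For reference, the changed line would be just:

    z = y * 789;        // fresh destination: no dependency on the earlier write of y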

In case you were wondering, the syncs are there just to stop the compiler re-ordering the instructions and to force the writing of shared memory; they don’t do anything in one warp at run time.
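In other words, each measurement boils down to this pattern:

    __syncthreads();    // barrier: compiler may not move code across it
    t1 = clock();       // timestamp before
    __syncthreads();    // keep the timed op between the two reads
    x = t1 + 534;       // the single operation being timed
    __syncthreads();
    t2 = clock();       // timestamp after; cost = t2 - t1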

Eric

Last instalment for the moment: a puzzle instead of your Sunday crossword. Updating the test to:

#include <stdio.h>
#include <cutil.h>

#define RESULTS 4

__global__ void
testKernel(int* result)
{
    int                 t1, t2, t3, t4;     // timestamps now live in registers
    __shared__ int      x, y, z;

    z = clock();
    __syncthreads();
    t1 = clock();
    __syncthreads();
    x = z + 534;                            // timed op 1: add
    __syncthreads();
    t2 = clock();
    __syncthreads();
    y = x % 624;                            // timed op 2: mod
    __syncthreads();
    t3 = clock();
    __syncthreads();
    z = y * 729;                            // timed op 3: mul32, fresh destination
    __syncthreads();
    t4 = clock();
    __syncthreads();

    if (threadIdx.x == 0)
    {
        result[3] = z;                      // Result is used
        result[0] = t2 - t1;                // Total add/sync clock shared read & write time
        result[1] = t3 - t2 - (t2 - t1);    // mod over add
        result[2] = t4 - t3 - (t2 - t1);    // mul32 over add
    }
}

int
main(int argc, char** argv)
{
    int         result[RESULTS];
    int*        d_data;

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, sizeof(int) * RESULTS));

    dim3  grid;
    dim3  threads;                          // single thread for the first two runs
    testKernel<<<grid, threads>>>(d_data);
    CUT_CHECK_ERROR("Kernel execution failed");
    testKernel<<<grid, threads>>>(d_data);  // run twice so the code is already loaded
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
    printf("%8d%8d%8d%8d\n", threads.x, result[0], result[1], result[2]);

    for (threads.x = 32; threads.x <= 512; threads.x += 32)
    {
        testKernel<<<grid, threads>>>(d_data);
        CUT_CHECK_ERROR("Kernel execution failed");
        CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS, cudaMemcpyDeviceToHost));
        printf("%8d%8d%8d%8d\n", threads.x, result[0], result[1], result[2]);
    }

    CUDA_SAFE_CALL(cudaFree(d_data));
    return 0;
}

gives, on the 0.8 release (so double every number for new GPU clocks; the table below shows two separate runs):

 threads result0 result1 result2
       1      52     908      48
      32     172     914      50
      64     180     908      50
      96     180     904      50
     128     188     912     162
     160     310     914      46
     192     326     964      48
     224     366    1008       2
     256     338    1082      68
     288     386    1138      74
     320     450    1174      68
     352     470    1286      76
     384     488    1404     120
     416     532    1502     126
     448     584    1558     114
     480     700    1652      68
     512     646    1798     150

 threads result0 result1 result2
       1      52     892      48
      32     172     914      50
      64     180     906      50
      96     180     912      50
     128     188     918     162
     160     310     936      46
     192     326    1000      48
     224     366     956      62
     256     338    1048      70
     288     380    1134      80
     320     450    1190      64
     352     470    1310      74
     384     488    1412     106
     416     532    1508     126
     448     578    1616     120
     480     692    1658      76
     512     646    1776     158

There is a lot of info one can deduce from these numbers, some of which goes against the obvious interpretation of the Guide.

The single-threaded run cannot have any register conflicts, shared memory conflicts, or anything else like that, so it is likely pretty clean. If we trust the Guide that mul32 takes 4x as long as add, then we can conclude the minimum instruction cycle time of an MP core is 32 new GPU clocks: the single-threaded mul32-over-add delta is 48 at 0.8, i.e. 96 new clocks, and mul32 costs 3 more instruction cycles than an add, so 96 / 3 = 32 clocks per cycle. That is 8 times slower than one is led to assume; no wonder single-threaded code runs so slowly… As warps are added, things don’t change a lot until those 32 clocks are used up, which makes me think that when a warp stalls it probably still uses instruction cycles.

Given this assumption, the mod op definitely looks like 58 or so instruction cycles (908 at 0.8 is 1816 new clocks, and 1816 / 32 ≈ 57 cycles over an add, so about 58 in total).

The other interesting thing is that the single-threaded add time is only 104 new clocks (clock to clock), and the mul32 time is an additional 96, so we can conclude the kernel launcher deleted the bar.wait instructions, as it is entitled to do within a warp (which does mean the code is being reloaded on each kernel call: latency). That leaves only 8 clocks for the load-clock-into-register op… the only thing that does not add up??

The next interesting thing is that the add clock-to-clock time for 1 warp is incremented by 240 new GPU clocks over the single-threaded version (172 vs 52 at 0.8). Now the launcher still would not have put the bar.wait instructions back in yet, and since this app uses 14 registers, the loader should have allocated them to separate register banks, so there should be no register bank conflict (assuming 16+ register banks and the same technology as shared memory). There should be no read-after-write or write-after-read conflicts either, as there is plenty of time between setting and using registers and shared memory. That leaves only one possibility: a shared memory write conflict. Here we have a 32-way bank conflict with a full register write cycle time, which gives 2 whole instruction times as the write cycle time of shared memory (8 clocks). That blows my theory that if one were designing hardware to do this, one would have to decide in which order to do the writes and would therefore just do the last one, as all the others are redundant. This is the second big surprise.
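If that reading is right, electing a single writer should make the conflict disappear; a hedged sketch of the variant:

    __shared__ int x;

    if (threadIdx.x == 0)
        x = t1 + 534;       // one thread writes: no 32-way same-address conflict
    __syncthreads();        // publish x to the rest of the block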

Any other ideas or further interpretation as the number of warps increases? There is certainly a lot going on here, and it is a complex set of equations to solve. (I did not run this code.)

Eric

ed: Just thought of a good reason for the same time for mul32 and add in one run at 224 threads: the hardware detects that the top bytes are zero and does a mul24 for you, so there is no need to explicitly code mul24…

ed: (to save a post) On a quick revisit, my suggestion is that at less than 33% occupancy the G80 is throttled by instruction fetch bandwidth. It seems to go into a funny mode like a super-warp of 8 warps (not strictly, but 8 warps are run, taking the full instruction cycle times, even though as many as 7 are doing nothing). The load-clock does add up, as it does not need ALUs (direct broadcast of the hardware clock register to 32 thread registers) and so it can happen in 1 clock, which implies 32 banks in the register store; believable, as it is twice as big as shared memory and an obvious design decision to minimise conflicts.

The write conflict to shared mem is definitely there and appears to take 32 clocks for a 32-way bank (bus) conflict. It seems the hardware has to get all the data to decide what to write (as suggested by the Good Programming thread and by the current compiler’s use of these concurrent writes, even when it is explicitly coded as a single-thread write; that could actually be a bug). Not as bad as mentioned above, but good to know the cost: 8x the cost of writes to separate shared memory locations without bank conflict, if all threads are running (code not divergent). I expect this delay could get much smaller in future revisions of the hardware, if the bus conflict cycles can be aborted as soon as the result is known. This may already be the case, and writing the same value from all 32 threads is the worst-case scenario requiring all data to be checked.
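For anyone who wants to measure the two write cases side by side, the contrast would look something like this (sketch):

__global__ void writeTest(int* out, int v)
{
    __shared__ int s[32];

    s[threadIdx.x] = v;     // one word per bank: conflict-free write
    __syncthreads();
    s[0] = v;               // 32 lanes to one address: the ~8x conflicted case
    __syncthreads();

    if (threadIdx.x == 0)
        out[0] = s[0];      // use the result so nothing is optimised away
}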

Along the way I concluded that syncs do not cost 4 clocks; they only cost the last warp to reach them, and then probably less than 4 clocks.

Very interesting findings, osiris. I have seen that there is something additional going on at kernel startup; see the end of this thread. Did you try comparing timings with the same #threads for different #blocks?
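Something like this host loop, against the listing above, would do it (untested sketch):

    dim3 threads(32);                       // fix the block size at one warp
    for (int b = 1; b <= 16; ++b)
    {
        dim3 grid(b);                       // vary only the number of blocks
        testKernel<<<grid, threads>>>(d_data);
        CUT_CHECK_ERROR("Kernel execution failed");
        CUDA_SAFE_CALL(cudaMemcpy(result, d_data, sizeof(int) * RESULTS,
                                  cudaMemcpyDeviceToHost));
        // every block writes the same slots, so this shows one block's timings
        printf("%8d%8d%8d%8d\n", b, result[0], result[1], result[2]);
    }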

Peter