Maxwell architecture details


I have a two-part question:

   1. Is there any way to find the pipeline depth of each unit? The critical thing I would like to find out is whether shared memory accesses are pipelined.

   2. I see that the NVIDIA documentation lists the assembly instruction set. However, I don't see the allowed sources/destinations for each instruction. For example, there is a MOV instruction. Will it allow a move directly from global memory to shared memory? Is there any documentation of this kind?

The PTX instruction set documentation is pretty close to the capabilities of shader assembly (SASS), so that's a good place to look.

You can also compile some simple CUDA C code and inspect the disassembled SASS to get an idea of how it works.
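As a rough sketch of that workflow: the kernel below is a made-up example (the name `scale` and the file name `scale.cu` are just placeholders), and the commands in the comments assume a CUDA toolkit that can still target sm_50. In the listing you should see a global load, a multiply, and a global store, each working on registers.

```cuda
// scale.cu: a tiny kernel whose SASS is short enough to read by hand.
//
// Build a cubin for first-generation Maxwell and disassemble it:
//   nvcc -arch=sm_50 -cubin -o scale.cubin scale.cu
//   cuobjdump -sass scale.cubin
// (nvdisasm scale.cubin prints a similar listing.)

__global__ void scale(const float *in, float *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)
        out[i] = k * in[i];  // global load -> register -> multiply -> global store
}
```

Changing one thing at a time in the source (a different memory space, a wider type) and diffing the SASS is a quick way to see which operands each instruction accepts.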

You can also check out my Maxwell assembler:

With it you can construct any microbenchmark you like to probe the details of the hardware. I've already done this myself to a large extent but haven't had the time to document it yet. I was hoping NVIDIA would have released some lower-level documentation by now.

Anyway, shared memory accesses go through a fairly deep instruction queue (more than 40 entries, shared across warps). In 32-bit mode they are issued once every 4 clocks.
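If you want to check the pipelining yourself from CUDA C rather than hand-written SASS, something like the following is one way to start. It is only a sketch: the names `shared_probe` and `cycles_per_load` and the constants are my own choices, the loop and timer overhead are included in the numbers, and the compiler is free to schedule things differently than you expect, so look at the SASS before trusting the result. The idea is to time a dependent chain of shared loads (each address comes from the previous load) against the same number of independent loads. If the independent version costs noticeably fewer cycles per load, the loads are overlapping in the pipeline.

```cuda
#include <cuda_runtime.h>

constexpr int kIters = 256;   // loads timed per thread (arbitrary choice)
constexpr int kWords = 1024;  // 4 KB of shared memory

// mode 0: dependent chain, each address comes from the previous load.
// mode 1: independent loads, addresses do not depend on loaded data.
__global__ void shared_probe(int mode, long long *cycles, unsigned *sink)
{
    // volatile keeps the compiler from eliminating or merging the loads.
    __shared__ volatile unsigned ring[kWords];
    for (unsigned i = threadIdx.x; i < kWords; i += blockDim.x)
        ring[i] = (i + 32) % kWords;  // stride of 32 words keeps each lane in one bank
    __syncthreads();

    unsigned idx = threadIdx.x, acc = 0;
    long long start = clock64();
    if (mode == 0) {
        #pragma unroll 16
        for (int i = 0; i < kIters; ++i)
            idx = ring[idx];  // next address depends on this load
    } else {
        #pragma unroll 16
        for (int i = 0; i < kIters; ++i)
            acc += ring[(threadIdx.x + 32 * i) % kWords];  // addresses known up front
    }
    long long stop = clock64();

    sink[threadIdx.x] = idx + acc;  // keep the loaded values live
    if (threadIdx.x == 0)
        *cycles = stop - start;
}

// Runs a single warp and returns average cycles per load (negative on error).
double cycles_per_load(int mode)
{
    long long *d_cycles = nullptr;
    unsigned *d_sink = nullptr;
    long long h_cycles = 0;
    bool ok = cudaMalloc(&d_cycles, sizeof(long long)) == cudaSuccess
           && cudaMalloc(&d_sink, 32 * sizeof(unsigned)) == cudaSuccess;
    if (ok) {
        shared_probe<<<1, 32>>>(mode, d_cycles, d_sink);
        ok = cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return ok ? double(h_cycles) / kIters : -1.0;
}
```

Call `cycles_per_load(0)` and `cycles_per_load(1)` from your own `main` and compare the two values. Launching more warps per block is the obvious next step if you want to see how the queue behaves when it is shared.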

All memory instructions need to work to and from registers. So a global-to-shared copy would require one global load instruction, one shared store instruction, and a register to hold the intermediate value.
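In CUDA C that looks like the sketch below. The kernel and helper names (`stage_and_reverse`, `reverse_tiles_on_device`) and the tile size are made up for illustration. The explicit temporary `r` is only there to make the register visible in the source. Writing `tile[t] = in[base + t];` should compile to the same load/store pair.

```cuda
#include <cuda_runtime.h>

constexpr int kTile = 256;  // elements per block (illustrative choice)

// Stages one tile in shared memory, then writes it back out reversed.
__global__ void stage_and_reverse(const float *in, float *out)
{
    __shared__ float tile[kTile];
    int t = threadIdx.x;
    int base = blockIdx.x * kTile;

    float r = in[base + t];  // global load into a register
    tile[t] = r;             // shared store from that register
    __syncthreads();         // make the whole tile visible to the block

    out[base + t] = tile[kTile - 1 - t];  // shared load -> register -> global store
}

// Host helper: n must be a positive multiple of kTile. Returns true on success.
bool reverse_tiles_on_device(const float *h_in, float *h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = n > 0 && n % kTile == 0
           && cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        stage_and_reverse<<<n / kTile, kTile>>>(d_in, d_out);
        ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

If you disassemble this for sm_50 you should find the global load and the shared store as two separate instructions with a register between them.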