Threadless programming model

Is there consistent documentation of the latest GPU architecture that drops the misleading “thread” model and instead describes array operations, the registers, and their latency for different operations?

Do you mean thinking of warps as distinct “processes” rather than threads? If so, there is no documentation beyond the standard material, but the two views don’t differ much. If you’re looking for the latency of SASS instructions, it is usually 6 cycles because of Pascal’s pipeline depth, but some instructions take longer (feel free to ask me about any specific ones). In general, all registers function the same way (unlike x86). What do you mean by array operations?

In general, the machine code (SASS) is not well documented to date. For the most part, there are no published latency numbers either. NVIDIA also doesn’t provide an assembler, so programming in machine code would require a third-party assembler.

The whole concept of threads on a GPU is quite misleading. I might as well claim that a machine with 128-bit registers runs 128 threads, each 1 bit wide, but that is not true: it operates on a single 128-bit register. The register may be placed in different parts, but as long as it is a single operand, it is still a single register. No matter how you split it, it is still a 128-bit register, and NOT threads.
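The analogy in this post can be checked with a few lines of Python. This is only an illustrative sketch, not a model of any real hardware, and the helper name `and_per_bit` is made up for the example. It compares one 128-bit bitwise AND against 128 independent 1-bit ANDs and shows that, for a plain bitwise operation, the two descriptions cannot be told apart.

```python
import random

WIDTH = 128  # register width in bits, as in the analogy above

def and_per_bit(a, b, width=WIDTH):
    """Hypothetical helper: AND two registers one bit ("thread") at a time."""
    result = 0
    for lane in range(width):
        bit_a = (a >> lane) & 1
        bit_b = (b >> lane) & 1
        result |= (bit_a & bit_b) << lane
    return result

# Two arbitrary 128-bit operands (fixed seed so the run is repeatable).
rng = random.Random(0)
a = rng.getrandbits(WIDTH)
b = rng.getrandbits(WIDTH)

whole_register = a & b           # one operation on one 128-bit register
bit_by_bit = and_per_bit(a, b)   # 128 separate 1-bit "threads"

print(whole_register == bit_by_bit)
```

The two views only start to differ once individual lanes can follow different control flow, which a plain 128-bit register cannot do.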

There should be some documentation of array operations that is closer to the truth, covering things such as the actual register size, the array element size, etc. I think of “warps” as single array operations, not as actual threads by any means. As long as the same instruction is executed on the whole array, it represents 1 thread and 1 register, as an array.
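The array view also has to say what happens when the elements do not all want the same instruction. A common simplified description is that a compare produces a per-lane mask and each side of a branch then runs once over the whole array, writing only the lanes the mask selects. The Python below is a rough sketch of that idea under this assumption; `warp_select` and the toy branch it models are invented for illustration, and real hardware scheduling is more involved.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs

def warp_select(reg, threshold):
    """Model `if x > threshold: x = x * 2 else: x = x + 1` on one warp-wide register."""
    # One compare over the whole array yields a per-lane predicate mask.
    mask = [x > threshold for x in reg]
    out = list(reg)
    # Pass 1: the "then" path runs once and writes only lanes whose mask bit is set.
    for lane in range(WARP_SIZE):
        if mask[lane]:
            out[lane] = reg[lane] * 2
    # Pass 2: the "else" path runs once and writes only the remaining lanes.
    for lane in range(WARP_SIZE):
        if not mask[lane]:
            out[lane] = reg[lane] + 1
    return out, mask

reg = list(range(WARP_SIZE))       # lane i holds the value i
out, mask = warp_select(reg, 15)   # lanes 16..31 take the "then" path
print(out[15], out[16], sum(mask))
```

In this model the cost of a divergent branch is visible directly: both passes walk the full array even though each lane keeps the result of only one of them.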

Yes, each register can be thought of as 128 bytes long (32 lanes of 4 bytes each), but keep in mind that operations on those registers still only see the smaller values within them. For example, FMUL multiplies each 4-byte chunk of two 128-byte registers together and stores the results in another 128-byte register. Thinking in terms of threads is fine for most things; you just have to be careful and remember that warps exist for control flow and for memory loads/stores (to be efficient, at least).
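To make the 128-byte picture concrete, here is a small Python sketch that stores a warp-wide register as 128 raw bytes and implements a lane-wise multiply over it. It is a toy model only: the function names `pack_reg`, `unpack_reg` and `fmul` are made up here, and nothing in it reflects how SASS or the hardware is actually implemented.

```python
import struct

WARP_SIZE = 32                    # lanes per register
LANE_BYTES = 4                    # one 32-bit float per lane
LAYOUT = f"<{WARP_SIZE}f"         # 32 little-endian float32 values

def pack_reg(values):
    """Pack 32 Python floats into one 128-byte register image."""
    return struct.pack(LAYOUT, *values)

def unpack_reg(reg):
    """View a 128-byte register image as 32 float32 lanes."""
    return struct.unpack(LAYOUT, reg)

def fmul(ra, rb):
    """Multiply two 128-byte registers lane by lane into a new 128-byte register."""
    products = (x * y for x, y in zip(unpack_reg(ra), unpack_reg(rb)))
    return pack_reg(products)

r2 = pack_reg([float(i) for i in range(WARP_SIZE)])  # lane i holds i
r3 = pack_reg([0.5] * WARP_SIZE)                     # every lane holds 0.5
r1 = fmul(r2, r3)

print(len(r1), unpack_reg(r1)[10])
```

Each lane only ever sees its own 4-byte chunk of the two operands, which is the point made above: the register is wide, but the arithmetic is still per element.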

That is what I meant. I need some consistent documentation on array registers rather than warps or whatever, because I find them highly misleading.
FMUL.i32 r1[0…31], r2[0…31], r3[0…31]
Something like that, which I find correct, as opposed to fake “threads”.