I use 31 64-bit registers, 2 32-bit registers and 4 16-bit registers, i.e. 66 32-bit registers in total (31 × 2 + 2 + 4 / 2 = 66). On top of the 63 available registers, the compiler also uses 120 bytes of local memory, although I do not use local memory at all. And why as much as 120 bytes? It considerably slows down the procedure's execution. Is there any way to specify which values should be kept in register memory?
I see. Question: why, when 68 registers are used, does the compiler allocate as much as 120 bytes of local memory, when the spill should amount to only (68 − 63) = 5 registers × 4 bytes = 20 bytes? Can you explain why? You understand that local memory is not cached and is very slow compared to registers, so the less local memory the compiler uses, the better. I just do not see a pattern between the number of registers used and the amount of local memory allocated.
Point 3 is interesting. It turns out that the GPU adds registers of its own, in which it does its calculations, on top of my 68? And how can I find out how many registers the GPU will use for a calculation such as, for example, mul.wide.u16 temp, blockIdx, blockDim; ?
I.e., in fact, the documented maximum of 63 registers for Compute Capability 3.0 is all nonsense, because on top of it you still have to add an unknown number of registers from points 2 and 3? :(
I removed the threadIdx, blockDim and blockIdx registers from the procedure and it is still 120 bytes. It's a mystery of some sort.
threadIdx and friends are allocated regardless of their actual usage; they are simply passed from the runtime into your kernel. PTX isn't the final GPU code; you can compile down to SASS and look at the real code.
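For example, a minimal way to look at the real machine code (assuming the CUDA toolkit's nvcc and cuobjdump are on the PATH; the file names are illustrative):

    nvcc -arch=sm_30 -cubin kernel.cu -o kernel.cubin    # compile for a concrete target
    cuobjdump -sass kernel.cubin                         # disassemble to SASS

Compiling with nvcc -Xptxas -v also prints the per-kernel register, spill and local-memory usage that ptxas settled on.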
Locals are cached in the L1 & L2 caches, but since GPUs have more register memory than cache memory, even the L2 cache may not be large enough to hold them all. Try reducing the number of warps running per SM to see whether that improves performance.
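A minimal sketch of that experiment (the kernel body and the sizes here are placeholders, not the original code): launch the same kernel with progressively fewer threads per block and compare timings with CUDA events.

    #include <cstdio>
    #include <cstdint>

    __global__ void my_kernel(uint64_t *data)   // stand-in for the real hash kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += i;
    }

    // Time one launch configuration; fewer threads per block means fewer
    // resident warps per SM (at the cost of raw parallelism).
    float time_launch(uint64_t *d_data, int blocks, int threads)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        my_kernel<<<blocks, threads>>>(d_data);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        uint64_t *d;
        cudaMalloc(&d, 256 * 128 * sizeof(uint64_t));
        for (int threads = 128; threads >= 32; threads /= 2)
            printf("%3d threads/block: %.3f ms\n", threads, time_launch(d, 256, threads));
        cudaFree(d);
        return 0;
    }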
I suggest using .b32 for round; it may even decrease register pressure.
Also, it seems this is not your entire kernel. Functions are usually inlined on GPUs, and those 120 bytes of locals are allocated for the entire kernel (the function declared with __global__), not for a single function. Your entire kernel obviously loads data from memory and performs some address calculations. If the keccak function performs a lot of computation, those addresses, loop counters and other variables are probably saved to local memory when the keccak rounds start. So the inner loop may work entirely from registers, while local memory is used to store the variables of the calling procedure, like the stack on CPUs.
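A minimal sketch of that shape (all names are illustrative, not from the original kernel): the __device__ helper is inlined into the __global__ kernel, and it is the kernel's outer variables that may be spilled while the inlined hot loop runs.

    #include <cstdint>

    __device__ __forceinline__ uint64_t rotl64(uint64_t x, unsigned s)
    {
        return (x << s) | (x >> (64 - s));   // stand-in for a keccak-style primitive
    }

    __global__ void whole_kernel(const uint64_t *in, uint64_t *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // address calculation
        if (i >= n) return;
        uint64_t v = in[i];
        // If the inlined inner computation needs many registers, the compiler
        // may spill the outer variables (i, the loop counter r) to local memory
        // for the duration of this loop -- they belong to the whole kernel.
        for (int r = 0; r < 24; ++r)
            v ^= rotl64(v, r + 1);
        out[i] = v;
    }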
And by the way, _Z4iotaPx_param_0 is passed via a register too.
I.e., in fact, the documented maximum of 63 registers for Compute Capability 3.0 is all nonsense, because on top of it you still have to add an unknown number of registers from points 2 and 3? :(
Well, the GPU has access to 63 registers (it's a 6-bit field in the instruction, so 64 encodings, one of which is apparently reserved), but they are shared between your PTX code and everything around it. If you wrote the entire kernel in SASS, you could use them all, but there is no official SASS compiler :)
And how can I find out how many registers the GPU will use for a calculation such as, for example, mul.wide.u16 temp, blockIdx, blockDim; ?
The only way is to look at the SASS code. GPUs aren't backward compatible, but PTX is, so one PTX instruction may be emulated by multiple SASS instructions. See "Multiple instructions" in section "5.4.1. Arithmetic Instructions" of the manual. Just as an example, Maxwell emulates a 32-bit multiplication with three multiply-add instructions having 16-bit inputs and a 32-bit result.
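The arithmetic behind that emulation, as a sketch in plain C (this mirrors the idea, not the exact SASS instruction sequence): the high 16×16 product falls entirely above bit 31, so a*b mod 2^32 needs only three 16-bit multiplies.

    #include <cstdint>

    // a*b mod 2^32 from 16-bit halves: (al + ah*2^16)(bl + bh*2^16)
    // = al*bl + (al*bh + ah*bl)*2^16 + ah*bh*2^32, and the last term vanishes.
    uint32_t mul32_from_16(uint32_t a, uint32_t b)
    {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        return al * bl + ((al * bh + ah * bl) << 16);   // three multiplies total
    }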
Of course this is not the whole kernel; there are a lot of calculations. I just tried to write the keccak function with the smallest possible set of registers, and it does not use shared memory. I.e., I wanted the function to achieve maximum performance. It's a pity there is no SASS compiler. It turns out you write one thing and get something quite different, not what you expected.
It seems that you know CPUs and are using that knowledge to interpret your GPU experience. But hey, they are different! PTX is not machine code; it's more like a "portable assembler", compatible between GPU generations. There is no stack for saving the calling procedure's registers; local memory is used instead. And registers/memory are allocated for the entire kernel, not for a single function call.
I understand the difference between the CPU and the GPU. I wrote the keccak function with 25 threads working simultaneously, but the big obstacle to the hashrate was the shared memory restriction. I.e., I combined 25 threads into 1 block, and one block used 440 bytes of shared memory, so at most 49152 / 440 = 111 blocks can run simultaneously. From one block I got 80K hashes, 111 × 80K = 8.8 million hashes in total. That is not enough. So I went another way: now each thread computes a hash on its own, and the hashrate is 16,000,000 with 128 threads and 256 blocks.
But this is not enough
Yes, it's a typical problem for GPUs: if you try to perform a lot of computation per thread, you run out of shared memory/registers; if you combine the work of multiple threads, you get slow communication.
Look at using 2-16 threads per computation; that way you will hit the golden middle.
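A minimal skeleton of that middle ground (G, the kernel name and the layout are illustrative assumptions, not the original code): G threads share one 25-lane state, so each thread carries only 25/G lanes' worth of register pressure while communication stays inside one block.

    #include <cstdint>

    constexpr int G = 5;                   // threads cooperating per hash, try 2..16
    constexpr int THREADS_PER_BLOCK = 160; // must be a multiple of G
    constexpr int GROUPS_PER_BLOCK = THREADS_PER_BLOCK / G;

    __global__ void keccak_groups(const uint64_t *msg, uint64_t *digest, int n)
    {
        __shared__ uint64_t state[GROUPS_PER_BLOCK][25];  // one 25-lane state per group

        const int lane  = threadIdx.x % G;                        // this thread's slice
        const int local = threadIdx.x / G;                        // group within the block
        const int group = blockIdx.x * GROUPS_PER_BLOCK + local;  // global hash index
        const bool active = group < n;

        if (active)                                   // cooperative load of the state
            for (int i = lane; i < 25; i += G)
                state[local][i] = msg[group * 25 + i];
        __syncthreads();

        // ... keccak rounds here, each thread updating lanes lane, lane+G, ... ...

        if (active)                                   // cooperative store of the result
            for (int i = lane; i < 25; i += G)
                digest[group * 25 + i] = state[local][i];
    }

With G = 5 each thread owns exactly one 5-lane slice of the 5×5 state, which keeps the indexing regular; smaller G trades more registers per thread for fewer synchronizations.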
[1] PTXAS is a compiler, not an assembler. PTX serves as a compiler intermediate format as well as a virtual instruction set architecture, and all registers at PTX level are virtual: they get assigned to physical registers by the register allocation phase of PTXAS. Use of PTX does not give tight control over the generated machine code (SASS). You may want to look into third-party assembler tools, such as Scott Gray’s Maxwell assembler, to program directly at the SASS level. Note: SASS programming is not something I would recommend, for a plethora of reasons. In most cases, I wouldn’t even recommend programming at PTX level.
[2] Just like there are ABIs defined for various CPU architectures, there is an ABI defined for NVIDIA GPUs. This is needed to support, for example, separate compilation, device-side printf(), and many C++ features. This ABI is used by default, unless you explicitly tell the compiler otherwise. The ABI does include the use of a stack. Use of the ABI may not be obvious when examining the machine code for simple programs consisting of a single compilation unit as the CUDA compiler will optimize out most ABI overhead if it is not needed, in particular through whole-code optimizations when there is no separate compilation requested. But some vestigial instructions (such as the setting up of a stack pointer) can sometimes still be observed even in that case.