Weird behaviour with static shared memory with short ints

I was debugging a kernel and I realized the CUDA changes the type of the short to unsigned short when it accesses a value the first time. To overcome this behaviour, I needed to explicitly define shared array as signed short.

Smallest reproducable code snippet:

__shared__ short s_array_1[ARRAY_SIZE]; //P.S I have many shared arrays
__shared__ short s_array_2[ARRAY_SIZE];

short value_1 = s_array_1[sh_index];  //here the value of s_array_2 at index 0 is -5
short value_2 = s_array_2[sh_index];  //before executing this line, s_array_2 changes to unsigned(i.e value is 65531)

I am not sure if this is a bug, or it is a result of compiler optimization.

Edit: even after changing it to signed short, it is behaving instable. Cuda compiler might be allocating a sigle register for 2 shorts, I will investigate that.

Windows, Visual Studio. CUDA 10.1. Turing architecture(RTX 2080TI)

First, your comment on line 4 doesn’t align with the code.

short value_1 = s_array_1[sh_index];  //here the value of s_array_2 at index 0 is -5

Second, you should always initialize shared memory before reading it. If not, you are reading whatever was previously stored in that memory bank.

The comment there says that s_array_2[0] was -5 when the debugger was at that line. When the debugger goes to next line the value changes to 65531(even before accessing it). Weirdness is exactly there. I also initialize all values to 0 before writing or reading to/from shared array.

Oh okay, I see now. Looking at the SASS of your code snippet. Nothing is really happening. Again I’m pretty sure the debugger is just retrieving junk memory.

.headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;              /* 0x00000a0000017a02 */
                                                                       /* 0x000fd00000000f00 */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;       /* 0x000000fffffff389 */
                                                                       /* 0x000fe200000e00ff */
        /*0020*/                   EXIT ;                              /* 0x000000000000794d */
                                                                       /* 0x000fea0003800000 */
        /*0030*/                   BRA 0x30;                           /* 0xfffffff000007947 */
                                                                       /* 0x000fc0000383ffff */
        /*0040*/                   NOP;                                /* 0x0000000000007918 */
                                                                       /* 0x000fc00000000000 */
        /*0050*/                   NOP;                                /* 0x0000000000007918 */
                                                                       /* 0x000fc00000000000 */
        /*0060*/                   NOP;                                /* 0x0000000000007918 */
                                                                       /* 0x000fc00000000000 */
        /*0070*/                   NOP;                                /* 0x0000000000007918 */
                                                                       /* 0x000fc00000000000 */

This is how it looks in SASS. When I read the shared unsigned short to unsigned integer, there is no problem. I don’t quite understand the lines with PRMT in SASS but I suspect that it is using a single register for 2 short values and debugger cannot handle that.

unsigned short val_1 = s_array_1[sh_index];
0x00000165180d2820               MOV R6, R17  
0x00000165180d2830               SHF.R.S32.HI R7, RZ, 0x1f, R6  
0x00000165180d2840               MOV R6, R6  
0x00000165180d2850               MOV R7, R7  
0x00000165180d2860               MOV R8, R6  
0x00000165180d2870               MOV R9, R7  
0x00000165180d2880               MOV R8, R8  
0x00000165180d2890               MOV R9, R9  
0x00000165180d28a0               SHF.L.U64.HI R9, R8, 0x1, R9  
0x00000165180d28b0               SHF.L.U32 R8, R8, 0x1, RZ  
0x00000165180d28c0               MOV R3, 0x6400  
0x00000165180d28d0               MOV R3, R3  
0x00000165180d28e0               MOV R3, R3  
0x00000165180d28f0               MOV R3, R3  
0x00000165180d2900               MOV R5, RZ  
0x00000165180d2910               MOV R6, c[0x0][0x18]  
0x00000165180d2920               MOV R7, c[0x0][0x1c]  
0x00000165180d2930               IADD3 R6, P0, R3, R6, RZ  
0x00000165180d2940               IADD3.X R7, R5, R7, RZ, P0, !PT  
0x00000165180d2950               IADD3 R6, P0, R6, R8, RZ  
0x00000165180d2960               IADD3.X R7, R7, R9, RZ, P0, !PT  
0x00000165180d2970               MOV R6, R6  
0x00000165180d2980               MOV R7, R7  
0x00000165180d2990               MOV R6, R6  
0x00000165180d29a0               MOV R7, R7  
0x00000165180d29b0               LD.E.U16.SYS R6, [R6]  
0x00000165180d29c0               PRMT R6, R6, 0x7610, R6  
0x00000165180d29d0               PRMT R18, R6, 0x7610, R18  
			unsigned short val_2 = s_array_2[sh_index];
0x00000165180d29e0               MOV R6, R17  
0x00000165180d29f0               SHF.R.S32.HI R7, RZ, 0x1f, R6  
0x00000165180d2a00               MOV R6, R6  
0x00000165180d2a10               MOV R7, R7  
0x00000165180d2a20               MOV R8, R6  
0x00000165180d2a30               MOV R9, R7  
0x00000165180d2a40               MOV R8, R8  
0x00000165180d2a50               MOV R9, R9  
0x00000165180d2a60               SHF.L.U64.HI R9, R8, 0x1, R9  
0x00000165180d2a70               SHF.L.U32 R8, R8, 0x1, RZ  
0x00000165180d2a80               MOV R3, 0x9600  
0x00000165180d2a90               MOV R3, R3  
0x00000165180d2aa0               MOV R3, R3  
0x00000165180d2ab0               MOV R3, R3  
0x00000165180d2ac0               MOV R5, RZ  
0x00000165180d2ad0               MOV R6, c[0x0][0x18]  
0x00000165180d2ae0               MOV R7, c[0x0][0x1c]  
0x00000165180d2af0               IADD3 R6, P0, R3, R6, RZ  
0x00000165180d2b00               IADD3.X R7, R5, R7, RZ, P0, !PT  
0x00000165180d2b10               IADD3 R6, P0, R6, R8, RZ  
0x00000165180d2b20               IADD3.X R7, R7, R9, RZ, P0, !PT  
0x00000165180d2b30               MOV R6, R6  
0x00000165180d2b40               MOV R7, R7  
0x00000165180d2b50               MOV R6, R6  
0x00000165180d2b60               MOV R7, R7  
0x00000165180d2b70               LD.E.U16.SYS R6, [R6]  
0x00000165180d2b80               PRMT R6, R6, 0x7610, R6  
0x00000165180d2b90               PRMT R19, R6, 0x7610, R19

The problem is solved by reading the values to integers. The problem seems to with the debugger, because it produces correct results with both using ints or shorts.

Unless you are using short2, only one short is stored in each register.

https://devtalk.nvidia.com/default/topic/720550/a-question-about-calculation-of-integer-or-short-integer-and-float-data/

Which debugger are you using?

I am not explicitly using short2. I am using VC++ and CUDA Nsight next gen debugger. I think, it is a bug with debugger that it considers shorts as ints and shows the full register value. But actual result with both debugging and without is the correct.

Thank you for the info and the link.