Debugging shared memory with cuda-gdb

Hi,
I have run into a debugging problem; the following is part of my code:

...
    if (tid == 0) {
#pragma unroll
        for (int s = 0; s < NUM_GUESS; s++) {
            printf("block id is %d, s is %d, state is %d\n", blockIdx.x, s, states[s]);
        }
    }
    
    /* Write local results into shared memory*/
#pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        int t_index = s * BLOCK_SIZE + tid;
        s_states[t_index] = states[s];
        if(tid == 0){
            printf("immediately, block id is %d, index is %d, state is %d\n", blockIdx.x, t_index, s_states[t_index]);
        }
    }

    __syncthreads();

    if(tid == 0){
#pragma unroll
        for (int s = 0; s < NUM_GUESS; s++) {
            int t_index = s * BLOCK_SIZE + tid;
            printf("block id is %d, s is %d, index is %d, state is %d\n", blockIdx.x, s, t_index, s_states[t_index]);
        }
    }

Here, I write the values of states into shared memory. (Since the array states is small and its access pattern is static, the compiler keeps it in registers.) Then, after a __syncthreads(), I read the values back from shared memory and check them for consistency.

I tested the program with a single thread block launched; the GPU is a P100.
The compilation command is:
/usr/local/cuda/9.2.88/bin/nvcc -gencode arch=compute_60,code=sm_60 -std=c++11 -O3 -o parallel_shared_2 parallel_shared_2.cu

parallel_shared_2.cu is the source file.
The output of this code section is as follows:
block id is 0, s is 0, state is 1
block id is 0, s is 1, state is 1
block id is 0, s is 2, state is 4
block id is 0, s is 3, state is 6
immediately, block id is 0, index is 0, state is 1
immediately, block id is 0, index is 1024, state is 1
immediately, block id is 0, index is 2048, state is 4
immediately, block id is 0, index is 3072, state is 6
block id is 0, s is 0, index is 0, state is 1
block id is 0, s is 1, index is 1024, state is 8
block id is 0, s is 2, index is 2048, state is 4
block id is 0, s is 3, index is 3072, state is 6

Here, the second element has changed to 8 instead of 1.

Then I tried to use cuda-gdb to debug this problem.
I changed the compilation command to:
/usr/local/cuda/9.2.88/bin/nvcc -std=c++11 -g -G -O0 -o parallel_shared_2 parallel_shared_2.cu

and received this message:
warning: Cuda API error detected: cudaLaunchKernel returned (0x7)

I think this is because with this compilation command there is little or no register reuse. As a result, each thread requires too many registers and the kernel cannot launch (error 0x7 is cudaErrorLaunchOutOfResources in CUDA 9.2).
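One way to check whether register pressure really is the problem is to ask ptxas to report per-kernel resource usage. A sketch of the debug build with verbose ptxas output added (paths and flags as in the command above):

```shell
# -Xptxas -v makes ptxas print registers, shared memory, and spill
# usage per kernel, so the register count can be compared between
# the -O3 build and the -g -G debug build.
/usr/local/cuda/9.2.88/bin/nvcc -std=c++11 -g -G -Xptxas -v \
    -gencode arch=compute_60,code=sm_60 \
    -o parallel_shared_2 parallel_shared_2.cu
```

If the reported register count times the block size exceeds the registers available per SM, the launch fails with this error.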

Any suggestions for me to debug this problem?

Thanks a lot!
Yang Xia

If you believe register usage is holding you up in a particular compilation setting, you can instruct the compiler to limit its register use with the -maxrregcount switch, which is documented in the nvcc manual.

Thanks for your help!