Hi,
I have run into a debugging problem. The following is part of my code:
...
if (tid == 0) {
    #pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        printf("block id is %d, s is %d, state is %d\n", blockIdx.x, s, states[s]);
    }
}
/* Write local results into shared memory */
#pragma unroll
for (int s = 0; s < NUM_GUESS; s++) {
    int t_index = s * BLOCK_SIZE + tid;
    s_states[t_index] = states[s];
    if (tid == 0) {
        printf("immediately, block id is %d, index is %d, state is %d\n", blockIdx.x, t_index, s_states[t_index]);
    }
}
__syncthreads();
if (tid == 0) {
    #pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        int t_index = s * BLOCK_SIZE + tid;
        printf("block id is %d, s is %d, index is %d, state is %d\n", blockIdx.x, s, t_index, s_states[t_index]);
    }
}
Here, I write the values of states into shared memory. (Since the states array is small and its access pattern is static, the compiler should keep it in registers.) Then, after the __syncthreads(), I read the values back from shared memory and check that they match what was written.
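To make the pattern concrete, here is a minimal self-contained sketch of what this section does. It is not my actual kernel: the kernel name write_then_check, the helper shared_index, the mismatch counter, and the placeholder values in states are all made up for illustration. NUM_GUESS = 4 and BLOCK_SIZE = 1024 are assumptions that match the four printed rows and the indices 0, 1024, 2048, 3072, and I assume s_states holds int.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed values, chosen to match the printed indices 0, 1024, 2048, 3072.
#define NUM_GUESS 4
#define BLOCK_SIZE 1024

// Hypothetical helper: same layout as t_index = s * BLOCK_SIZE + tid.
__host__ __device__ inline int shared_index(int s, int tid) {
    return s * BLOCK_SIZE + tid;
}

// Hypothetical stand-in kernel; the real kernel computes states[] elsewhere.
__global__ void write_then_check(int *mismatches) {
    __shared__ int s_states[NUM_GUESS * BLOCK_SIZE];  // 16 KB of static shared memory
    int tid = threadIdx.x;

    int states[NUM_GUESS];
    #pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        states[s] = s + tid;  // placeholder values, not the real computation
    }

    // Write local results into shared memory.
    #pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        s_states[shared_index(s, tid)] = states[s];
    }
    __syncthreads();

    // Every thread re-reads its own slots and counts any value that changed.
    #pragma unroll
    for (int s = 0; s < NUM_GUESS; s++) {
        if (s_states[shared_index(s, tid)] != states[s]) {
            atomicAdd(mismatches, 1);
        }
    }
}

int main() {
    int *d_mismatches = NULL;
    int h_mismatches = -1;
    cudaMalloc(&d_mismatches, sizeof(int));
    cudaMemset(d_mismatches, 0, sizeof(int));

    write_then_check<<<1, BLOCK_SIZE>>>(d_mismatches);
    cudaError_t err = cudaGetLastError();          // catches launch failures
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();             // catches execution failures
    }
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int), cudaMemcpyDeviceToHost);
    printf("mismatches: %d\n", h_mismatches);
    cudaFree(d_mismatches);
    return 0;
}
```

In this isolated form every thread writes only to its own indices, so I would not expect any mismatches. If the real kernel still shows one, something outside this section (for example another write into s_states) is probably involved.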
I tested the program with a single thread block on a P100 GPU.
The compilation command is:
/usr/local/cuda/9.2.88/bin/nvcc -gencode arch=compute_60,code=sm_60 -std=c++11 -O3 -o parallel_shared_2 parallel_shared_2.cu
where parallel_shared_2.cu is the source file.
The output of this code section is as follows:
block id is 0, s is 0, state is 1
block id is 0, s is 1, state is 1
block id is 0, s is 2, state is 4
block id is 0, s is 3, state is 6
immediately, block id is 0, index is 0, state is 1
immediately, block id is 0, index is 1024, state is 1
immediately, block id is 0, index is 2048, state is 4
immediately, block id is 0, index is 3072, state is 6
block id is 0, s is 0, index is 0, state is 1
block id is 0, s is 1, index is 1024, state is 8
block id is 0, s is 2, index is 2048, state is 4
block id is 0, s is 3, index is 3072, state is 6
Here, the second element (index 1024) reads back as 8 after the sync, even though it was 1 immediately after the write.
I then tried to debug this problem with cuda-gdb.
I changed the compilation command to:
/usr/local/cuda/9.2.88/bin/nvcc -std=c++11 -g -G -O0 -o parallel_shared_2 parallel_shared_2.cu
With this build, I get the following warning:
warning: Cuda API error detected: cudaLaunchKernel returned (0x7)
I think this is because, with this compilation command, there is little or no register reuse. As a result, each thread requires too many registers and the kernel cannot be launched.
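If I read the CUDA 9.2 headers correctly, 0x7 is cudaErrorLaunchOutOfResources, which would be consistent with this guess. Below is a rough sketch of how the register hypothesis could be checked from the host side, using cudaFuncGetAttributes and cudaGetDeviceProperties. The kernel my_kernel is only a placeholder for the real one, fits_in_register_file is a hypothetical helper, and BLOCK_SIZE = 1024 is assumed from the printed indices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 1024  // assumed from the printed indices

// Hypothetical helper. Rough check only: it ignores the hardware's
// register allocation granularity, so it can be slightly optimistic.
inline bool fits_in_register_file(int regs_per_thread, int threads_per_block,
                                  int regs_per_block) {
    return regs_per_thread * threads_per_block <= regs_per_block;
}

// Placeholder kernel; substitute the real kernel here.
__global__ void my_kernel(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime what the compiled kernel actually needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:           %d\n", attr.numRegs);
    printf("static shared memory:           %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);
    printf("registers per block (device):   %d\n", prop.regsPerBlock);
    printf("rough fit for %d threads:       %s\n", BLOCK_SIZE,
           fits_in_register_file(attr.numRegs, BLOCK_SIZE, prop.regsPerBlock)
               ? "yes" : "no");

    int *d_out = NULL;
    cudaMalloc(&d_out, BLOCK_SIZE * sizeof(int));

    my_kernel<<<1, BLOCK_SIZE>>>(d_out);
    err = cudaGetLastError();  // reports launch errors such as out-of-resources
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

If maxThreadsPerBlock for the debug build comes out below the block size, that would support the register explanation. Compiling with --ptxas-options=-v should report the same register count at build time.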
Do you have any suggestions for how to debug this problem?
Thanks a lot!
Yang Xia