Below i’m placing some little code. When i uncomment one line:
unsorted_thread=1;
i get an error: “unspecified launch failure”.
Why isn’t this code correct?
__global__ void check_order(elem *g_elems, sum* g_sums, int n_real,int n,int num_blocks,int num_blocks2){
const int threads_num=blockDim.x; // number of threads in each block
const int bid=blockIdx.x; // given block's number
const int thid=threadIdx.x; // thread's number in given block
const int thread_elems_num=n/threads_num; // number of elements in ech thread
const int begin=bid*n+thid*thread_elems_num;
extern __shared__ int absolute_shared[];
int* shared=(int*)&absolute_shared[0];
int thread_elems[MAX_REGISTERS_PER_THREAD];
//
for(int i=0;i<thread_elems_num;++i){
thread_elems[i]=g_elems[begin+i].val;
}
if(bid==num_blocks-1 && thid==threads_num-1)
thread_elems[thread_elems_num]=INT_MAX;
else
thread_elems[thread_elems_num]=g_elems[begin+thread_elems_num].val;
int unsorted_thread=0;
for(int i=0;i<thread_elems_num-1 && i<MAX_REGISTERS_PER_THREAD-1;++i){
if(thread_elems[i]>thread_elems[i+1]){
// unsorted_thread=1;
break;
}
}
__syncthreads();
shared[thid]=unsorted_thread;
return;
}
The code probably is correct. The block size you are using probably isn’t. Uncommenting that 1 line probably uses an extra register and puts your execution parameters out of limits. This is discussed in Chapter 4 of the programming guide. You can see the number of registers your kernel uses by passing the -verbose argument to ptxas during compilation.
–verbose says nothing about block size and registers. When i uncomment this line and add (to if-block) “else” with the same line, it works. So there’s a problem with registers. How can i solve this? I want rewrite values from global memory to registers (thread_elems), next compare themselves (these registers) and write result to unsorted_thread. At the end i rewrite result from unsorted_thread to shared memory. I call kernel with one block, two threads and shared memory size=16336.
Run the code in Ocelot to find the exact line and cause of the error. Nexus and cuda-gdb might also work.
A likely cause: you’re not providing enough dynamic shared memory for the number of threads you’re using per block. You need to specify threadcount*sizeof(int) as your shared memory size at the kernel invocation. Perhaps you forgot this, or specified just threadcount?
Next, don’t overanalyze that one line you comment out. You’re underestimating the optimizations the compiler can perform. with that commented out, your entire program likely reduces by successive dead code reduction down to a single line or two.
Finally, I assume this isn’t really your program, since in practice it never writes anything to global memory so the whole kernel is just a no-op.
I resized shared memory size (from 16336 to threadcount*sizeof(int)) but still got the same error.
It is my program, but the rest of code i commented and didn’t paste there, so you can’t see writing to global memory. (I must commented next code lines while error detecting.) But that i commented can’t be a problem’s reason, yes? OK, i’ll restore it.
I’ll try with cuda-gdb. Thanks! But are there any other ideas to solve this problem?
avidday@cuda:~/code/Jacobi$ nvcc -arch=sm_13 -Xptxas "--verbose" jacobi.cu -o jacobi
ptxas info : Compiling entry function '_Z10JORKernelRPKfS0_S0_Pf' for 'sm_13'
ptxas info : Used 10 registers, 808+16 bytes smem, 12 bytes cmem[0], 12 bytes cmem[1]
ptxas info : Compiling entry function '_Z10JORKernelCPKfS0_S0_Pf' for 'sm_13'
ptxas info : Used 8 registers, 32+16 bytes smem, 12 bytes cmem[0], 8 bytes cmem[1]
ptxas info : Compiling entry function '_Z12JORResidualRPKfS0_S0_Pf' for 'sm_13'
ptxas info : Used 9 registers, 800+16 bytes smem, 12 bytes cmem[0], 4 bytes cmem[1]
ptxas info : Compiling entry function '_Z12JORResidualCPKfS0_S0_Pf' for 'sm_13'
ptxas info : Used 6 registers, 32+16 bytes smem, 12 bytes cmem[0]
Right there you have the per thread register, shared memory and constant memory usage for each kernel. If you read the section of the programming guide I referred you to, you can calculate the block size limit for each kernel you need to launch.