I’m trying to port my code from a C1060 to a C2050, and when I profile the application I see that a kernel running on the C2050 uses more registers per thread. For example, the code shown below uses 2 registers per thread on the C1060 and 3 on the C2050.
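__global__ void func(int arg1, int arg2, int *arg3) {
    if (threadIdx.x >= arg2) return;   // threads past the limit do nothing
    arg3[threadIdx.x] = arg1;
}
int main(int argc, char *argv[]) {
    int *d_array;                      // device buffer for 512 ints
    cudaMalloc((void **)&d_array, sizeof(int) * 512);
    func<<<1, 512>>>(512, 512, d_array);
    return 0;
}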
My kernel’s occupancy on each SM is limited by the number of registers available on the SM. Can you please tell me what is happening on the C2050, and how I can reduce the number of registers per thread there?
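For anyone who wants to reproduce the register counts quoted above without the profiler, the CUDA runtime can report them through `cudaFuncGetAttributes`. This is only a minimal sketch, assuming the same `func` kernel as in the post and a single default device; the printed values depend on the GPU and the compile target, so none are shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the post.
__global__ void func(int arg1, int arg2, int *arg3) {
    if (threadIdx.x >= arg2) return;
    arg3[threadIdx.x] = arg1;
}

int main() {
    cudaFuncAttributes attr;
    // Ask the runtime for the compiled kernel's resource usage on the current device.
    cudaError_t err = cudaFuncGetAttributes(&attr, func);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // numRegs is the per-thread register count chosen by the compiler.
    std::printf("registers per thread: %d\n", attr.numRegs);
    // Largest block size this kernel can be launched with, given its resource usage.
    std::printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Building with `nvcc --ptxas-options=-v` and `-arch=sm_13` (C1060) or `-arch=sm_20` (C2050) should print the same per-thread register usage at compile time, which makes it easy to compare the two targets side by side.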