Hi guys! I know this is a difficult question because it depends on how the kernel is implemented, but I wanted to know: which line of reasoning should I follow?
For example, I’ve tested a simple kernel like this one:
__global__ void copy(float* A, int A_size, float* B, int B_size) { // dim B != dim A
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bd  = blockDim.x;
    int gd  = gridDim.x;
    if (B_size < A_size) {
        // grid-stride loop over the rows of the smaller matrix B
        while (bid < B_size) {
            // block-stride loop over the elements of one row
            while (tid < B_size) {
                A[tid + bid * A_size] = B[tid + bid * B_size];
                tid += bd;
            }
            tid = threadIdx.x; // reset column index for the next row
            bid += gd;
        }
    }
    else {
        // same pattern, but iterating over the dimension of A
        while (bid < A_size) {
            while (tid < A_size) {
                A[tid + bid * A_size] = B[tid + bid * B_size];
                tid += bd;
            }
            tid = threadIdx.x;
            bid += gd;
        }
    }
}
It simply copies a linearized matrix into another one of a different dimension. This kernel is called many times, on the order of 10^6, with matrices of dimension 100 (10^4 elements). If I launch it with 313 blocks and 1024 threads per block, it is much slower than if I launch it with 313 blocks and 32 threads per block.