Run the 640 iterations in parallel instead of sequentially, by creating 640 threads (20 blocks of 32 threads per block):
__global__ void KerFun() {
int j = blockIdx.x * blockDim.x + threadIdx.x;
// do here what you would do within the loop.
}
main() {
// set up for kernel execution
KerFun<<<20,32,0>>>();
}
It’s not possible to spawn threads from within kernel code, so from one invocation of KerFun you can’t launch multiple threads for the 640 executions DevFun.
If KerFun needs to be run only once, you could perform the processing on the host and pass the information, or invoke it as a separate kernel:
__global__ void DevFun() {
int j = blockIdx.x * blockDim.x + threadIdx.x;
// do here what you would do within the loop.
}
__global__ void KerFun() {
// do here what you want to do only once.
}
main() {
// set up for kernel execution
KerFun<<<1,1,0>>>();
DevFun<<<20,32,0>>>();
}