I ran into trouble using a for loop in a CUDA kernel.
I have already parallelized the problem over a 3-dimensional index, but I still need a for loop within each thread.
The loop counter is going haywire, and it is driving me nuts too.
Here is a simplified kernel that does not behave as I expect. I hope you can point out the fundamentals I am missing!
__global__ void update_theta_kernel(volatile float *Z, volatile float *beta, volatile float *theta,
                                    int SampleSize)
{
    float priv_theta_sth = 0.0f;
    __syncthreads();
    for (int i = 0; i < 3; i += 1) {
        printf("priv_theta = %d in loop i = %u\n", priv_theta_sth, i);
        printf("loop i = %u\n", i);
    }
    __syncthreads();
}
The first printf does not show the counter updating, while the second one does, and I don't know why.
What do you mean by "it does not update"? The counter goes from 0 to 1 after the first iteration, and it is incremented again after every following iteration. During the first iteration the counter is 0. That is how for loops work; there is nothing CUDA-specific about it.
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
The output was as shown above.
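One likely explanation for that output is a format-specifier mismatch rather than a loop problem. The first printf passes a float (priv_theta_sth) to %d and an int (i) to %u. As in host C/C++, device printf relies on the specifiers matching the argument types, so a mismatch can make later arguments print wrong values, which would fit the first line always showing i = 0 while the second printf shows the real counter. The sketch below assumes that diagnosis; the kernel name print_loop_kernel and the helper run_print_demo are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel name; it mirrors the loop from the question.
__global__ void print_loop_kernel()
{
    float priv_theta_sth = 0.0f;
    for (int i = 0; i < 3; i += 1) {
        // %f matches the float argument, %d matches the signed int counter.
        printf("priv_theta = %f in loop i = %d\n", priv_theta_sth, i);
    }
}

// Hypothetical helper: launches 1 block of 8 threads, returns 0 on success.
int run_print_demo()
{
    print_loop_kernel<<<1, 8>>>();
    // Device printf output is flushed at synchronization points such as this one.
    cudaError_t err = cudaDeviceSynchronize();
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return run_print_demo();
}
```

With matching specifiers, each thread should report i = 0, 1, 2 in turn. The interleaving of lines between threads is not guaranteed.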
Thanks, that's really inspiring. The bigger issue I ran into occurs when I use the counter to index a long matrix: the counter overflows and the pointer goes out of bounds. Do you have any ideas on that? Or maybe I will start a new topic later.
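Without the actual indexing code it is hard to say what goes wrong, but two common precautions are to compute flattened indices in a 64-bit type such as size_t (so that row * cols cannot wrap around a 32-bit int on large matrices) and to guard every thread with a bounds check. The following is only a sketch under those assumptions; row_sum_kernel, run_row_sum_demo, and the row-major layout are hypothetical and not taken from the original kernel.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums one row of a row-major matrix.
__global__ void row_sum_kernel(const float *mat, float *out, size_t rows, size_t cols)
{
    // Widen before multiplying so the global index is computed in 64 bits.
    size_t row = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;              // threads past the last row do nothing

    float acc = 0.0f;
    for (size_t j = 0; j < cols; ++j) {
        size_t idx = row * cols + j;      // 64-bit flattened index
        acc += mat[idx];
    }
    out[row] = acc;
}

// Hypothetical helper: runs the kernel on a tiny all-ones matrix, returns 0 on success.
int run_row_sum_demo()
{
    const size_t rows = 4, cols = 5;
    float h_mat[rows * cols];
    float h_out[rows];
    for (size_t k = 0; k < rows * cols; ++k) h_mat[k] = 1.0f;

    float *d_mat = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_mat, sizeof(h_mat)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_mat); return 1; }
    cudaMemcpy(d_mat, h_mat, sizeof(h_mat), cudaMemcpyHostToDevice);

    const unsigned int threads = 32;
    const unsigned int blocks = (unsigned int)((rows + threads - 1) / threads);
    row_sum_kernel<<<blocks, threads>>>(d_mat, d_out, rows, cols);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_mat);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    // Every row of ones should sum to the number of columns.
    for (size_t r = 0; r < rows; ++r)
        if (h_out[r] != (float)cols) return 1;
    return 0;
}

int main()
{
    return run_row_sum_demo();
}
```

If the out-of-bounds access persists with 64-bit indices and a bounds guard, running the program under compute-sanitizer is a reasonable next step to locate the offending access.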