CUDA program issue, for loop

I ran into trouble using a for loop in a CUDA kernel.
I have already parallelized the problem with a 3-dimensional thread index, but I still need a for loop within each thread.
The loop counter behaves strangely, and it is driving me nuts.
Here is a simplified kernel that does not run as I expect. I hope you can point out the fundamentals I am missing!

__global__ void update_theta_kernel(volatile float *Z, volatile float *beta, volatile float *theta,
                                    int SampleSize)
{
    float priv_theta_sth = 0.0f;

    __syncthreads();
    for (int i = 0; i < 3; i += 1) {
        printf("priv_theta = %d in loop i = %u\n", priv_theta_sth, i);
        printf("loop i = %u\n", i);
    }
    __syncthreads();
}

The first printf does not show the updated counter, while the second one does. I don't know why.

What do you mean by "it does not update it"? The counter goes from 0 to 1 after the first iteration, and it is incremented again after each following iteration. In the first iteration the counter is 0. That is how for loops work; there is nothing CUDA-specific about it.

priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
loop i = 1
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
priv_theta = 0 in loop i = 0
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
loop i = 2
The output looked like the listing above.

Perhaps also print threadIdx and blockIdx with each line, so you know which thread it comes from. Each thread has its own counter.
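A minimal sketch of that suggestion, in case it helps. The names `tagged_loop_kernel` and `run_tagged_loop` are made up for illustration and are not from the original code; the kernel only tags each line with the block and thread indices and prints the loop counter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread runs its own copy of the loop and
// prefixes each line with its block and thread indices.
__global__ void tagged_loop_kernel()
{
    for (int i = 0; i < 3; ++i) {
        // blockIdx/threadIdx components are unsigned int (%u); i is int (%d).
        printf("block (%u,%u,%u) thread (%u,%u,%u): loop i = %d\n",
               blockIdx.x, blockIdx.y, blockIdx.z,
               threadIdx.x, threadIdx.y, threadIdx.z, i);
    }
}

// Hypothetical host helper: launches the kernel and waits for it to finish,
// which also flushes the device-side printf buffer.
cudaError_t run_tagged_loop(dim3 grid, dim3 block)
{
    tagged_loop_kernel<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();   // launch configuration errors
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();         // execution errors
}
```

Calling `run_tagged_loop(dim3(1), dim3(2, 2, 2))` from `main` would give eight threads, matching the eight repeated lines in the output above. The order in which lines from different threads appear is not guaranteed.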

I understand your issue is with loop i = 0.

Could you run just one block and one thread? Could you use the correct %d instead of %u?

Yeah, I tried exactly what you suggested. But in that case, apparently, none of the threads come up with the right answer.

%d gives the same answer. Running one block with one thread may not help with acceleration.

Yes, but the first question is what is going wrong, and whether there is a bug.

For the float value you need %f. The wrong specifier could corrupt how the rest of that printf line is read.


Wow, %f solves this issue!

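Sadly, the compiler does not report this as an error.
printf arguments are passed and read differently depending on their type, so a format specifier that does not match its argument can cause the arguments to be misread.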
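For reference, a minimal sketch of the loop with matching specifiers: %f for the float and %d for the int counter. The names `update_theta_kernel_fixed` and `run_fixed_printf_demo` are hypothetical, and the pointer parameters of the original kernel are dropped here because the simplified kernel never uses them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel from the question, changed only in
// the printf format strings.
__global__ void update_theta_kernel_fixed()
{
    float priv_theta_sth = 0.0f;

    for (int i = 0; i < 3; ++i) {
        // %f matches the float argument, %d matches the int counter.
        printf("priv_theta = %f in loop i = %d\n", priv_theta_sth, i);
        printf("loop i = %d\n", i);
    }
}

// Hypothetical host helper: one block, one thread, as suggested earlier
// in the thread for debugging.
cudaError_t run_fixed_printf_demo()
{
    update_theta_kernel_fixed<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();   // launch configuration errors
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();         // waits and flushes device printf
}
```

With the specifiers matched, both printf lines in each iteration should report the same value of i.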

Thanks, that's really inspiring. The bigger issue I ran into occurs when I use the counter to index a long matrix: the counter overflows and the pointer flies away. Do you have any ideas on that? Or maybe I should start a new issue later.

I would start a new issue. Perhaps with the fixed printf you will already find out what went wrong.