Loop inside kernel

Hi everybody,

I am new to CUDA, so maybe (I am almost sure) this question is going to be silly. I'd like to know what happens when you have a loop inside a kernel, for example:

for (int i = 0; i < N; i++) {
    // ...
}

I've read that each thread has its own local memory, but if we first declare tid as:

threadIdx.x + blockIdx.x * blockDim.x

I guess the 'i' in the loop will be located in shared memory, won't it? So how can I make a loop inside a kernel? I want each thread to compute 'm' values.

Thanks

I've just tested it, and it seems like every thread does the loop. Sorry for the silly question.

Your loop counter (i) appears to have nothing to do with your thread ID (tid), so it will work just like any normal C code. The compiler will choose where to store i; it could be a register.
Are you saying you want the loop limit (N) to be m, so that each thread calculates m values? That's fine; a thread can calculate as many values as you want.
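To make this concrete, here is a minimal sketch of a kernel where each thread handles one column of an m x N matrix and loops over its m values (the kernel name, the column-major layout, and the doubling operation are just illustrative assumptions, not from the thread):

```cuda
// Sketch: one thread per column of an m x N matrix stored column-major.
__global__ void processColumns(float *data, int m, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    if (tid < N) {
        // i is an ordinary per-thread local variable: the compiler
        // will usually keep it in a register, not in shared memory.
        for (int i = 0; i < m; i++) {
            data[tid * m + i] *= 2.0f;  // example work: double each element
        }
    }
}
```

Every thread runs its own independent copy of the loop, which matches what you observed in your test.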

Yes, I have an m x N matrix, so I want N threads to compute m values each.

I am now trying to measure the performance of the machine, but when I measure the execution time of the kernel with clock() it always gives me 0, even with a loop of 1000000 iterations. Is there another function for timing in CUDA?

Sorry, I know this doesn't fit this thread. And sorry again for my English…
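For what it's worth, clock() on the host tends to report 0 here because kernel launches are asynchronous (the host call returns before the kernel has actually run) and because clock()'s resolution is coarse. The usual tool is CUDA events. A minimal sketch, with a made-up dummy kernel and launch configuration:

```cuda
#include <cstdio>

__global__ void busyKernel(int n)   // illustrative dummy kernel to time
{
    volatile int x = 0;             // volatile so the loop isn't optimized away
    for (int i = 0; i < n; i++) x += i;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busyKernel<<<1, 256>>>(1000000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Alternatively, you can call cudaDeviceSynchronize() after the launch and use a host timer around the whole thing, but events measure only the GPU work.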
