Loop inside kernel

Hi everybody,

I am new to CUDA, so maybe (I am almost sure) this question is going to be silly. I'd like to know what happens when you have a loop inside a kernel, for example:

for (int i = 0; i < N; i++) {
    // ...
}

I've read that each thread has its own local memory, but if we first declare tid as:

threadIdx.x + blockIdx.x * blockDim.x

I guess the 'i' in the loop will be located in shared memory, won't it? So how can I make a loop inside a kernel? I want each thread to compute 'm' values.

Thanks

I've just tested it, and it seems like every thread does the loop. Sorry for the silly question.

Your loop counter (i) appears to have nothing to do with your thread ID (tid), so it will work just like any normal C code. The compiler will choose where to store i; it could be a register.
Are you saying you want the loop limit (N) to be m, so that each thread calculates m values? That's fine; a thread can calculate as many values as you want.
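To make this concrete, here is a minimal sketch of a kernel where each thread handles one column of an m x N matrix and loops over its m values (the kernel name, the column-major layout, and the doubling operation are just illustrative assumptions, not from the thread):

```cuda
// Sketch: one thread per column of an m x N matrix stored column-major.
__global__ void processColumns(float *data, int m, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    if (tid < N) {
        // i is an ordinary per-thread local variable: the compiler
        // will usually keep it in a register, not in shared memory.
        for (int i = 0; i < m; i++) {
            data[tid * m + i] *= 2.0f;  // example work: double each element
        }
    }
}
```

Every thread runs its own independent copy of the loop, which matches what you observed in your test.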

Yes, I have an m x N matrix, so I want N threads to compute m values each.

I am now trying to measure the performance of the machine, but when I measure the execution time of the kernel with clock() it always gives me 0, even with a loop of 1000000 iterations. Is there another function for timing in CUDA?

Sorry, I know this doesn't fit this thread. And sorry again for my English…
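For what it's worth, clock() on the host tends to report 0 here because kernel launches are asynchronous (the host call returns before the kernel has actually run) and because clock()'s resolution is coarse. The usual tool is CUDA events. A minimal sketch, with a made-up dummy kernel and launch configuration:

```cuda
#include <cstdio>

__global__ void busyKernel(int n)   // illustrative dummy kernel to time
{
    volatile int x = 0;             // volatile so the loop isn't optimized away
    for (int i = 0; i < n; i++) x += i;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busyKernel<<<1, 256>>>(1000000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Alternatively, you can call cudaDeviceSynchronize() after the launch and use a host timer around the whole thing, but events measure only the GPU work.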
