Understanding threads: I can't understand how they work.

In the matrix multiplication example given in the programming guide (and later included in the SDK projects for execution), threads were generated, but I don't know whether more threads are being generated by the loop that assigns each element of the memory shared between host and device to a thread. Still, the thread index value doesn't seem to change, and neither does the block value. Correct me if I am wrong somewhere.

As far as I understand, threads run in parallel on the GPU; threads make up a thread block, and thread blocks of the same dimensions make up a grid.

Apart from that, I want to ask: can a GPU like this run only data-parallel programs (e.g. dot products), or is there some other way to introduce parallelism into code that would normally run on the CPU but is now running on the GPU?

Can I write some programs with a sequential flow, just to check the correctness of the GPU results?

thanks.

Khurram Hameed

Threads are only 'generated' when calling the kernel from the host, like so:

my_kernel<<<numblocks, numthreads>>>(params);

The code within __global__ void my_kernel(params) is run by all threads, and each thread can find out who it is by using blockIdx and threadIdx. During the runtime of a kernel, no new threads are generated at all.

Maybe you should start with the reduction example instead; it is much easier to understand (I speak from experience).