Is there any difference in behavior of blocks vs. threads?

Is there a difference between threads and blocks when it comes to executing instructions, or is it just a physical constraint? In other words, is there a performance difference between func<<<2,1>>>() and func<<<1,2>>>(), or is it just a matter of the physical limit of x threads per block and x blocks in total?

Yes, there are big differences. Threads of the same block run together on one multiprocessor, can synchronize with each other using __syncthreads(), and can exchange data through shared (or global) memory. Blocks are scheduled independently: the order of execution of threads from different blocks is undefined, and different blocks have no common shared memory space.

Check section 2.2 of the Programming Guide.
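
To make the difference concrete, here is a minimal sketch, not taken from the Programming Guide. The kernel name reverseWithinBlock and the constant THREADS_PER_BLOCK are made up for illustration. Each block reverses its own slice of an array using shared memory; the threads of a block cooperate through the tile array, while the two blocks never see each other's tile.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 4  // hypothetical size, chosen only for illustration

// Each block reverses its own THREADS_PER_BLOCK-element slice of data.
__global__ void reverseWithinBlock(int *data)
{
    // One private copy of this array exists per block.
    __shared__ int tile[THREADS_PER_BLOCK];

    int local  = threadIdx.x;                            // index inside the block
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // index in the whole array

    tile[local] = data[global];

    // Barrier for the threads of THIS block only; it does not
    // synchronize with threads of other blocks.
    __syncthreads();

    // Read an element that a different thread of the same block wrote.
    data[global] = tile[blockDim.x - 1 - local];
}

int main()
{
    const int numBlocks = 2;
    const int n = numBlocks * THREADS_PER_BLOCK;

    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    reverseWithinBlock<<<numBlocks, THREADS_PER_BLOCK>>>(dev);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < n; ++i) printf("%d ", host[i]);
    printf("\n");  // prints: 3 2 1 0 7 6 5 4
    return 0;
}
```

Launching this as <<<1, 8>>> with an 8-element tile would instead reverse the whole array, because all eight threads would then share one tile. That is the kind of behavioral difference between threads and blocks described above.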

Thanks, that’s useful to know. So with regard to memory space, does that include the memory I allocate on the device, or does CUDA just make multiple copies of the space I allocated? Are there any other characteristics that differ when coding?
