Is padding needed for Malloc?

When malloc'ing, say, 10-element vectors a(), b() and c(), and computing c(i)=a(i)*b(i) with 16 threads, what happens with the 11th to 16th threads? Do they read from outside a() and b() and write the result past the end of c()? I can’t see any way to control what happens when there are more threads than elements in the vectors. Is the only option to malloc extra elements for the remaining threads?

You must set the boundary condition explicitly in your kernel code, for example:

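[codebox]__global__ void add( float *C, float *A, float *B, unsigned int N )
{
	// BLOCK_DIM is the number of threads per block (same value as blockDim.x)
	unsigned int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;

	if ( idx < N ){ // N is size of A and B
		C[idx] = A[idx] * B[idx] ;
	}
}[/codebox]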


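For anyone who wants to try this, here is a minimal sketch of a complete program built around the same bounds check, using the numbers from the question: 10 elements and one block of 16 threads. The names `multiply` and `count_mismatches` are made up for this illustration, `blockDim.x` is used in place of a `BLOCK_DIM` constant, and error checking of the CUDA calls is left out to keep it short. The arrays are allocated with exactly 10 elements, so no padding is involved; the 6 extra threads fail the `idx < n` test and never touch memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise product with an explicit bounds check.
__global__ void multiply(float *c, const float *a, const float *b, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {                 // threads with idx >= n skip the body
        c[idx] = a[idx] * b[idx];
    }
}

// Hypothetical helper: launches more threads than elements and returns the
// number of wrong results (0 means every element was computed correctly).
int count_mismatches()
{
    const unsigned int n = 10;        // vector length from the question
    const unsigned int threads = 16;  // one block of 16 threads, 6 of them idle
    const size_t bytes = n * sizeof(float);

    float ha[n], hb[n], hc[n];
    for (unsigned int i = 0; i < n; ++i) {
        ha[i] = (float)i;
        hb[i] = 2.0f;
        hc[i] = -1.0f;                // overwritten by the copy back from the device
    }

    // Device buffers hold exactly n elements, with no padding.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    multiply<<<1, threads>>>(dc, da, db, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    int bad = 0;
    for (unsigned int i = 0; i < n; ++i) {
        if (hc[i] != ha[i] * hb[i]) ++bad;
    }
    return bad;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

For larger vectors the same kernel works with several blocks, as long as the grid is rounded up so that at least N threads exist, e.g. `(n + threads - 1) / threads` blocks; the check then only matters in the last block.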

Hi. Thanks for the example. It is still a bit unclear to me what happens to the “overflow” threads between N and the block size. As far as I understand, all threads in a block execute exactly the same instructions. How is it then ensured that the overflow threads do nothing?