Number of threads in kernel doesn't work as expected: strange behavior

Hi, I’m having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares the results. This works without a problem. The trouble starts when I try different numbers of threads and blocks for learning purposes. I have the following kernel:

__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N){
    int idx = threadIdx.x;
    if (idx<N)
        c[idx] = a[idx]*b[idx];
}

which I call like:

multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);

For the moment I’ve fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be one thread for each multiplication, so N and nThreads should be equal.

The problem is the following:

  1. I first call the kernel with N=16 and nThreads<16, which doesn’t work. (This is OK.)

  2. Then I call it with N=16 and nThreads=16, which works fine. (Again, as expected.)

  3. But when I call it again with N=16 and nThreads<16, it still works!

I don’t understand why the last step doesn’t fail like the first one. It only fails again if I restart my PC.

Has anyone run into something like this before or can explain this behavior?

Note: I allocate and free the CUDA memory as follows:

Allocate:

cudaMalloc((void **) &vector_a_d, vector_size);
cudaMemcpy(vector_a_d, vector_a_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_b_d, vector_size);
cudaMemcpy(vector_b_d, vector_b_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_c_d, vector_size);

Deallocate:

cudaFree(vector_a_d);
cudaFree(vector_b_d);
cudaFree(vector_c_d);

How are you defining “works”? That the host and device get the same answer? In that case, in case 3 you are probably reading leftover results from case 2 in the part of the array past nThreads, which the kernel never writes. This can happen even if you free and malloc between cases 2 and 3, because the driver can hand back the block you previously allocated, totally unchanged. No memory clearing is done at allocation time or at free time.
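One way to check this, offered as a minimal sketch rather than a definitive fix: overwrite the output buffer with a sentinel before each launch, so stale results from an earlier run cannot pass for fresh ones. The kernel below is the one from the question. Everything else is an assumption made for illustration: the host arrays and their values, nThreads=8, and the choice of filling vector_c_d with 0xFF bytes (which read back as NaN floats).

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the question: the index comes from threadIdx.x only,
// so a single block of nThreads threads writes c[0..nThreads-1].
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}

int main() {
    const int N = 16;
    const int nThreads = 8;  // assumption: deliberately fewer threads than elements
    const size_t vector_size = N * sizeof(float);

    // Assumed host data; any finite values would do.
    float vector_a_h[N], vector_b_h[N], vector_c_h[N];
    for (int i = 0; i < N; ++i) {
        vector_a_h[i] = i + 1.0f;
        vector_b_h[i] = 2.0f;
    }

    float *vector_a_d, *vector_b_d, *vector_c_d;
    cudaMalloc((void **)&vector_a_d, vector_size);
    cudaMalloc((void **)&vector_b_d, vector_size);
    cudaMalloc((void **)&vector_c_d, vector_size);
    cudaMemcpy(vector_a_d, vector_a_h, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(vector_b_d, vector_b_h, vector_size, cudaMemcpyHostToDevice);

    // Sentinel: set every byte of the output to 0xFF. A float whose bits
    // are all ones is a NaN, so untouched elements are easy to spot.
    cudaMemset(vector_c_d, 0xFF, vector_size);

    multiplyVectorsCUDA<<<1, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(vector_c_h, vector_c_d, vector_size, cudaMemcpyDeviceToHost);

    // Count elements that still hold the sentinel, i.e. were never written.
    int unwritten = 0;
    for (int i = 0; i < N; ++i)
        if (std::isnan(vector_c_h[i]))
            ++unwritten;
    printf("elements never written by the kernel: %d of %d\n", unwritten, N);

    cudaFree(vector_a_d);
    cudaFree(vector_b_d);
    cudaFree(vector_c_d);
    return 0;
}
```

With the sentinel in place, a comparison against the host result should fail for every element the kernel skipped, no matter what an earlier run left in that memory.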