Hi, I’m having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares the results. This works without a problem. What I’m stuck on is that I’m trying to test different numbers of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}
which I call like:
multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);
For the moment I’ve fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there should be one thread per multiplication, so N and nThreads should be equal.
The problem is the following:
- I first call the kernel with N=16 and nThreads<16, which doesn’t work. (This is expected.)
- Then I call it with N=16 and nThreads=16, which works fine. (Again as expected.)
- But when I then call it again with N=16 and nThreads<16, it still works!
I don’t understand why the last step doesn’t fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?
Note: I allocate and deallocate the CUDA memory as follows:
Allocate
cudaMalloc((void **) &vector_a_d, vector_size);
cudaMemcpy(vector_a_d, vector_a_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_b_d, vector_size);
cudaMemcpy(vector_b_d, vector_b_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_c_d, vector_size);
Deallocate
cudaFree(vector_a_d);
cudaFree(vector_b_d);
cudaFree(vector_c_d);