Number of threads in kernel doesn't work as expected: strange behavior

kirbuchi · July 2, 2010, 9:19pm

Hi, I’m having some trouble with a very basic CUDA program. It multiplies two vectors on the host and on the device and then compares the results. This works without a problem. The trouble starts when I try different numbers of threads and blocks for learning purposes. I have the following kernel:

__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N){
	int idx = threadIdx.x;
	if (idx < N)
		c[idx] = a[idx] * b[idx];
}

which I call like:

multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);

For the moment I’ve fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be one thread for each multiplication, so N and nThreads should be equal.

The problem is the following:

1. I first call the kernel with N=16 and nThreads<16, which doesn’t work. (This is ok.)
2. Then I call it with N=16 and nThreads=16, which works fine. (Again, as expected.)
3. But when I call it again with N=16 and nThreads<16, it still works!

I don’t understand why the last step doesn’t fail like the first one. It only fails again if I restart my PC.

Has anyone run into something like this before or can explain this behavior?

Note: I allocate and free the CUDA memory as follows:

Allocate

cudaMalloc((void **) &vector_a_d, vector_size);
cudaMemcpy(vector_a_d, vector_a_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_b_d, vector_size);
cudaMemcpy(vector_b_d, vector_b_h, vector_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &vector_c_d, vector_size);

Deallocate

cudaFree(vector_a_d);
cudaFree(vector_b_d);
cudaFree(vector_c_d);

seibert · July 2, 2010, 10:36pm

How are you defining “works”? That host and device get the same answer? In that case, you are probably reading the leftover results from case 2 at the end of the array (past nThreads). This can happen even if you free and malloc between case 2 and case 3, because the driver can hand back the block you previously allocated, totally unchanged. No memory clearing is done at allocation time or at free time.
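One way to check this explanation is to clear the output buffer before every launch, so that elements no thread writes cannot match the host result by accident. The following is only a minimal sketch under stated assumptions: the values N=16 and nThreads=8, the host input values, and the mismatch counter are illustrative and not from the original program; the kernel and variable names follow the question.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the question: a single block, one thread per element.
__global__ void multiplyVectorsCUDA(const float *a, const float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}

int main() {
    const int N = 16;         // assumed vector length
    const int nThreads = 8;   // assumed: deliberately fewer threads than elements
    const size_t vector_size = N * sizeof(float);

    // Assumed host inputs, chosen so that every product is nonzero.
    float vector_a_h[N], vector_b_h[N], vector_c_h[N];
    for (int i = 0; i < N; ++i) {
        vector_a_h[i] = i + 1.0f;
        vector_b_h[i] = 2.0f;
    }

    float *vector_a_d, *vector_b_d, *vector_c_d;
    cudaMalloc((void **)&vector_a_d, vector_size);
    cudaMalloc((void **)&vector_b_d, vector_size);
    cudaMalloc((void **)&vector_c_d, vector_size);
    cudaMemcpy(vector_a_d, vector_a_h, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(vector_b_d, vector_b_h, vector_size, cudaMemcpyHostToDevice);

    // cudaMalloc does not clear memory, so wipe any results left by an
    // earlier run. All-zero bytes read back as 0.0f.
    cudaMemset(vector_c_d, 0, vector_size);

    multiplyVectorsCUDA<<<1, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
    assert(cudaGetLastError() == cudaSuccess);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(vector_c_h, vector_c_d, vector_size, cudaMemcpyDeviceToHost);

    // Count elements where the device result differs from the host result.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (vector_c_h[i] != vector_a_h[i] * vector_b_h[i])
            ++mismatches;

    printf("mismatches: %d of %d\n", mismatches, N);
    // Only the first nThreads elements are written by the kernel.
    assert(mismatches == N - nThreads);

    cudaFree(vector_a_d);
    cudaFree(vector_b_d);
    cudaFree(vector_c_d);
    return 0;
}
```

With the buffer cleared, an undersized launch should report the same number of mismatches on every run, regardless of what an earlier launch left in device memory.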
