Hi,
I am trying to compare the performance of threads versus blocks in my CUDA program.
I am using an NVIDIA GeForce GT 750M (2048 MB) on a MacBook Pro.
My code is based on this material: http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
So I created three kernels like these:
#define N 1000000
// using blocks: one block per element, indexed by blockIdx.x
__global__ void add_block(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// using threads: one thread per element, indexed by threadIdx.x
__global__ void add_thread(int *a, int *b, int *c)
{
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// using blocks and threads: global index built from block and thread IDs
__global__ void add_block_thread(int *a, int *b, int *c)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
I then added two integer arrays of 1 million elements each (hence N = 1,000,000) with each kernel, recorded the execution times, and compared them.
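For completeness, the setup before the timed part (allocation and copies to the device) looks roughly like the following sketch; size is N * sizeof(int), and random_ints() is the fill helper from the slides:

int *a, *b, *c_block, *c_thread, *c_combine; // host copies
int *d_a, *d_b, *d_c;                        // device copies
int size = N * sizeof(int);

// allocate and fill the host inputs
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c_block = (int *)malloc(size);
c_thread = (int *)malloc(size);
c_combine = (int *)malloc(size);

// allocate device memory and copy the inputs over
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

Then each version is timed like this: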
////// calculate with CUDA blocks
start = clock(); // tic
// execute the operation on the device
add_block<<<N, 1>>>(d_a, d_b, d_c);
end = clock(); // toc
cuda_block_seconds = (float)(end - start) / CLOCKS_PER_SEC;
// copy the result back to host
cudaMemcpy(c_block, d_c, size, cudaMemcpyDeviceToHost);
//////////////////////////////////////////

////// calculate with CUDA threads
start = clock(); // tic
// execute the operation on the device
add_thread<<<1, N>>>(d_a, d_b, d_c);
end = clock(); // toc
cuda_thread_seconds = (float)(end - start) / CLOCKS_PER_SEC;
// copy the result back to host
cudaMemcpy(c_thread, d_c, size, cudaMemcpyDeviceToHost);
//////////////////////////////////////////

////// calculate with threads and blocks
start = clock(); // tic
// execute the operation on the device
add_block_thread<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
end = clock(); // toc
cuda_combine_seconds = (float)(end - start) / CLOCKS_PER_SEC;
// copy the result back to host
cudaMemcpy(c_combine, d_c, size, cudaMemcpyDeviceToHost);
//////////////////////////////////////////
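One thing I am not doing is error checking after the kernel launches. If it matters here, I assume a check after each launch would look something like this (using cudaGetLastError / cudaDeviceSynchronize):

cudaError_t err = cudaGetLastError();  // reports invalid launch configurations
if (err != cudaSuccess)
    printf("launch: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();         // waits for the kernel; reports execution errors
if (err != cudaSuccess)
    printf("kernel: %s\n", cudaGetErrorString(err));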
In the code I set the number of blocks (or threads) to N, which is 1 million. I have verified the results by comparing them against a sequential calculation on the host, and all of them were correct.
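The host check is roughly this (add_host and c_seq are placeholder names for my sequential version and its output; the actual code differs slightly):

// sequential reference on the host
void add_host(int *a, int *b, int *c)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

// element-by-element comparison, e.g. for the block version
for (int i = 0; i < N; i++)
    if (c_block[i] != c_seq[i])
        printf("mismatch at index %d\n", i);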
The results show that 1 block with 1 million threads performs much better than 1 million blocks with 1 thread each. Here are the printouts:
Calculation without CUDA (sequential) time is 0.042559 seconds
CUDA with 1 million blocks and 1 thread time is 0.000073 seconds
CUDA with 1 million threads and 1 block time is 0.000007 seconds
CUDA with blocks and threads combined time is 0.000014 seconds
What I don’t understand is this: the maximum number of threads per block is supposed to be 512, so why did the program still work when I set the number of threads (or blocks) to 1 million?
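For reference, I assume the actual limits can be queried from the device properties like this:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // device 0 is the GT 750M
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max grid size (x):     %d\n", prop.maxGridSize[0]);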
Could anyone help me understand this?
Thank you!