Number of blocks and threads

panourg · November 29, 2011, 10:36am

Dear all

When I launch a kernel, how many blocks and how many threads per block should I set? I know this depends on the GPU, but I do not understand how to choose them.
For example, if I have a GPU with 32 cores, how should I set these two numbers? Does the limit of 512 threads per block depend on the GPU?

Thanks

kostas

pasoleatis · November 30, 2011, 4:09pm

Take a look at this code, which does vector-vector addition.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    // Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    // Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block.
    // 1024 requires compute capability 2.x or newer; 1.x devices allow at most 512.
    blockSize = 1024;

    // Number of thread blocks in grid: n/blockSize rounded up,
    // so that gridSize*blockSize >= n
    gridSize = (n + blockSize - 1) / blockSize;

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
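
To make the grid-size arithmetic in the listing above easier to see on its own, here is a minimal host-only sketch. The helper names gridSizeFor and idleThreads are hypothetical (they are not part of the thread or of the CUDA API). The code is plain C++, so it should also build as host code under nvcc, and it assumes n and blockSize are small enough that their sum fits in an int.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// gridSize * blockSize >= n (integer ceiling division).
int gridSizeFor(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: threads launched beyond the last element.
// These are the threads that an `if (id < n)` guard in the kernel
// prevents from touching memory out of bounds.
int idleThreads(int n, int blockSize)
{
    return gridSizeFor(n, blockSize) * blockSize - n;
}
```

For the n = 100000 used above, a block size of 1024 gives 98 blocks with 352 idle threads, and a block size of 512 gives 196 blocks, also with 352 idle threads. The block size itself has to stay within the device limit (512 threads per block on compute capability 1.x, 1024 on 2.x and newer), which is why it depends on the GPU.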
