Help me!

I want to know:

What is threadIdx?
What can I determine threadIdx from?


threadIdx is a built-in structure from which you can get a thread's index within its block.

If I use a GeForce 9400 GT, I want to know:

what value does threadIdx have?
what about blockIdx?


Sorry, my written English is bad.

Hi Pingkung, I'm also really new to CUDA, and I have to say I also had a lot of trouble understanding how these indices work. I'm not really sure I understand everything yet, but I can already write some kernels that work.

  1. Think of your data as a vector, like V(x1,x2,x3,…,xn). It is possible to convert a matrix into vector form, so working with vectors is not a limitation.

  2. Depending on the work you want CUDA to do, you should specify the execution configuration for the kernel: mykernel<<<nBlocks, blockSize>>>(parameters).

  3. The configuration information is: a) the size of the grid (the number of blocks within the grid) and b) the number of threads within a block. So if you want the kernel to increment the elements of a 64-element vector, you can specify mykernel<<<1,64>>>(parameters) or mykernel<<<8,8>>>(parameters) (in the first case you have one block in the grid, and this block has 64 threads).

  4. Now you need to work through this vector. For this purpose (with mykernel<<<8,8>>>(parameters)) you need to calculate an index i so you can write V[i] = V[i] + 1; in your kernel.

  5. i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x; (the same as blockIdx.x*blockDim.x + threadIdx.x), where all of these variables are built into the CUDA runtime.

Take a look at this piece of code; I hope it helps you understand these indices better.


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

void incrementArrayOnHost(float *a, int N)
{
  int i;
  for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  if (idx < N)
    a[idx] = a[idx] + 1.f;
}

int main(void)
{
  float *a_h, *b_h;  // pointers to host memory
  float *a_d;        // pointer to device memory
  int i, N = 10;
  size_t size = N*sizeof(float);

  // allocate arrays on host
  a_h = (float *)malloc(size);
  b_h = (float *)malloc(size);

  // allocate array on device
  cudaMalloc((void **) &a_d, size);

  // initialization of host data
  for (i = 0; i < N; i++)
    a_h[i] = (float)i;

  // copy data from host to device
  cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

  // do calculation on host
  printf("Result before computation on host\n");
  for (i = 0; i < N; i++)
    printf("%f\n", a_h[i]);

  incrementArrayOnHost(a_h, N);

  printf("\n\n\n......computation on host\n");
  printf("Result of computation on host\n");
  for (i = 0; i < N; i++)
    printf("%f\n", a_h[i]);

  // do calculation on device:
  // Part 1 of 2. Compute execution configuration
  int blockSize = 4;
  int nBlocks = N/blockSize + (N%blockSize == 0 ? 0 : 1);

  // Part 2 of 2. Call incrementArrayOnDevice kernel
  incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

  // Retrieve result from device and store in b_h
  cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  // check results
  for (i = 0; i < N; i++)
    assert(a_h[i] == b_h[i]);

  // cleanup
  free(a_h);
  free(b_h);
  cudaFree(a_d);

  return 0;
}
Hope this will help you understand better.



Thank you.