What does <<< >>> mean?

Dear CUDA Developers:

I do not know what <<< >>> means when a CUDA kernel is called. I know that the GPU device function gets called. CUDA is supposed to be an extension of C, but I do not know what <<< >>> means in C. It looks similar to the cin >> / cout << operators in C++. It would be nice if NVIDIA thoroughly documented the language, not just the function calls. Since I come from the (“Vastly Superior” - in Mooninite voice) Fortran world, a proper language reference would be welcome.

Thanks,
Tony

CUDA is an extension of the C programming language, and it adds some extra notation. <<<G,B,N>>> specifies the execution configuration for a given kernel, i.e., the dimensions of the grid and of the blocks used to execute the kernel (plus, optionally, the amount of dynamically allocated shared memory). You can find additional details in the CUDA documentation.

If you know Fortran, you might know hosts - real hosts, with job queues.

programcode <<<G,T,S>>>(param1, param2 …)

means that programcode will be started several times, with S bytes of dynamically allocated shared memory per block and the parameters param1, param2 …

The exact ‘several times’ is specified by G and T.

T (threads per block) is a triple of three integers tx, ty, tz; their product may not exceed 512.

G (grid) is also a triple of three integers, but only gx and gy are used and the third component is ignored, so it effectively has two. Each of the two significant integers may not exceed 65535.

If you, for example, fill T with 2,3,4 and G with 5,6 and launch the kernel,

the program code will be called (5*6) * (4*3*2) = 720 times.

The 4*3*2 = 24 calls within one block are scheduled together, and the code might run in parallel.

The order of the 4*3*2 calls within the 6*5 blocks is not defined.

It is up to the device how many calls will really run in parallel.

The device passes the tx,ty,tz and gx,gy values to the program code as threadIdx and blockIdx.

Each call has an index, the_index, which you can figure out in the code:

__global__ void programcode( float * array ){

const int inbl=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y;

const int abl=blockDim.x * blockDim.y * blockDim.z;

   const int the_index= inbl + ( (blockIdx.x + blockIdx.y * gridDim.x)* abl);

   array[the_index]=the_index;

}

This will fill an array of 720 elements with 0 to 719, potentially highly in parallel.

The kernel call would be:

dim3 t;

dim3 g;

t.x=2;

t.y=3;

t.y=4;

g.x=5;

g.y=6;

float *array;

//will allocate memory on the device

//and write the memory adress to array.

cudaMalloc((void**)&array, 720 * sizeof(float));

//call the code such and such times with 0 bytes of shared memory and array as parameter

programcode <<<g,t,0>>>(array);

And the documentation in section 4.2, Language Extensions, isn’t good enough? I found the documentation on the execution configuration in section 4.2.3 to be adequate. What more could you want, a full grammar for the language?

Actually, this doesn’t work; for some reason it only fills 0-239 with the corresponding value. I have no idea why, so if you do, please :)

Otherwise this piece of text you wrote was priceless, as it clears lots of things up for me.

this:

dim3 t;

dim3 g;

t.x=2;

t.y=3;

t.y=4;

g.x=5;

g.y=6;

should be:

dim3 t;

dim3 g;

t.x=2;

t.y=3;

t.z=4;

g.x=5;

g.y=6;

That gives me a CUT_CHECK_ERROR instead, though…

Change this line:

 const int inbl=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y;

to this:

 const int inbl=threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);

HTH.