Can I Control Thread ID?


I’m a newbie to CUDA. Can anyone advise me?

I’m considering CUDA parallel programming to parallelize a for loop.

I have two questions…

I’ll divide a loop over 1~N among 128 threads… N is 1024^4 (a large iteration count).

Each of the 128 cores works on N/128 iterations.

The loop has a core kernel function.

Which setup should I use to get full performance?

I want to make 128 threads (1 thread per core). Which of these should I use?


tid = threadIdx; (0~127)



tid=blockIdx*32 + threadIdx; (0~127)

runtest<<<32,128>>> ();

tid=blockIdx*32 + threadIdx;

I plan to use the <<<32,8>>> method… Is that the right setting?

If so, each thread runs 1~1024^4/128 iterations on its own core.

Each thread will use some 2D arrays inside the kernel function.

They live in global memory (example: float b_device[idx][k], with idx = 0~127 and k = 0~1000),

where idx identifies the thread

and k is used for the computation…

Each core uses only the data belonging to its own threadIdx.

Or I could use

arrayidx=threadID*128 + k;

a_device[arrayidx] ; arrayidx = 0~ 128*1024^4

I think computing arrayidx=threadID*128 + k; in every thread wastes computing time. Should I use a_device or b_device?

If CUDA supports 2D arrays, I’ll use b_device only,

or is there any method to handle 2D arrays in a device function?

In the manual I’ve read about memory pitch or something like that, but I did not understand it…

My pseudocode is below [not perfect… sorry].

Any comments or advice are welcome… :))

/* */


    Cuda_Malloc(a_device, 128*1024^4 *sizeof(float) ); // 1D array

    Cuda_Malloc(b_device, 128*1024^4 *sizeof(float) ); // 2D array

    runtest<<<1,128>>>(variables );  // 128 threads (1 thread per core)


__global__ function runtest( ){

    for (i = 0;i<N/128;i++){

        tid = threadIdx;   // or tid = blockIdx*32+threadIdx;

        x=kernel_function_a(tid);  // each thread calls only 1 function

        sum[i]=sum[i]+x;

    }

}


__device__ kernel_function_a(tid){

    for (k=0;k<M;k++){

        arrayidx= tid*128+k;                  // wasting computing time?

        a_device[arrayidx] = a[arrayidx]+c;   // 1D array

        b_device[tid][k] = b[tid][k] +c;      // 2D array. can support this?

        d = function of a[k]; ...

    }

    return (float) d;

}


First off, you are trying to allocate very large arrays.

Cuda_Malloc(a_device, 128*1024^4 *sizeof(float) );

This is likely to fail… it works out to about 512 terabytes! No graphics card, or even home computer, is going to have that kind of memory. You need to create arrays that will fit onto your graphics card ;)

I think you will benefit from looking at, and stepping through, some of the simple examples provided in the SDK. The basic idea of CUDA is that you specify how many threads you want in each block, and then specify how many blocks you want to run.

Let’s say you have 128 threads in each block, and you only have 2 blocks. Each block will have a set of threads with IDs ranging from 0 to 127. So, to directly answer your question, the thread ID for your case is simply threadIdx.x. Normally, though, you will also have to use the block ID (blockIdx.x). You may have to do things like:

my_lil_array[128 * blockIdx.x + threadIdx.x] = myFinalValue;

Here each thread in each block writes its final value to its own memory location. With 128 threads per block and only 2 blocks, you can expect my_lil_array[0] through my_lil_array[255] to be filled.
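
For what it's worth, here is a minimal sketch of that pattern as a complete kernel. The names fill_array, global_index and launch_fill, and the value that gets stored, are made up for illustration. blockDim.x stands in for the hard-coded 128 so the kernel also works with other block sizes.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: position of a thread in the whole grid (1D launch).
__host__ __device__ inline int global_index(int block_dim, int block_idx, int thread_idx)
{
    return block_dim * block_idx + thread_idx;
}

// Each thread writes one value into its own slot of the output array.
__global__ void fill_array(float *my_lil_array, int n)
{
    int idx = global_index(blockDim.x, blockIdx.x, threadIdx.x);
    if (idx < n)                          // guard in case the grid is larger than n
        my_lil_array[idx] = (float)idx;   // placeholder for myFinalValue
}

// Launch with 2 blocks of 128 threads, as in the example above.
// d_array must point to at least n floats allocated with cudaMalloc.
inline void launch_fill(float *d_array, int n)
{
    fill_array<<<2, 128>>>(d_array, n);
}
```

With this launch, block 0 covers indices 0 to 127 and block 1 covers indices 128 to 255.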

Also, you want way more than 128 threads, way, way more.
Each multiprocessor processes blocks of threads, and it can process at most 768 threads at a time. So say you have a card with 16 multiprocessors, and your algorithm does not use so many registers that it limits the number of threads a multiprocessor can process at once. Then you could use 256 threads per block, which gives a maximum of 3 blocks per multiprocessor at the same time (if there are more blocks to process, they are picked up when the first ones are finished). So you need 3 * 16 = 48 blocks as a minimum to make the GPU start to sweat.

48 blocks * 256 threads = 12288 threads (16 multiprocessors * 768) working ‘at the same time’

Running only 1 block will always lead to sub-optimal performance.

Your call of <<<1,128>>> will generate 1 block containing 128 threads. So only 1 of the multiprocessors would be working, and that multiprocessor would only have an occupancy of 128/768 ≈ 17%.
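
The arithmetic behind those numbers, as a small host-side sketch. The two constants are just the figures quoted in this thread (768 resident threads per multiprocessor, 16 multiprocessors), not values queried from a device, and the function names are made up. Register, shared memory and blocks-per-multiprocessor limits are ignored here.

```cuda
// Figures quoted in this thread, treated as assumptions about the card.
const int MAX_THREADS_PER_MP  = 768;  // resident-thread limit per multiprocessor
const int NUM_MULTIPROCESSORS = 16;

// How many whole blocks of the given size fit on one multiprocessor at once.
inline int blocks_per_mp(int threads_per_block)
{
    return MAX_THREADS_PER_MP / threads_per_block;
}

// Smallest grid that gives every multiprocessor its full share of blocks.
inline int min_blocks_to_fill(int threads_per_block)
{
    return blocks_per_mp(threads_per_block) * NUM_MULTIPROCESSORS;
}

// Fraction of one multiprocessor's thread limit covered by resident_threads.
inline double occupancy(int resident_threads)
{
    return (double)resident_threads / MAX_THREADS_PER_MP;
}
```

For 256 threads per block this gives 3 blocks per multiprocessor and 48 blocks for the whole card, and a single block of 128 threads comes out at roughly 0.17.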

But given your code example, it is indeed smart to start by looking at the simple examples in the SDK.
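
In the meantime, one common way to spread a long loop over many blocks is a grid-stride loop, sketched below under some assumptions: work() is only a stand-in for the per-iteration computation (the real one is not shown in this thread), strided_sum, partial and launch_runtest are made-up names, and each thread keeps its own partial sum that still has to be added up afterwards.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real per-iteration computation (assumption).
__host__ __device__ inline float work(long long i)
{
    return (float)(i % 7);
}

// Sum of work(i) for i = first, first + stride, first + 2*stride, ... below n.
__host__ __device__ inline float strided_sum(long long first, long long stride, long long n)
{
    float sum = 0.0f;
    for (long long i = first; i < n; i += stride)
        sum += work(i);
    return sum;
}

// Grid-stride loop: thread t handles iterations t, t + T, t + 2T, ...
// where T is the total number of threads in the grid.
__global__ void runtest(float *partial, long long n)
{
    long long tid    = (long long)blockDim.x * blockIdx.x + threadIdx.x;
    long long stride = (long long)blockDim.x * gridDim.x;
    partial[tid] = strided_sum(tid, stride, n);   // one partial sum per thread
}

// Example launch: 48 blocks of 256 threads; d_partial needs 48 * 256 floats.
inline void launch_runtest(float *d_partial, long long n)
{
    runtest<<<48, 256>>>(d_partial, n);
}
```

The same kernel works for any n and any launch size, since every thread simply steps through the iteration range by the total thread count. For a very large n, a float accumulator will lose precision, so the partial sums may need a wider type.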

Maybe your C skills are a little bit rusty, but ^ is NOT the power operator, it’s just XOR. The expression above should be less than 200 kilobytes.
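
To make the operator precedence concrete, a small host-side sketch (the function names are made up, and sizeof(float) is assumed to be 4). Since * binds tighter than ^ in C, the size expression parses as (128*1024) ^ (4*sizeof(float)):

```cuda
#include <cstddef>

// The size expression exactly as written in the question.
inline size_t bytes_as_written()
{
    return 128*1024^4 *sizeof(float);
}

// The same expression with the implicit grouping spelled out.
inline size_t bytes_with_parens()
{
    return (128*1024) ^ (4*sizeof(float));   // 131072 XOR 16
}

// What was probably intended: 128 * (1024 to the 4th power) floats.
inline unsigned long long bytes_intended()
{
    unsigned long long n = 1024ULL * 1024ULL * 1024ULL * 1024ULL;
    return 128ULL * n * sizeof(float);
}
```

bytes_as_written() comes to 131088 bytes (about 128 KB), while bytes_intended() is 2^49 bytes, which is the 512 terabytes mentioned earlier.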

For the rest, I can only second the tip to the original poster to look at the SDK examples.
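
On the 2D array part of the question, which has not been answered yet: device memory from cudaMalloc is a flat 1D buffer, so b_device[tid][k] is usually written as a manual row-major index. A minimal sketch, assuming each row holds M floats (flat_index, add_constant, rows and M are names picked for illustration). Note that the row stride is the row length M, not 128, and that the extra multiply-add is normally cheap next to the global memory access itself.

```cuda
#include <cuda_runtime.h>

// Row-major position of element [row][k] in a flat array with M columns.
__host__ __device__ inline int flat_index(int row, int k, int M)
{
    return row * M + k;
}

// One thread per row: adds c to every element of its own row.
__global__ void add_constant(float *b_device, int rows, int M, float c)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= rows) return;              // threads beyond the last row do nothing
    for (int k = 0; k < M; ++k)
        b_device[flat_index(tid, k, M)] += c;
}
```

As far as the pitch mentioned in the manual goes, cudaMallocPitch follows the same idea but pads each row to an aligned width (the pitch, in bytes), so the start of a row is found at row * pitch bytes instead of row * M elements.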