Hi,
I'm new to CUDA. Can anyone give me some advice?
I'm considering CUDA for parallelizing a for loop.
I have two questions.
First, I'll divide a loop of 1~N among 128 threads. N is 1024^4 (a large number of iterations),
so each of the 128 cores handles N/128 iterations.
The loop body calls one core kernel function.
Which launch configuration should I use to get full performance?
I want to create 128 threads (1 thread per core). Which of these should I use?
runtest<<<1,128>>>();
tid = threadIdx.x; (0~127)
or
runtest<<<32,8>>>();
tid = blockIdx.x*32 + threadIdx.x; (0~127?)
or
runtest<<<32,128>>>();
tid = blockIdx.x*32 + threadIdx.x;
I plan to use the <<<32,8>>> method. Is that the right setting?
If so, each thread runs 1024^4/128 iterations.
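A minimal sketch of how a launch configuration maps to thread indices, assuming the usual formula blockIdx.x * blockDim.x + threadIdx.x for the global index. The names record_index and count_indexed_threads are hypothetical and only serve to check the mapping.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its own global index into out[].
__global__ void record_index(int *out, int total)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < total) {
        out[tid] = tid;
    }
}

// Hypothetical host helper: launches blocks x threadsPerBlock threads and
// returns how many slots ended up holding their own index (-1 on failure).
int count_indexed_threads(int blocks, int threadsPerBlock)
{
    int total = blocks * threadsPerBlock;
    int *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, total * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0xFF, total * sizeof(int));       // every slot starts at -1
    record_index<<<blocks, threadsPerBlock>>>(d_out, total);
    std::vector<int> h_out(total);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    int ok = 0;
    for (int i = 0; i < total; i++) {
        if (h_out[i] == i) ok++;
    }
    return ok;
}
```

With this formula, <<<1,128>>>, <<<4,32>>> and <<<16,8>>> each launch 128 threads indexed 0~127, while <<<32,8>>> launches 256 threads and <<<32,128>>> launches 4096. Threads execute in warps of 32, so a block of only 8 threads would leave most of each warp unused.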
Second, each thread will use some 2D arrays inside the kernel function,
stored in global memory (example: float b_device[idx][k], idx = 0~127, k = 0~1000).
idx identifies the thread,
k is used for the computation.
Each core uses only the data selected by its own thread index.
Or should I use a 1D array instead:
arrayidx = threadID*128 + k;
a_device[arrayidx]; arrayidx = 0 ~ 128*1024^4
I think computing arrayidx = threadID*128 + k in every thread wastes computing time. Should I use a_device or b_device?
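The 1D indexing can be sketched as below. This assumes the array is stored row by row with M = 1001 elements per thread (k = 0~1000), so the row stride is M rather than 128. The row pointer is computed once per thread, and one multiply-add is usually cheap next to a global memory access. The names M, add_constant and run_add_constant are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define M 1001   // elements per thread, k = 0..1000 (assumed row length)

// Each thread adds c to its own row of a flat, row-major array.
__global__ void add_constant(float *a, float c, int rows)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= rows) return;
    float *row = a + tid * M;          // start of this thread's row, computed once
    for (int k = 0; k < M; k++) {
        row[k] += c;                   // same element as a[tid * M + k]
    }
}

// Hypothetical host helper: runs the kernel on a zeroed array and returns
// element [r][k] afterwards (-1.0f if the allocation fails).
float run_add_constant(int rows, float c, int r, int k)
{
    size_t bytes = (size_t)rows * M * sizeof(float);
    float *d_a = nullptr;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(d_a, 0, bytes);
    add_constant<<<(rows + 31) / 32, 32>>>(d_a, c, rows);
    std::vector<float> h_a((size_t)rows * M);
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return h_a[(size_t)r * M + k];
}
```

For example, run_add_constant(128, 2.5f, 127, 1000) reads back the last element of the last thread's row.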
If CUDA supports 2D arrays, I'll use b_device only.
Otherwise, is there any method to handle a 2D array in a device function?
In the manual I've read about memory pitch or something similar, but I did not understand it.
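A minimal sketch of the pitch idea, using the runtime calls cudaMallocPitch and cudaMemcpy2D. The pitch is the row length in bytes after padding, so row r starts at base + r * pitch bytes. The names fill_rows and pitched_element, and the fill value tid + k, are hypothetical and only there to make the layout checkable.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread fills its own row of a pitched 2D array with tid + k.
__global__ void fill_rows(float *base, size_t pitch, int rows, int cols)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= rows) return;
    // pitch is in bytes, so step through char* before casting back to float*.
    float *row = (float *)((char *)base + tid * pitch);
    for (int k = 0; k < cols; k++) {
        row[k] = (float)(tid + k);     // plays the role of b[tid][k]
    }
}

// Hypothetical host helper: allocates a rows x cols pitched array, fills it
// on the device and returns element [r][k] (-1.0f if the allocation fails).
float pitched_element(int rows, int cols, int r, int k)
{
    float *d_b = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void **)&d_b, &pitch, cols * sizeof(float), rows)
            != cudaSuccess) {
        return -1.0f;
    }
    fill_rows<<<(rows + 31) / 32, 32>>>(d_b, pitch, rows, cols);
    std::vector<float> h_b((size_t)rows * cols);
    // Copy row by row: the host rows are tightly packed, the device rows are padded.
    cudaMemcpy2D(h_b.data(), cols * sizeof(float), d_b, pitch,
                 cols * sizeof(float), rows, cudaMemcpyDeviceToHost);
    cudaFree(d_b);
    return h_b[(size_t)r * cols + k];
}
```

The padding chosen by cudaMallocPitch keeps every row aligned; the kernel only needs the base pointer and the pitch to find any row.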
My pseudo code is below (not perfect, sorry).
Any comments or advice are welcome.
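/* test.cu */
#include <cuda_runtime.h>

#define NTHREADS 128            // 1 thread per core
#define M        1001           // k = 0~1000 (128*1024^4 floats would not fit in device memory)
#define N        (1ULL << 40)   // 1024^4 iterations in total

// Called by every thread; each thread touches only its own row tid.
__device__ float kernel_function_a(float *a, float (*b)[M], float c, int tid)
{
    float d = 0.0f;
    for (int k = 0; k < M; k++) {
        int arrayidx = tid * M + k;        // wasting computing time?
        a[arrayidx] = a[arrayidx] + c;     // 1D array
        b[tid][k] = b[tid][k] + c;         // 2D array. is this supported?
        d = a[arrayidx];                   // placeholder for "function of a[k]"
    }
    return d;
}

__global__ void runtest(float *a, float (*b)[M], float *sum, float c)
{
    int tid = threadIdx.x;                 // or tid = blockIdx.x*32 + threadIdx.x;
    for (unsigned long long i = 0; i < N / NTHREADS; i++) {
        float x = kernel_function_a(a, b, c, tid);   // each thread calls only 1 function
        sum[tid] = sum[tid] + x;           // one slot per thread, so threads do not race on sum
    }
}

int main()
{
    float *a_device, (*b_device)[M], *sum_device;
    cudaMalloc((void **)&a_device, NTHREADS * M * sizeof(float));   // 1D array
    cudaMalloc((void **)&b_device, NTHREADS * M * sizeof(float));   // 2D array
    cudaMalloc((void **)&sum_device, NTHREADS * sizeof(float));
    cudaMemset(a_device, 0, NTHREADS * M * sizeof(float));
    cudaMemset(b_device, 0, NTHREADS * M * sizeof(float));
    cudaMemset(sum_device, 0, NTHREADS * sizeof(float));
    runtest<<<1, NTHREADS>>>(a_device, b_device, sum_device, 1.0f);  // 128 threads (1 thread per core)
    cudaDeviceSynchronize();
    cudaFree(a_device); cudaFree(b_device); cudaFree(sum_device);
    return 0;
}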
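One way to split the N iterations and combine the results can be sketched as follows. It uses a strided loop (thread t takes iterations t, t + nthreads, ...) instead of contiguous chunks of N/128, and one partial sum per thread so that no two threads write the same location. The function work is a hypothetical stand-in for the real per-iteration computation, and partial_sums and total_sum are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the real per-iteration work.
__device__ float work(unsigned long long)
{
    return 1.0f;
}

// Each thread accumulates its share of the n iterations in a register
// and writes a single partial sum at the end.
__global__ void partial_sums(float *partial, unsigned long long n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int nthreads = gridDim.x * blockDim.x;
    float s = 0.0f;
    for (unsigned long long i = tid; i < n; i += nthreads) {
        s += work(i);
    }
    partial[tid] = s;
}

// Hypothetical host helper: launches the kernel and adds up the partial
// sums on the host (-1.0 if the allocation fails).
double total_sum(unsigned long long n, int blocks, int threadsPerBlock)
{
    int nthreads = blocks * threadsPerBlock;
    float *d_partial = nullptr;
    if (cudaMalloc((void **)&d_partial, nthreads * sizeof(float)) != cudaSuccess) {
        return -1.0;
    }
    partial_sums<<<blocks, threadsPerBlock>>>(d_partial, n);
    std::vector<float> h_partial(nthreads);
    cudaMemcpy(h_partial.data(), d_partial, nthreads * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        total += h_partial[t];
    }
    return total;
}
```

The strided loop also covers the case where n is not a multiple of the thread count, and the same kernel works unchanged if more than 128 threads are launched.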