Hi…

I’m a newbie to CUDA. Can anyone give me some advice?

I’m considering CUDA for parallelizing a for loop.

I have two questions…

I’ll divide a loop over 1~N among 128 threads… N is 1024^4 (a large number of iterations).

Each of the 128 cores handles N/128 iterations.

Each iteration calls a core kernel function.

Which launch configuration should I use to get full performance?

I want to create 128 threads (1 thread per core). Which one should I use?

runtest<<<1,128>>>();

tid = threadIdx.x; (0~127)

or

runtest<<<32,8>>>();

tid = blockIdx.x*blockDim.x + threadIdx.x; (0~255: 32 blocks of 8 threads is 256 threads, not 128)

or

runtest<<<32,128>>>();

tid = blockIdx.x*blockDim.x + threadIdx.x; (0~4095)

I plan to use the <<<32,8>>> method… Is that the right setting?

If so, each thread runs 1~1024^4/128 iterations.
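For comparing the launch configurations above, here is a minimal sketch of the usual global-index pattern, `blockIdx.x * blockDim.x + threadIdx.x`, combined with a grid-stride loop so that any `<<<blocks, threads>>>` choice covers all iterations. The names `in`, `partial`, and `sum_ones` are hypothetical, and summing an input array only stands in for the real per-iteration work. Two things seem worth keeping in mind: 16 blocks of 8 threads (not 32) would give exactly 128 threads, and a single block runs on one multiprocessor, so `<<<1,128>>>` likely leaves the rest of the GPU idle. Threads also execute in warps of 32, so 8 threads per block leaves most of each warp unused.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles iterations tid, tid + total, tid + 2*total, ...
// and writes its partial sum into its own slot, so no two threads race.
__global__ void runtest(const float *in, float *partial, long long n) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int total = gridDim.x * blockDim.x;                  // threads in the whole grid
    float acc = 0.0f;
    for (long long i = tid; i < n; i += total)
        acc += in[i];                                    // stand-in for the real work
    partial[tid] = acc;
}

// Hypothetical host helper: sums n ones using blocks * threads GPU threads.
float sum_ones(int blocks, int threads, long long n) {
    int total = blocks * threads;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, total * sizeof(float));
    for (long long i = 0; i < n; i++) in[i] = 1.0f;
    runtest<<<blocks, threads>>>(in, partial, n);
    cudaDeviceSynchronize();                             // wait before reading results
    float s = 0.0f;
    for (int t = 0; t < total; t++) s += partial[t];     // final reduction on the host
    cudaFree(in);
    cudaFree(partial);
    return s;
}
```

Calling `sum_ones(1, 128, n)`, `sum_ones(16, 8, n)`, and `sum_ones(32, 128, n)` and timing each is probably the most reliable way to pick a configuration. In practice far more threads than cores are usually launched so the hardware can hide memory latency.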

The threads will use some 2D arrays inside the kernel function,

in global memory (example: float b_device[idx][k], idx = 0~127, k = 0~1000),

with idx for thread identification

and k for computation…

Each core uses only the data for its own thread index.

Or should I use a flat index?

arrayidx = threadID*K + k; (K = number of k values per thread, 1001 here)

a_device[arrayidx]; arrayidx = 0~128*K-1

I think computing arrayidx in every thread wastes computing time. Should I use a_device or b_device?
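On the cost of the index: a 2D access such as `b[tid][k]` compiles to the same multiply-add as the flat `tid * K + k`, so the two forms should cost about the same, and either is small next to the global memory access itself. The sketch below assumes the row length `K` is a compile-time constant, which allows a flat allocation to be viewed as a 2D array. `ROWS`, `K`, `add_c`, and `run_add_c` are hypothetical names for illustration.

```cuda
#include <cuda_runtime.h>

constexpr int ROWS = 128;   // one row per thread, as in the post
constexpr int K    = 1001;  // k = 0~1000

__global__ void add_c(float *flat, float c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= ROWS) return;
    // 2D view of the same memory; valid because K is known at compile time
    float (*b)[K] = reinterpret_cast<float (*)[K]>(flat);
    for (int k = 0; k < K; k++)
        b[tid][k] += c;     // same address as flat[tid * K + k]
}

// Hypothetical host helper: fills flat[i] = i, adds c on the GPU,
// and returns element (row, k) read back through the flat index.
float run_add_c(int row, int k, float c) {
    float *flat;
    cudaMallocManaged(&flat, ROWS * K * sizeof(float));
    for (int i = 0; i < ROWS * K; i++) flat[i] = (float)i;
    add_c<<<1, ROWS>>>(flat, c);
    cudaDeviceSynchronize();
    float v = flat[row * K + k];
    cudaFree(flat);
    return v;
}
```

If memory throughput matters, the layout may matter more than the index arithmetic: with one row per thread, neighbouring threads touch addresses `K` floats apart, whereas a transposed `[k][tid]` layout lets neighbouring threads read neighbouring floats.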

If CUDA supports 2D arrays, I’ll use b_device only.

Or is there any method to handle a 2D array in a device function?

In the manual, I’ve read about memory pitch and so on… but I did not understand it…
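As far as the pitch goes, a minimal sketch of the idea: `cudaMallocPitch` may pad each row so that rows start at aligned addresses, and it returns the padded row size in bytes (the pitch). Row `r` then starts at `base + r * pitch` bytes. The names `fill_rows` and `pitched_element` are hypothetical, and the sizes are taken from the 128 x 1001 example in the post.

```cuda
#include <cuda_runtime.h>

// One thread per row; each thread writes row * width + k into its row.
__global__ void fill_rows(float *base, size_t pitch, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    // pitch is in bytes, so step through a char pointer to find the row start
    float *r = (float *)((char *)base + row * pitch);
    for (int k = 0; k < width; k++)
        r[k] = (float)(row * width + k);
}

// Hypothetical host helper: returns element (row, k) copied back from pitched memory.
float pitched_element(int row, int k) {
    const int width = 1001, height = 128;
    float *dev;
    size_t pitch;
    cudaMallocPitch(&dev, &pitch, width * sizeof(float), height);
    fill_rows<<<1, height>>>(dev, pitch, width, height);
    static float host[128 * 1001];                  // densely packed host copy
    cudaMemcpy2D(host, width * sizeof(float),       // destination and its row size
                 dev, pitch,                        // source and its pitch
                 width * sizeof(float), height,     // bytes per row, number of rows
                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[row * width + k];
}
```

The kernel never uses `b[row][k]` syntax here; the pitch arithmetic replaces it. `cudaMemcpy2D` strips the padding on the way back, so the host array can stay densely packed.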

My pseudo code is below [not perfect… sorry]

Any comments or advice are welcome…

```
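/* test.cu */
#include <cstdio>
#include <cuda_runtime.h>
#define NTHREADS 128               // 1 thread per core
#define M 128                      // elements per thread (row length)
// In C, ^ is XOR, so 1024^4 would be 1024LL*1024*1024*1024; N is scaled down here (an assumption) to keep the run short.
#define N (128LL * 1024)

// Called from the kernel, so it is a __device__ function (there is no __kernel__).
__device__ float kernel_function_a(float *a, float *b, size_t pitch, int tid, float c) {
    float d = 0.0f;
    float *b_row = (float *)((char *)b + tid * pitch);  // row tid of the pitched 2D array
    for (int k = 0; k < M; k++) {
        int arrayidx = tid * M + k;  // row stride must equal the row length M
        a[arrayidx] += c;            // 1D array
        b_row[k] += c;               // 2D array, addressed through the pitch
        d += a[arrayidx];            // placeholder for "d = function of a[k]"
    }
    return d;
}

__global__ void runtest(float *a, float *b, size_t pitch, float *sum, float c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // equals threadIdx.x for <<<1,128>>>
    if (tid >= NTHREADS) return;
    float local = 0.0f;
    for (long long i = 0; i < N / NTHREADS; i++)
        local += kernel_function_a(a, b, pitch, tid, c);  // each thread calls only 1 function
    sum[tid] = local;  // one slot per thread; a shared sum[i] would be a data race
}

int main() {
    float *a_device, *b_device, *sum_device;
    size_t pitch;
    cudaMalloc(&a_device, NTHREADS * M * sizeof(float));              // 1D array
    cudaMallocPitch(&b_device, &pitch, M * sizeof(float), NTHREADS);  // 2D array
    cudaMalloc(&sum_device, NTHREADS * sizeof(float));
    cudaMemset(a_device, 0, NTHREADS * M * sizeof(float));
    cudaMemset2D(b_device, pitch, 0, M * sizeof(float), NTHREADS);
    runtest<<<1, NTHREADS>>>(a_device, b_device, pitch, sum_device, 1.0f);  // 128 threads
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(a_device); cudaFree(b_device); cudaFree(sum_device);
    return 0;
}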
```