How to use threads for A[1024][1024]

Hi sir,

I want to know how to use threads for A[1024][1024]. I want to run 1024 threads at a time in parallel. If that is possible, can anyone write the syntax for this A[1024][1024]?

I was launching the kernel as shown below, but I am getting errors. Can anyone help?

      my_kernel<<<1, dim3(1,1024)>>>(A);

As is written in the programming guide, the maximum number of threads you can use in a block is 512. It can be even lower, depending on the number of registers used by the kernel.
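A launch that asks for too many threads per block does not run at all, and the failure is easy to miss unless the error is checked. The following is a minimal sketch of how one might query the limit and report a failed launch. It assumes device 0, and the body of my_kernel is a placeholder because the original kernel was not posted; block_fits is a hypothetical helper that only checks the total thread count, not the per-dimension limits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (assumption: the original my_kernel body was not shown).
__global__ void my_kernel(float *A)
{
    int y = threadIdx.y;     // one thread per row index
    A[y * 1024] = 0.0f;      // touch the first element of row y
}

// Hypothetical helper: true if the total number of threads in the block
// does not exceed the device's per-block limit.
bool block_fits(dim3 block, int maxThreadsPerBlock)
{
    return (int)(block.x * block.y * block.z) <= maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumption: device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

    float *A;
    cudaMalloc((void**)&A, 1024 * 1024 * sizeof(float));

    dim3 block(1, 1024);                 // the block shape from the question
    if (!block_fits(block, prop.maxThreadsPerBlock))
        printf("a block of %u threads is too large for this device\n",
               block.x * block.y * block.z);

    my_kernel<<<1, block>>>(A);
    cudaError_t err = cudaGetLastError();   // catches an invalid launch configuration
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(A);
    return 0;
}
```

If the launch is rejected, the usual fix is to keep the block small (for example 256 threads) and cover the rest of the array with more blocks.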

buj, could you please stop posting the same question over and over again?

you already got your answer about 2-dimensional arrays: don’t use them. use a flat 1-dimensional array and compute the index yourself.

if you have a CPU code like this:

float A[1024][1024];

for(int y=0;y<1024;y++){
  for(int x=0;x<1024;x++){
    A[y][x]=0.0f;   //placeholder: whatever you want to do (row y, column x)
  }
}

and want to port this to CUDA, do something like the following:

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ static void cudaFunction(float *A){
  int x=threadIdx.x+blockIdx.x*blockDim.x;   //column: 4 blocks of 256 threads cover 0..1023
  int y=blockIdx.y;                          //row: one row of blocks per y
  A[x+y*1024]=0.0f;   //placeholder: whatever you want to do
}

int main(int argc,char *argv[]){
  float *A_host=(float*)malloc(1024*1024*sizeof(float));   //allocate the array in host memory
  //do preprocessing or whatever with A_host, use A_host[x+y*1024] to access elements
  float *A_device;
  cudaMalloc((void**)&A_device,1024*1024*sizeof(float));   //allocate the array in memory on your graphics card
  cudaMemcpy(A_device,A_host,1024*1024*sizeof(float),cudaMemcpyHostToDevice);   //copy the contents of the host array to the device array, so you can work with that on the gpu
  cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device);  //call the kernel on the gpu: grid of 4x1024 blocks, 256 threads per block
  cudaMemcpy(A_host,A_device,1024*1024*sizeof(float),cudaMemcpyDeviceToHost);   //copy the results from the gpu back to host memory
  //do whatever else you want to do with A_host
  cudaFree(A_device);   //release the device array
  free(A_host);         //release the host array
  return 0;
}

Ocire, don’t be rude to beginners. It’s OK if they ask again and again. It just means that they have not understood yet.

Hopefully, buj has understood things this time…

Best Regards,

Sarnath

thank you sir

Dear sir,

Thank you for the reply. The thing is, we are beginners, so we need some time to learn. Please kindly clarify my doubts.

Have a nice day, sir.

Thanking you, sir.

sorry for sounding rude. i should have added “in different threads”. it’s totally ok to post a “sorry, i don’t understand” after an answer, but, imho, opening a second thread with nearly the same question, and then posting that question as a reply in another thread on a completely different topic, isn’t.

it would help a lot if you could tell us what exactly you don’t understand in the code above. ;-)

I have gone through that code … I get it now.

thank you sir

:) I can understand…

Dear sir,

Can you tell me how to measure the GPU time for A[1024][1024] in the code you wrote? How do I initialize a timer and start it for my two-dimensional array?

Almost all SDK examples have timing included, you can copy&paste from there.

I have seen those SDK examples, but I am a little confused by them. Where do I have to initialize the timers in the program below? I want to know how much time it takes for A[1024][1024] on the GPU. Please help me, sir.

__global__ static void cudaFunction(float *A)
{
  int x=threadIdx.x+blockIdx.x*blockDim.x;
  int y=blockIdx.y;
  A[x+y*1024]=0.0f;   //placeholder: whatever you want to do
}

int main(int argc,char *argv[]){
  float *A_host=(float*)malloc(1024*1024*sizeof(float)); //allocate the array in host memory
  //do preprocessing or whatever with A_host, use A_host[x+y*1024] to access elements
  float *A_device;
  cudaMalloc((void**)&A_device,1024*1024*sizeof(float)); //allocate the array in memory on your graphics card
  cudaMemcpy(A_device,A_host,1024*1024*sizeof(float),cudaMemcpyHostToDevice); //copy the contents of the host array to the device array, so you can work with that on the gpu
  cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device); //call the kernel on the gpu
  cudaMemcpy(A_host,A_device,1024*1024*sizeof(float),cudaMemcpyDeviceToHost); //copy the results from the gpu back to host memory
  //do whatever else you want to do with A_host
  return 0;
}

Dear sir, can you explain the code below, and is there any other way to write this launch?

cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device);

Really, you should read the programming guide. Your questions suggest that reading it carefully will help you a lot.
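For readers following along: in that launch the first dim3 is the grid (4 x 1024 blocks) and the second is the block (256 x 1 threads). In x that gives 4*256 = 1024 threads, and in y there is one row of blocks per array row, so there is exactly one thread per element. One other way to write it is with 2-dimensional blocks. The sketch below is only an illustration under assumptions: the kernel name cudaFunction2D, the helper blocks_for, the 16x16 block shape, and the value written are all hypothetical choices, not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024

// Hypothetical variant of cudaFunction that uses 2D blocks.
__global__ void cudaFunction2D(float *A)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;   // column index
    int y = threadIdx.y + blockIdx.y * blockDim.y;   // row index
    if (x < N && y < N)                              // guard for sizes that do not divide evenly
        A[x + y * N] = 1.0f;                         // placeholder work
}

// Hypothetical helper: number of blocks needed to cover n elements
// when each block handles blockSize of them (rounds up).
int blocks_for(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

int main()
{
    float *A_host = (float*)malloc(N * N * sizeof(float));
    float *A_device;
    cudaMalloc((void**)&A_device, N * N * sizeof(float));

    dim3 block(16, 16, 1);                                         // 256 threads per block
    dim3 grid(blocks_for(N, block.x), blocks_for(N, block.y), 1);  // 64 x 64 blocks
    cudaFunction2D<<<grid, block>>>(A_device);

    cudaMemcpy(A_host, A_device, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("A[0] = %f, A[last] = %f\n", A_host[0], A_host[N * N - 1]);

    cudaFree(A_device);
    free(A_host);
    return 0;
}
```

Both layouts launch the same total number of threads (1024*1024); which one performs better depends on the kernel and the hardware, so it is worth measuring rather than assuming.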

Dear sir,

I have gone through the programming guide, but there are no timer examples in it. That is why I am asking like this. Please kindly clarify my doubts.

Thanking you, sir.

look at cutil.h (it has doxygen-style comments) or just use cudaEvents and cudaEventElapsedTime.
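A rough sketch of the cudaEvents approach, applied to the kernel from this thread, could look like the following. The helper name time_kernel_ms and the value written by the kernel are assumptions made for illustration; the event calls themselves (cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaEventDestroy) are from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cudaFunction(float *A)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = blockIdx.y;
    A[x + y * 1024] = 1.0f;   // placeholder work (assumption)
}

// Hypothetical helper: runs the kernel once and returns the elapsed
// GPU time in milliseconds, measured with CUDA events.
float time_kernel_ms(float *A_device)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);          // "initialize" the timer
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // start: recorded just before the kernel
    cudaFunction<<<dim3(1024 / 256, 1024, 1), dim3(256, 1, 1)>>>(A_device);
    cudaEventRecord(stop, 0);         // stop: recorded just after the kernel
    cudaEventSynchronize(stop);       // wait until the kernel and the stop event have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // time between the two events, in ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    float *A_device;
    cudaMalloc((void**)&A_device, 1024 * 1024 * sizeof(float));

    printf("kernel time: %f ms\n", time_kernel_ms(A_device));

    cudaFree(A_device);
    return 0;
}
```

The events go around the kernel launch only, so the cudaMemcpy transfers are not included in the measurement. The first launch can also include some one-time setup cost, so timing a second run may give a more representative number.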

ok … thank you sir