How to use threads for A[1024][1024]

buj · January 3, 2009, 3:14pm

Hi sir,

I want to know how to use threads for A[1024][1024]. I want to access 1024 threads at a time in parallel. If that is possible, can anyone write the syntax for this A[1024][1024]?

I was using the launch shown below, but I am getting errors. Can anyone help?

      my_kernel<<<1, dim3(1,1024)>>>(A);

E.D_Riedijk · January 3, 2009, 3:30pm

As the programming guide states, the maximum number of threads you can use in a block is 512. The usable number can be even lower, depending on how many registers the kernel uses.
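To check the limit on a particular card rather than taking the number from the guide, a sketch like the following may help. It assumes device 0 and uses a hypothetical empty kernel named `probe`. It reads `maxThreadsPerBlock` with `cudaGetDeviceProperties`, then launches one block that is too large and one that fits, printing what `cudaGetLastError` reports for each. A real kernel that uses many registers may still fail at the full block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to try out launch configurations.
__global__ void probe() {}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 is an assumption
        printf("no CUDA device found\n");
        return 1;
    }
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

    // Deliberately ask for one thread more than the block limit allows.
    probe<<<1, prop.maxThreadsPerBlock + 1>>>();
    cudaError_t err = cudaGetLastError();
    printf("oversized launch: %s\n", cudaGetErrorString(err));

    // The same launch at exactly the limit.
    probe<<<1, prop.maxThreadsPerBlock>>>();
    err = cudaGetLastError();
    printf("full-size launch: %s\n", cudaGetErrorString(err));

    cudaDeviceSynchronize();
    return 0;
}
```

The launch in the first post asks for 1024 threads in a single block, which is above the 512 limit quoted here, so checking `cudaGetLastError` after it is a quick way to see the reason for the failure.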

Ocire · January 3, 2009, 4:33pm

buj, could you please stop posting the same question over and over again?

you already got your answer to the 2-dimensional arrays: don’t use them.

if you have a CPU code like this:

      float A[1024][1024];
      for(int y=0;y<1024;y++){
        for(int x=0;x<1024;x++){
          A[y][x]=0.0f;   //placeholder: whatever you want to do (row index y first, so it matches x+y*1024)
        }
      }

and want to port this to CUDA, do something like the following:

      #include <cstdlib>
      #include <cuda_runtime.h>

      __global__ static void cudaFunction(float *A){
        int x=threadIdx.x+blockIdx.x*blockDim.x;   //column: 4 blocks of 256 threads span one row
        int y=blockIdx.y;                          //row: one row of blocks per matrix row
        A[x+y*1024]=0.0f;   //placeholder: whatever you want to do
      }

      int main(int argc,char *argv[]){
        float *A_host=(float*)malloc(1024*1024*sizeof(float));   //allocate the array in host memory
        //do preprocessing or whatever with A_host, use A_host[x+y*1024] to access elements
        float *A_device;
        cudaMalloc((void**)&A_device,1024*1024*sizeof(float));   //allocate the array in memory on your graphics card
        cudaMemcpy(A_device,A_host,1024*1024*sizeof(float),cudaMemcpyHostToDevice);   //copy the contents of the host array to the device array, so you can work with that on the gpu
        cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device);  //call the kernel on the gpu
        cudaMemcpy(A_host,A_device,1024*1024*sizeof(float),cudaMemcpyDeviceToHost);   //copy the results from the gpu back to host memory
        //do whatever else you want to do with A_host
        cudaFree(A_device);   //release the device array
        free(A_host);         //release the host array
        return 0;
      }
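As a concrete instance of the pattern above, here is a minimal sketch in which the placeholder is replaced by an assumed operation: each element is set to its own flat index, so the result can be checked on the host. The kernel name `fillIndex`, the constant `N`, and the check in `main` are illustrative choices, not part of the original post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024  // the matrix is N x N, stored as one flat array

// One thread per element: x comes from the block and thread index, y from blockIdx.y.
__global__ void fillIndex(float *A) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = blockIdx.y;
    A[x + y * N] = (float)(x + y * N);  // assumed operation, for illustration only
}

int main() {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *A_host = (float *)malloc(bytes);
    float *A_device = NULL;
    cudaMalloc((void **)&A_device, bytes);

    // 4 x 1024 blocks of 256 threads each cover all 1024 x 1024 elements.
    fillIndex<<<dim3(N / 256, N, 1), dim3(256, 1, 1)>>>(A_device);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaMemcpy waits for the kernel to finish before it copies.
    cudaMemcpy(A_host, A_device, bytes, cudaMemcpyDeviceToHost);

    // Every index below 2^24 is exactly representable as a float.
    int mismatches = 0;
    for (int i = 0; i < N * N; i++)
        if (A_host[i] != (float)i) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(A_device);
    free(A_host);
    return 0;
}
```

Writing a value that depends on the index makes it easy to spot a wrong `x`/`y` mapping, because a swapped or missing term shows up as mismatches on the host.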

Sarnath · January 4, 2009, 3:05am

Ocire, don't be rude to beginners. It's OK if they ask the same thing again and again. It just means that they have not understood.

Hopefully, buj has understood things this time…

Best Regards,

Sarnath

buj · January 4, 2009, 12:27pm

thank you sir

buj · January 4, 2009, 12:33pm

Dear sir,

Thank you for the reply. The thing is, we are beginners, so we need some time to learn. Please kindly clarify my doubts.

Have a nice day, sir.

Thanking you, sir

Ocire · January 4, 2009, 1:15pm

sorry for sounding rude. i should have added “in different threads”. it’s totally ok to post a “sorry, i don’t understand” after an answer, but, imho, opening a second thread with nearly the same question, and posting that question as a reply in another thread on a completely different topic, isn’t.

it would help a lot if you could tell us what exactly you don’t understand in the code above. ;-)

buj · January 4, 2009, 2:23pm

I have gone through that code … I got it now …

thank you sir

Sarnath · January 5, 2009, 5:31am

:) I can understand…

buj · January 7, 2009, 12:20pm

Dear sir,

Can you tell me how to measure the GPU time for the A[1024][1024] code that you wrote? How do I initialize and start a timer for my two-dimensional array?

E.D_Riedijk · January 7, 2009, 2:31pm

Almost all SDK examples have timing included, you can copy&paste from there.

buj · January 7, 2009, 4:36pm

I have seen those SDK examples, but I am a little confused about them. Where do I have to initialize the timers in the program below? I want to know how much time A[1024][1024] takes on the GPU. Please help me, sir.

      __global__ static void cudaFunction(float *A)
      {
        int x=threadIdx.x+blockIdx.x*blockDim.x;
        int y=blockIdx.y;
        A[x+y*1024]=0.0f; //placeholder: whatever you want to do
      }

      int main(int argc,char *argv[]){
        float *A_host=(float*)malloc(1024*1024*sizeof(float)); //allocate the array in host memory
        //do preprocessing or whatever with A_host, use A_host[x+y*1024] to access elements
        float *A_device;
        cudaMalloc((void**)&A_device,1024*1024*sizeof(float)); //allocate the array in memory on your graphics card
        cudaMemcpy(A_device,A_host,1024*1024*sizeof(float),cudaMemcpyHostToDevice); //copy the contents of the host array to the device array, so you can work with that on the gpu
        cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device); //call the kernel on the gpu
        cudaMemcpy(A_host,A_device,1024*1024*sizeof(float),cudaMemcpyDeviceToHost); //copy the results from the gpu back to host memory
        //do whatever else you want to do with A_host
        return 0;
      }

buj · January 7, 2009, 4:39pm

Dear sir, can you explain the code below, and is there any other way to write this launch?

cudaFunction<<<dim3(1024/256,1024,1),dim3(256,1,1)>>>(A_device)

E.D_Riedijk · January 7, 2009, 6:08pm

Really, you should read the programming guide. Your questions suggest that reading it carefully will help you a lot.
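For readers with the same question: in `<<<grid, block>>>` the first `dim3` is the number of blocks and the second is the number of threads per block. `dim3(1024/256,1024,1)` therefore asks for 4 × 1024 blocks and `dim3(256,1,1)` for 256 threads in each, which gives 4 × 1024 × 256 = 1024 × 1024 threads, one per element. One alternative, sketched below under the assumption of 16 × 16 thread blocks and a hypothetical kernel `fill2D`, is to make both the grid and the block two-dimensional. The bounds check lets the same kernel handle sizes that are not multiples of 16.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: with 2D blocks, both x and y use the thread index.
__global__ void fill2D(float *A, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height)      // skip threads that fall outside the array
        A[x + y * width] = 1.0f;      // assumed operation, for illustration only
}

int main() {
    const int width = 1024, height = 1024;
    float *A_device = NULL;
    cudaMalloc((void **)&A_device, (size_t)width * height * sizeof(float));

    dim3 block(16, 16, 1);                           // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,       // round up: 64 blocks in x
              (height + block.y - 1) / block.y, 1);  // and 64 blocks in y
    fill2D<<<grid, block>>>(A_device, width, height);

    printf("grid %u x %u, block %u x %u: %s\n", grid.x, grid.y, block.x, block.y,
           cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();
    cudaFree(A_device);
    return 0;
}
```

Both layouts launch the same total number of threads; which one performs better depends on the kernel and the hardware, so it is worth timing both.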

buj · January 8, 2009, 5:41pm

Dear sir,

I have gone through the programming guide, but there are no timer examples in it. That is why I am asking. Please kindly clarify my doubts.

Thanking you, sir

tmurray · January 8, 2009, 6:41pm

look at cutil.h (it has doxygen-style comments) or just use cudaEvents and cudaEventElapsedTime.
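A minimal sketch of the `cudaEvent` approach, applied to the kernel from earlier in the thread. The events are created before the launch and recorded immediately before and after it, and `cudaEventElapsedTime` returns the time between them in milliseconds. The kernel body (writing 0) is a placeholder assumption, and the measurement covers only the kernel, not any `cudaMemcpy` calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cudaFunction(float *A) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = blockIdx.y;
    A[x + y * 1024] = 0.0f;  // placeholder operation
}

int main() {
    float *A_device = NULL;
    cudaMalloc((void **)&A_device, 1024 * 1024 * sizeof(float));

    // Create the two events that will bracket the kernel launch.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);  // start of the timed region
    cudaFunction<<<dim3(1024 / 256, 1024, 1), dim3(256, 1, 1)>>>(A_device);
    cudaEventRecord(stop, 0);   // end of the timed region
    cudaEventSynchronize(stop); // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A_device);
    return 0;
}
```

To include the transfers in the measurement, the two `cudaEventRecord` calls can be moved so that they also bracket the `cudaMemcpy` calls. The first launch in a program often carries one-time startup cost, so timing a second run may give a more representative figure.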

buj · January 9, 2009, 2:14pm

ok … thank you sir
