Looking for further suggestions to speed up the code

Hi there,
I followed an example I found online to port my serial C++ code to CUDA. What I learned from that code is to replace a two-fold nested loop with a kernel launched over a certain number of grids and threads. I have the number of grids and the number of threads both set to 600. My device is a Tesla C2075. How can I find out the maximum number of threads and the maximum number of grids this device supports?

The following CUDA version speeds up my code and saves me 10 hours, but it still takes quite a long time to run. Is there any room to speed it up further? I find that even after unrolling two nested loops into a kernel, there are still two more loops that make it run quite slowly.

#include <cuda.h>
#include <cstdio>   // printf, puts
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600

int InitGPUSet()
{
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      printf("GPU: %s\n", tCard.name);  // print each device name
    }
  }
  else return 0;
  return 1;
}

__global__ void findStd(double A0, double B0, int *data)
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  int res=0;
  // some works here to calculate the result based on x, y, A0 and B0
  // res will be either 1 or 0 based on above result
  if (res==1) atomicAdd(data,1);
}

int main(void)
{
  if(!InitGPUSet())  
  {
    puts("device is not ready!");  
    cout << "error" << endl;
  }
  else  
  {  
    int AR[400][400];
    int *Hdata;
    int *Ddata;

    Hdata = (int*)malloc(sizeof(int));
    cudaMalloc((void**)&Ddata, sizeof(int));

    for (int i=0; i<400; i++)
    {  
      for (int j=0; j<400; j++)
      {
        *Hdata = 0;
        cudaMemcpy(Ddata, Hdata, sizeof(int), cudaMemcpyHostToDevice);
        findStd<<<N, N>>>(0, 0, Ddata);
        cudaMemcpy(Hdata, Ddata, sizeof(int), cudaMemcpyDeviceToHost);
        if ((*Hdata)>800)
        {
          AR[i][j] = 1;
        }
        else AR[i][j] = 0;
      }
    }
    free(Hdata);
    cudaFree(Ddata);
  }
}

I am reading other examples about nested loops. In the code above, I have to run the kernel 400x400 times, and each kernel is invoked as findStd<<<600, 600>>>. So in this case, does it mean I will run 600 blocks, each with 600 threads? If that's what it means, I am wondering whether I could set up two-dimensional grids and two-dimensional blocks to remove all those loops. The code is as follows:

#include <cuda.h>
#include <cstdio>   // printf, puts
#include <iostream>
#include "matio.h"

using namespace std;

#define N 600
#define M 400

int InitGPUSet()
{
  cudaDeviceProp tCard;
  int num = 0;
  if (cudaSuccess == cudaGetDeviceCount(&num))
  {
    for (int i = 0; i < num; ++i)
    {
      cudaSetDevice(i);
      cudaGetDeviceProperties(&tCard, i);
      printf("GPU: %s\n", tCard.name);  // print each device name
    }
  }
  else return 0;
  return 1;
}

__global__ void findStd(double A0, double B0, int *data)
{
  unsigned int x = gridIdx.x;
  unsigned int y = gridIdx.y;
  unsigned int a = blockIdx.x;
  unsigned int b = blockIdx.y;
  int res = 0;
  // some work here to calculate the result based on x, y, A0 and B0
  // res will be either 1 or 0 based on the above result
  if (res == 1) atomicAdd(data, 1);
}

int main(void)
{
  if (!InitGPUSet())
  {
    puts("device is not ready!");
    cout << "error" << endl;
  }
  else
  {
    int AR[M*M];
    int *Hdata;
    int *Ddata;

    Hdata = (int*)malloc(M*M*sizeof(int));
    cudaMalloc((void**)&Ddata, M*M*sizeof(int));

    for (int i = 0; i < M*M; i++) Hdata[i] = 0;
    cudaMemcpy(Ddata, Hdata, M*M*sizeof(int), cudaMemcpyHostToDevice);
    findStd<<<grids, blocks>>>(0, 0, Ddata);
    cudaMemcpy(Hdata, Ddata, M*M*sizeof(int), cudaMemcpyDeviceToHost);
    ...
    free(Hdata);
    cudaFree(Ddata);
  }
}

Do you think this will work? Our cluster is down and will not be back for 24 hours, so I cannot test it. But I think the scheme works. Any comments or suggestions are welcome. Thanks.

You can go to this page https://developer.nvidia.com/cuda-gpus and find the compute capability (cc) of your GPU. The Tesla C2075 is 2.0. Then you can check the Wikipedia page http://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications, where you can see all the specifications. For cc 2.0 you have 1024 threads per block and a maximum grid of 65535 x 65535 x 65535 blocks.

Alternatively, you can check the NVIDIA SDK examples installed in your /usr/local/cuda folder: compile the deviceQuery program located in the 1_Utilities folder, and it will print all this information.
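If you just need the launch limits programmatically, you can also query cudaDeviceProp yourself instead of running the full deviceQuery sample; a minimal sketch for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // properties of device 0

  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Max block dims: %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("Max grid dims:  %d x %d x %d\n",
         prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}
```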

Thanks. I just tested the code on another computer, which also supports 65535x65535 grids, 1024x1024x64 block dimensions, and 1024 threads per block. But I found that the above code doesn't work. One thing confuses me; I found the following description in the CUDA documentation:

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);

So the first parameter in <<< >>> is the dimension of the grid and the second parameter is the dimension of the block. However, in the body of the kernel function, the following is invalid because gridIdx.x and gridIdx.y are not defined:

unsigned int x = gridIdx.x;
unsigned int y = gridIdx.y;
unsigned int a = blockIdx.x;
unsigned int b = blockIdx.y;

In other online documentation, someone said the first parameter is the dimension of the blocks and the second one is the dimension of the threads. So which one is correct?

How can I get NxN grids and MxM blocks to work? Also, if I have MxM blocks working, does that imply that all 1024 threads work in each block, or can I specify how many threads work in each block?

It is correct. This will create 256 blocks with 256 threads per block. You can also use 32 x 32, or 1024 x 1; they are equivalent. Also, do not forget to compile for cc 2.0 or higher with the flag --arch=sm_20.
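For reference, inside the kernel the coordinates come from the built-ins blockIdx, blockDim and threadIdx (there is no gridIdx). A minimal sketch of a 2D launch with a bounds guard, assuming a 600x600 domain and a hypothetical findStd2D kernel along the lines of the posted code:

```cuda
#include <cuda_runtime.h>

#define N 600  // domain size, as in the posted code

__global__ void findStd2D(double A0, double B0, int *data)
{
  // global (x, y) coordinate of this thread across the whole grid
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= N || y >= N) return;  // guard threads outside the domain

  int res = 0;
  // ... compute res from x, y, A0 and B0 ...
  if (res == 1) atomicAdd(data, 1);
}

int main()
{
  int *Ddata;
  cudaMalloc((void**)&Ddata, sizeof(int));
  cudaMemset(Ddata, 0, sizeof(int));

  // 16x16 = 256 threads per block; enough blocks to cover the NxN domain
  dim3 block(16, 16);
  dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
  findStd2D<<<grid, block>>>(0.0, 0.0, Ddata);
  cudaDeviceSynchronize();

  cudaFree(Ddata);
  return 0;
}
```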

Sorry for asking this again, since I am still confused. So if I run the kernel in the following way:

kernel<<<dim3(256, 256), 1024>>>

Does it mean I will have 256x256 blocks and each block will have 1024 threads?

In the real code, for Tesla C2075, I run my kernel as

kernel<<<dim3(600, 600), 1024>>>

so it will have 600x600 blocks, each running 1024 threads?

If the second parameter determines the number of threads, I think it would be wrong to use it as

kernel<<<dim3(600, 600), dim3(400, 400)>>>

I just don't understand why the CUDA documentation says the first dim3 is for the grid and the second dim3 is for the block.

Yes. It would be wrong to have

kernel<<<dim3(600, 600), dim3(400, 400)>>>

This would imply 160,000 threads per block, far beyond the 1024-thread limit.
The first argument tells the GPU the number of blocks (the grid dimensions). The second argument tells it the number of threads per block (the block dimensions).
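To make the mapping concrete, here is a sketch of how the two launch arguments relate to the built-in variables; the kernel is a placeholder:

```cuda
#include <cuda_runtime.h>

__global__ void kernelStub() { }  // placeholder kernel

int main()
{
  // <<<grid, block>>>: first argument = grid (how many blocks),
  //                    second argument = block (threads per block)
  dim3 grid(600, 600);  // gridDim.x * gridDim.y = 360,000 blocks
  dim3 block(256);      // blockDim.x = 256 threads per block (limit is 1024 on cc 2.0)

  kernelStub<<<grid, block>>>();  // launches 360,000 * 256 threads in total
  cudaDeviceSynchronize();
  return 0;
}
```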

Thanks, it makes sense now. So how do we choose how many grids are used for a kernel? Or does one kernel correspond to one grid? Can we start multiple kernels in parallel so as to use multiple grids simultaneously?

You can run concurrent kernels using streams.
When the code encounters a line like kernel<<<dim3(600, 600), dim3(16, 16)>>>, the host sends the command to run the kernel and then immediately goes on to the next line. If you have this line 10 times, the host will keep executing whatever comes after, and if there is no explicit synchronization with the GPU it might even close the program before the GPU part is finished. So you can submit kernels on different streams to the GPU, and those will execute concurrently as resources on the GPU become available.
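A minimal sketch of launching kernels on several streams along those lines; the work kernel and per-stream buffers are hypothetical:

```cuda
#include <cuda_runtime.h>

__global__ void work(int *out) { atomicAdd(out, 1); }  // placeholder kernel

int main()
{
  const int nStreams = 4;
  cudaStream_t streams[nStreams];
  int *buf[nStreams];

  for (int s = 0; s < nStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc((void**)&buf[s], sizeof(int));
    cudaMemsetAsync(buf[s], 0, sizeof(int), streams[s]);
    // the fourth <<<>>> argument selects the stream; kernels in different
    // streams may overlap when the GPU has free resources
    work<<<dim3(600, 600), 256, 0, streams[s]>>>(buf[s]);
  }

  cudaDeviceSynchronize();  // explicit sync before the host exits
  for (int s = 0; s < nStreams; ++s) {
    cudaFree(buf[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}
```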