Fast substitution over a 3D array

I have a 3D array of floats of dimension 256^3. If I call the dimensions of this cube [X, Y, Z], then I would like to calculate each cell value of the cube with the following substitution:

Z * X * (-1 + cos((1/20) * (40*Z^2 + 10*X^2 + 10*Y^2)^(1/2)))

In the above equation, X, Y and Z are the index values of the corresponding cell in the cube.

I guess that using CUDA for this calculation will be 10 times faster than the CPU. Do I need to use the CUBLAS library, or do I need to write raw device functions?

Also, is there any link, sample, or source in the CUDA SDK for exactly this kind of substitution operation?

I am a little confused by your terminology. Are you simply trying to populate the matrix based on your function? In that case the matrix will contain the same values each time it is populated, right?

Assuming you just want to populate the matrix:

On a machine with a Q6600 overclocked to 3.2 GHz and an 8800 GTX, compiled with gcc 4.1.2 20061115 and -O2, I see 0.88 s on the CPU (1.19 s with -ffloat-store) and 0.15 s on the GPU (a lot of that time is in the memcpy). Of course the CPU version could easily be improved with OpenMP, probably to the point of matching the GPU version (see the sketch after the CPU listing below). GPU occupancy is 100%.

method=[ matGen_kernel ] gputime=[ 4794.016 ] cputime=[ 4808.000 ] occupancy=[ 1.000 ]

method=[ memcopy ] gputime=[ 35763.070 ]

With n=512 the difference is more pronounced, the CPU requiring 6.8 s and the GPU 0.59 s.

CPU version:

#include <stdio.h>
#include <time.h>
#include <math.h>

int main() {
  clock_t c = clock();

  const int n = 256;
  float *mat = new float[n*n*n];
  float *p = mat;
  for(int z=0; z<n; z++) {
    for(int y=0; y<n; y++) {
      for(int x=0; x<n; x++) {
        *p++ = z*x*(-1 + cosf(1/20.f * sqrtf(40.f * z*z + 10.f * x*x + 10.f * y*y)));
      }
    }
  }
  printf("%f\n", (clock() - c) / double(CLOCKS_PER_SEC));

  {
    FILE *file = fopen("cpu.txt", "w");
    for(int i=0; i<n*n*n; i++) {
      fprintf(file, "%f\n", mat[i]);
    }
    fclose(file);
  }

  delete[] mat;
  return 0;
}
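As noted above, the CPU version parallelises easily with OpenMP. A minimal sketch of how that variant might look (my own assumption, not the code that was timed; compile with -fopenmp), using explicit indexing in place of the running pointer so each thread writes its own disjoint slab:

#include <stdio.h>
#include <math.h>

int main() {
  const int n = 256;
  float *mat = new float[n*n*n];
  // Parallelise the outer z loop; explicit indexing means every thread
  // fills a separate contiguous slab of the cube.
  #pragma omp parallel for
  for(int z=0; z<n; z++) {
    for(int y=0; y<n; y++) {
      for(int x=0; x<n; x++) {
        mat[z*n*n + y*n + x] = z*x*(-1 + cosf(1/20.f * sqrtf(40.f * z*z + 10.f * x*x + 10.f * y*y)));
      }
    }
  }
  printf("%f\n", mat[n*n*n - 1]);   // keep the result live so the loop isn't optimised away
  delete[] mat;
  return 0;
}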

GPU version:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <cutil.h>

#define CUDA_CHECK_ERROR() \
  {\
    cudaError_t ce = cudaGetLastError();\
    if(ce != cudaSuccess) {\
      printf("%s\n", cudaGetErrorString(ce));\
      exit(EXIT_FAILURE);\
    }\
  }

const int n = 256;

__global__ void matGen_kernel(float *mat) {
  const int z = blockIdx.x;
  const int y = blockIdx.y;
  const int x = threadIdx.x;
  mat[z*256*256 + y*256 + x] = z*x*(-1 + cosf(1/20.f * sqrtf(40.f * z*z + 10.f * x*x + 10.f * y*y)));
}

int main() {
  clock_t c = clock();

  float *mat_dev;
  float *mat = new float[n*n*n];
  CUDA_SAFE_CALL(cudaMalloc((void**)&mat_dev, n*n*n*sizeof(float)));
  CUDA_CHECK_ERROR();

  dim3 grid(n, n);
  dim3 threads(n);
  matGen_kernel<<<grid, threads>>>(mat_dev);
  CUDA_SAFE_CALL(cudaMemcpy(mat, mat_dev, n*n*n*sizeof(float), cudaMemcpyDeviceToHost));
  CUDA_CHECK_ERROR();

  printf("%f\n", (clock() - c) / double(CLOCKS_PER_SEC));

  {
    FILE *file = fopen("gpu.txt", "w");
    for(int i=0; i<n*n*n; i++) {
      fprintf(file, "%f\n", mat[i]);
    }
    fclose(file);
  }

  CUDA_SAFE_CALL(cudaFree(mat_dev));
  CUDA_CHECK_ERROR();

  delete[] mat;
  return 0;
}

But if you don’t need the result in CPU memory afterwards, you can leave the memcpy out of the timings and replace it with a cudaThreadSynchronize(), in which case the speedup will probably be a lot higher.

Thank you very much for your reply. In the future I don’t need to copy the result back to CPU memory, so you are right. But I don’t understand where to use the cudaThreadSynchronize() function. On which line do I need to add this call, and for what purpose? Aren’t these threads already synchronized?

Thank you very much; everything is very clear and this is what I was looking for. After reading the SDK again and again, I now understand how it works.

I only have the following questions:

  1. What is the occupancy here, and how is it measured?
  2. Is there any special reason not to use the __cosf device function instead of cosf?

Best

Based on Appendix B of the programming guide, it looks like __cosf() is only a good idea if you know your argument is in the range [-pi, pi]. cosf() is accurate over the interval [-48039, +48039], presumably because it does some kind of argument reduction before calling __cosf().
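For reference, a sketch of what the intrinsic version of the kernel would look like (hypothetical kernel name; not recommended for this formula, since the cosine argument grows far beyond [-pi, pi] as x, y, z increase):

__global__ void matGen_kernel_intrinsic(float *mat) {
  const int z = blockIdx.x;
  const int y = blockIdx.y;
  const int x = threadIdx.x;
  // __cosf is the fast hardware intrinsic; it is only accurate for arguments near [-pi, pi].
  mat[z*256*256 + y*256 + x] = z*x*(-1 + __cosf(1/20.f * sqrtf(40.f * z*z + 10.f * x*x + 10.f * y*y)));
}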

I’m not sure cudaThreadSynchronize() is needed between kernel calls, but it is needed if you want to measure the timing of the kernel. Simply insert the call immediately after the kernel launch; without it the kernel will appear to have completed much faster than it actually did.
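For example, the relevant part of main() in the GPU listing above, with the device-to-host copy dropped, might look like this (a sketch of the modification, not a complete program):

  clock_t c = clock();
  matGen_kernel<<<grid, threads>>>(mat_dev);
  // Kernel launches are asynchronous; block here so the clock() reading below
  // measures the kernel itself rather than just the launch overhead.
  cudaThreadSynchronize();
  CUDA_CHECK_ERROR();
  printf("%f\n", (clock() - c) / double(CLOCKS_PER_SEC));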

At least two ways…

Set the environment variable CUDA_PROFILE to 1, run your application, and then examine the newly generated file. I forget the name… cuda_profile.txt?

Use the -keep compiler option, compile your program, examine the .cubin file, and plug the numbers into the NVIDIA-provided Excel occupancy calculator spreadsheet.
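For this particular kernel the reported 100% can also be checked by hand. A rough sketch of the arithmetic, assuming G80-class (8800 GTX) limits of 768 resident threads per multiprocessor (24 warps) and no register or shared-memory pressure:

#include <stdio.h>

int main() {
  const int warp_size        = 32;
  const int max_warps_per_sm = 768 / warp_size;                    // 24 on G80 (assumed limit)
  const int block_threads    = 256;                                // the kernel's block size
  const int warps_per_block  = block_threads / warp_size;          // 8
  const int blocks_per_sm    = max_warps_per_sm / warps_per_block; // 3 blocks fit per SM
  const int resident_warps   = blocks_per_sm * warps_per_block;    // 24
  printf("occupancy = %.3f\n", (double)resident_warps / max_warps_per_sm);  // prints 1.000
  return 0;
}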

You might want to experiment with how the x, y, z indices are calculated, as different memory access patterns can have quite an impact.
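For example, a hypothetical variant that lets threadIdx.x run over z instead of x makes neighbouring threads write elements 256*256 floats apart; the stores are then no longer coalesced, which on G80-class hardware is typically much slower:

__global__ void matGen_kernel_strided(float *mat) {
  const int z = threadIdx.x;   // thread index now covers the slowest-varying dimension
  const int y = blockIdx.y;
  const int x = blockIdx.x;
  // Same formula as before, but adjacent threads now write with a 256*256-element stride.
  mat[z*256*256 + y*256 + x] = z*x*(-1 + cosf(1/20.f * sqrtf(40.f * z*z + 10.f * x*x + 10.f * y*y)));
}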