CUDA code optimization: trying to make this run faster


I’m using GPU-based computing via CUDA to show its efficiency in computationally demanding environments. My task is to farm out the source term of the Teukolsky equations to the GPU to allow for faster computation. For single-precision elements I have been able to show a considerable speedup over the CPU, but this application requires double precision, and this is where my lack of experience in CUDA really catches up with me: I am stuck as to where and how to optimize this code. There may be something painfully obvious to those of you who know CUDA much better than I do. It’s a particularly painful but VERY necessary calculation in the field of astrophysics, and any help provided will be acknowledged in the publication of my research when it is complete. Below is my code as I have written it so far.

#include <stdlib.h>   /* malloc */
#include <cuda.h>
#include "mycmplx.h"

#define PTS 240

__global__ void csourced(double, double, double, double,
                         double, double, double,
                         double, double, double,
                         double, double,
                         double*, double*, double*);

__global__ void csourced(double*, double*, double*);


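extern "C" {

/* Host-side entry point; every argument is passed by pointer. */
void csource_(double *cth, double *crp, double *cphip, double *ctp,
              double *cdrdt, double *cd2rdt2, double *cd3rdt3,
              double *cdthdt, double *cd2thdt2, double *cd3thdt3,
              double *cdphidt, double *cd2phidt2,
              double *cr, double *ctre, double *ctim, int *cip)
{
  int k, size;
  double* rh;    /* host copy of the PTS input points taken from cr */
  double* rd;    /* device copy of rh */
  double* tre;   /* host buffer for the first output array */
  double* tim;   /* host buffer for the second output array */
  double* tred;  /* device buffer copied back into tre */
  double* timd;  /* device buffer copied back into tim */

  size = PTS * sizeof( double );

  rh = (double *)malloc(size);
  tre = (double *)malloc(size);
  tim = (double *)malloc(size);

  /* Gather a window of PTS values from cr, offset by *cip. */
  for (k = 0; k < PTS; k++)
  {
    rh[k] =
      *(cr + *cip - (PTS / 2 - 1) + k);
  }

  cudaMalloc((void**)&rd, size);
  cudaMalloc((void**)&tred, size);
  cudaMalloc((void**)&timd, size);

  cudaMemcpy(rd , rh, size, cudaMemcpyHostToDevice);

  /* Launch configuration: PTS blocks of one thread each. */
  dim3 dimBlock(1);
  dim3 dimGrid (PTS/dimBlock.x);

  csourced<<<dimGrid, dimBlock>>>(*cth, *crp, *cphip, *ctp,
                                  *cdrdt, *cd2rdt2, *cd3rdt3,
                                  *cdthdt, *cd2thdt2, *cd3thdt3,
                                  *cdphidt, *cd2phidt2,
                                  rd, tred, timd);

  cudaMemcpy( tre, tred, size, cudaMemcpyDeviceToHost );
  cudaMemcpy( tim, timd, size, cudaMemcpyDeviceToHost );

  /* Copy the results into the caller's output arrays. */
  for (k = 0; k < PTS; k++)
  {
    *(ctre + k) = tre[k];
    *(ctim + k) = tim[k];
  }

  /* Release the buffers so repeated calls do not leak memory. */
  cudaFree(rd);
  cudaFree(tred);
  cudaFree(timd);
  free(rh);
  free(tre);
  free(tim);
}

}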

/* Overload that only writes each thread's global index into both output arrays. */
__global__ void csourced(double* rr, double* ans1, double* ans2) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  ans1[idx] = (double)idx;
  ans2[idx] = (double)idx;
}


__global__ void csourced(double th, double rp, double phip, double tp,
                         double drdt, double d2rdt2, double d3rdt3,
                         double dthdt, double d2thdt2, double d3thdt3,
                         double dphidt, double d2phidt2,
                         double* rr, double* ans1, double* ans2) {

  cudacomplex i;
  i.real = 0.f;
  i.img = 1.f;

  int j;
  double M, a, nmu;
  double r, wt, wr, stheta, ctheta, mm, pie, lz, E, Q;
  double ff, cs2, delta;
  double DelR, DelR1, DelR2;
  double DelTH, DelTH1, DelTH2;

After this, the kernel goes on to calculate the 4000 pieces of the source term, which aren’t CUDA dependent. Is there any way to optimize the code I have provided (i.e., make it faster)? I’m not asking you to rewrite it for me, but I don’t have much experience with CUDA, and what I have written so far has been very painful, to say the least.

This will probably help.

I read through the article and found a couple of areas where I feel I can improve the overall code, mostly in the mathematical operations. I guess I wasn’t really clear with my question at first, so I’ll try to rephrase it: did I set up the CUDA device to perform in the most efficient way?
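
On the device setup specifically: `dim3 dimBlock(1)` in the posted code launches 240 blocks of one thread each. The hardware schedules threads in warps of 32, so a one-thread block occupies a warp with a single active lane. Below is a minimal sketch of the usual alternative, which puts several warps' worth of threads in each block and guards against the rounded-up grid. The names `sourceSketch`, `blocksFor`, `launchSketch`, and `THREADS_PER_BLOCK` are hypothetical, the block size of 64 is an assumption rather than a tuned value, and the kernel body is placeholder arithmetic standing in for the real source term.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 64  // assumption: any multiple of the 32-thread warp size

// Hypothetical stand-in for csourced: one thread handles one input point.
__global__ void sourceSketch(const double* rr, double* ans1, double* ans2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;          // the grid is rounded up, so skip the extra threads
    ans1[idx] = rr[idx];           // placeholder arithmetic, not the source term
    ans2[idx] = 2.0 * rr[idx];
}

// Number of blocks needed to cover n elements (ceiling division).
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Copies n inputs to the device, runs the kernel, and copies both outputs back.
cudaError_t launchSketch(const double* rh, double* tre, double* tim, int n)
{
    size_t size = n * sizeof(double);
    double *rd = NULL, *tred = NULL, *timd = NULL;

    cudaMalloc((void**)&rd, size);
    cudaMalloc((void**)&tred, size);
    cudaMalloc((void**)&timd, size);
    cudaMemcpy(rd, rh, size, cudaMemcpyHostToDevice);

    sourceSketch<<<blocksFor(n, THREADS_PER_BLOCK), THREADS_PER_BLOCK>>>(rd, tred, timd, n);
    cudaError_t err = cudaGetLastError();   // reports an invalid launch configuration

    // A blocking cudaMemcpy waits for the kernel to finish before copying.
    if (err == cudaSuccess) err = cudaMemcpy(tre, tred, size, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) err = cudaMemcpy(tim, timd, size, cudaMemcpyDeviceToHost);

    cudaFree(rd);
    cudaFree(tred);
    cudaFree(timd);
    return err;
}
```

With `PTS` equal to 240, `blocksFor(240, 64)` gives 4 blocks (256 threads), which is why the `idx >= n` guard is needed. Whether 64 is the right block size for the full kernel is something to measure: a kernel with that many local variables may be limited by registers per thread, so it is worth trying a few multiples of 32.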