Vertex shader vs. CUDA: performance seems wrong

I have a simulation in which I render an ocean surface, similar to the oceanFFT demo, except mine is much larger. I wrote a vertex shader to do the calculations for a grid of 611,806 verts, and it runs reasonably well: I get roughly 75 frames per second out of it. I am trying to port the same code to CUDA, expecting a performance gain, or at worst the same performance, but I am getting much worse results, less than 20 frames per second. Everything I'm reading leads me to believe that anything a shader can do, CUDA can do better, so I'm sure the flaw lies in my implementation.
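To put the workload in perspective: that is 611,806 x 15 sine evaluations per frame (num_waves is 15, see the file below), roughly 9.2 million, which at 75 fps works out to on the order of 690 million sines per second in the shader version.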

The graphics card I'm using is a 9800 GTX+, and I'm using the Direct3D interoperability functions to map the vertex buffer and hand it off to a CUDA kernel that performs the calculations. Below is a paste of my CUDA file (after a quick sketch of the mapping code). It doesn't look too complicated; the only expensive calculation it does is a sin() evaluated 15 times (num_waves) per vertex. That seems like a lot, but if the vertex shader can handle it, CUDA should be able to as well.
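First, the mapping. It follows the standard cudaD3D9 register/map/get-pointer pattern; this is a simplified sketch rather than an exact paste, and ocean_vb, d3d9_device, d_waves and the grid dimensions are placeholder names:

#include <d3d9.h>
#include <cuda_d3d9_interop.h>

// ocean_vb:    the IDirect3DVertexBuffer9* holding the ocean grid (placeholder name)
// d3d9_device: the IDirect3DDevice9* used for rendering (placeholder name)

// One-time setup: point CUDA at the D3D9 device and register the vertex buffer
cudaD3D9SetDirect3DDevice(d3d9_device);
cudaD3D9RegisterResource(ocean_vb, cudaD3D9RegisterFlagsNone);

// Every frame: map the buffer, get a device pointer, launch the kernel, unmap
cudaD3D9MapResources(1, (IDirect3DResource9 **)&ocean_vb);

myfloat6 *vb_ptr = NULL;
cudaD3D9ResourceGetMappedPointer((void **)&vb_ptr, ocean_vb, 0, 0);

// d_waves: device copy of the Wave array (placeholder name)
calculate_ocean(vb_ptr, grid_width, grid_height, elapsed_time, cam_x, cam_z, d_waves, 15);

cudaD3D9UnmapResources(1, (IDirect3DResource9 **)&ocean_vb);

And here is the CUDA file itself: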

#include <stdlib.h>
#include <stdio.h>
#include <cutil_math.h>

// Round a / b up to the nearest integer value
int cuda_iDivUp(int a, int b)
{
    return (a + (b - 1)) / b;
}

// Per-vertex layout: position (x, y, z), normal (a, b, c), uv
struct myfloat6
{
    float x, y, z;
    float a, b, c;
    float u, v;
};

struct myfloat3
{
    float x, y, z;
};

struct Wave
{
    float n;
    float K0;
    float amp;
    float omega;
    float directionX;
    float directionY;
    float random;
};

// Small vector helpers (not used by the kernel below)
float dot_product_2d(float2 a, float2 b)
{
    return (a.x * b.x + a.y * b.y);
}

float dot_product_3d(float3 a, float3 b)
{
    return (a.x * b.x + a.y * b.y + a.z * b.z);
}

float length_3d(float3 a)
{
    return sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
}

// CUDA kernel: modify the vertex buffer in place
__global__ void cuda_kernel_ocean_height(myfloat6 *vb, int width, int height, float time, float camera_offsetX, float camera_offsetZ, Wave *waves, int num_waves)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float2 p2;
    //float btnTemp = 0.0f;
    float3 normal;
    normal.x = 0.0f;
    normal.y = 1.0f;
    normal.z = 0.0f;

    // Offset the position we're working on by the camera
    p2.x = vb[y * width + x].x + camera_offsetX;
    p2.y = vb[y * width + x].z + camera_offsetZ;

    float h = 0;
    float arg;
    float sarg;

    // Sum the contribution of each wave at this point
    for (int i = 0; i < num_waves; i++)
    {
        arg = (p2.x * waves[i].directionX + p2.y * waves[i].directionY) * waves[i].K0 - (time * waves[i].omega + waves[i].random);
        sarg = sin(arg);
        //float carg = sin(arg);
        h += waves[i].amp * sarg;

        //// Calculate the normal.
        //btnTemp = waves[i].directionX * waves[i].K0 * waves[i].amp;
        //btnTemp = btnTemp * carg;
        //normal.x += btnTemp;
        //btnTemp = waves[i].K0 * waves[i].amp;
        //btnTemp = btnTemp * sarg;
        //normal.y += btnTemp;
        //btnTemp = waves[i].directionY * waves[i].K0 * waves[i].amp;
        //btnTemp = btnTemp * carg;
        //normal.z += btnTemp;
    }

    // Set the new height of the point
    vb[y * width + x].y = h;

    //normal.x *= -1.0f;
    //normal.y = 1.0f - normal.y;
    //normal.z *= -1.0f;
    //normal = normalize(normal);

    // Set the normal in the vertex buffer
    vb[y * width + x].a = normal.x;
    vb[y * width + x].b = normal.y;
    vb[y * width + x].c = normal.z;
}

// Host-side launcher for the wave calculation
extern "C"
void calculate_ocean(myfloat6 *vb, int width, int height, float time, float camera_offsetX, float camera_offsetZ, Wave *waves, int num_waves)
{
    // Block dimensions for 256 threads
    dim3 Db = dim3(16, 16, 1);
    // Grid dimensions, rounded up to cover the whole vertex grid
    dim3 Dg(cuda_iDivUp(width, Db.x), cuda_iDivUp(height, Db.y), 1);

    cuda_kernel_ocean_height<<<Dg, Db>>>(vb, width, height, time, camera_offsetX, camera_offsetZ, waves, num_waves);
}

For width and height I'm passing in the square root of the number of verts in my ocean grid.
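To put numbers on that (assuming a roughly square grid, so these are approximate):

sqrt(611,806) is about 782, so width = height = ~782
cuda_iDivUp(782, 16) = 49, so Dg comes out to 49 x 49 = 2,401 blocks
2,401 blocks x 256 threads = ~614k threads covering ~612k verts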

Does anyone have any ideas about what I'm doing wrong, or is there any information I left out? I appreciate any help I can get.