I have a simulation in which I render an ocean surface similar to the oceanFFT demo, except mine is much larger. I originally wrote a vertex shader that does the calculations for a grid of 611,806 vertices, and it runs reasonably well: roughly 75 frames per second. I am now porting the same code to CUDA, expecting a performance gain (or at worst the same performance), but I am getting much worse results, less than 20 frames per second. Everything I've read suggests that anything a shader can do, CUDA can do better, so I'm sure the flaw lies in my implementation.
The graphics card I'm using is a 9800 GTX+, and I'm using the D3D interoperability functions to map the vertex buffer and hand it to a CUDA kernel for its calculations. Below is a paste of my CUDA file. It doesn't look too complicated; the only expensive calculation is a sin() call 15 times (num_waves) per vertex. That seems like a lot, but if the shader can handle it, CUDA should be able to as well.
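For context, the per-frame interop flow I use looks roughly like this. This is a sketch from memory using the CUDA 2.x Direct3D 9 interop API, not my exact code: error checking is omitted, and `g_pVB` and `update_ocean` are placeholder names.

```
#include <d3d9.h>
#include <cuda_d3d9_interop.h>

struct myfloat6;
struct Wave;

extern "C" void calculate_ocean(myfloat6 *vb, int width, int height, float time,
                                float camera_offsetX, float camera_offsetZ,
                                Wave *waves, int num_waves);

IDirect3DVertexBuffer9 *g_pVB;   // placeholder: the ocean vertex buffer

void update_ocean(float time, float camX, float camZ, Wave *d_waves, int num_waves,
                  int width, int height)
{
    // The buffer is registered once at startup (not shown):
    // cudaD3D9RegisterResource(g_pVB, cudaD3D9RegisterFlagsNone);

    // Map the buffer and fetch a device pointer CUDA can write through
    cudaD3D9MapResources(1, (IDirect3DResource9 **)&g_pVB);
    myfloat6 *d_vb = 0;
    cudaD3D9ResourceGetMappedPointer((void **)&d_vb, g_pVB, 0, 0);

    // Run the kernel wrapper below on the mapped vertex buffer
    calculate_ocean(d_vb, width, height, time, camX, camZ, d_waves, num_waves);

    // Unmap before Direct3D draws from the buffer again
    cudaD3D9UnmapResources(1, (IDirect3DResource9 **)&g_pVB);
}
```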
#include <stdlib.h>
#include <stdio.h>
#include <cutil_math.h>
//Round a / b to nearest higher integer value
int cuda_iDivUp(int a, int b)
{
return (a + (b - 1)) / b;
}
struct myfloat6
{
float x, y, z;
float a, b, c;
float u, v;
};
struct myfloat3
{
float x, y, z;
};
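One thing I've been wondering about is the memory layout. myfloat6 is 32 bytes (eight floats), and on compute 1.1 hardware like the 9800 GTX+, scalar loads at a 32-byte stride across a half-warp are not coalesced. A hypothetical sketch of what I mean, assuming the struct really is eight tightly packed floats so each vertex can be moved as two float4 vector accesses instead of six scalar ones:

```
// Assumption: myfloat6 is exactly 8 tightly packed floats (x,y,z, a,b,c, u,v),
// so the same buffer can be addressed as an array of float4 pairs.
__global__ void touch_vertex_vec(float4 *vb4, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;
    int idx = (y * width + x) * 2;   // two float4s per vertex
    float4 pos  = vb4[idx];          // x, y, z, a in one 16-byte load
    float4 rest = vb4[idx + 1];      // b, c, u, v in a second load
    pos.y = 0.0f;                    // e.g. write the new height
    vb4[idx]     = pos;
    vb4[idx + 1] = rest;
}
```

I don't know whether this is actually my bottleneck, but it is the kind of access-pattern change the profiler should be able to confirm or rule out.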
struct Wave
{
float n;
float K0;
float amp;
float omega;
float directionX;
float directionY;
float random;
};
// Helper math functions (currently unused by the kernel; marked __device__
// so they can actually be called from device code)
__device__ float dot_product_2d(float2 a, float2 b)
{
return (a.x * b.x + a.y * b.y);
}
__device__ float dot_product_3d(float3 a, float3 b)
{
return (a.x * b.x + a.y * b.y + a.z * b.z);
}
__device__ float length_3d(float3 a)
{
// sqrtf keeps the computation in single precision
return sqrtf(a.x * a.x + a.y * a.y + a.z * a.z);
}
// CUDA modify the buffer
// CUDA kernel: displace each vertex of the grid by a sum of sine waves
__global__ void cuda_kernel_ocean_height(myfloat6 *vb, int width, int height, float time, float camera_offsetX, float camera_offsetZ, Wave *waves, int num_waves)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height)
return;
// Compute the linear index once instead of re-evaluating it on every access
int idx = y * width + x;
float3 normal;
normal.x = 0.0f;
normal.y = 1.0f;
normal.z = 0.0f;
// Offset the position we're working on by the camera
float2 p2;
p2.x = vb[idx].x + camera_offsetX;
p2.y = vb[idx].z + camera_offsetZ;
float h = 0.0f;
for (int i = 0; i < num_waves; i++)
{
float arg = (p2.x * waves[i].directionX + p2.y * waves[i].directionY) * waves[i].K0 - (time * waves[i].omega + waves[i].random);
// Explicit single-precision sinf() so the compiler never falls back to double-precision sin()
float sarg = sinf(arg);
h += waves[i].amp * sarg;
//// Normal calculation (currently disabled; note the derivative term uses cosf, not sin):
//float carg = cosf(arg);
//normal.x += waves[i].directionX * waves[i].K0 * waves[i].amp * carg;
//normal.y += waves[i].K0 * waves[i].amp * sarg;
//normal.z += waves[i].directionY * waves[i].K0 * waves[i].amp * carg;
}
// Set the new height of the point
vb[idx].y = h;
//normal.x *= -1.0f;
//normal.y = 1.0f - normal.y;
//normal.z *= -1.0f;
//normal = normalize(normal);
// Write the normal into the vertex buffer
vb[idx].a = normal.x;
vb[idx].b = normal.y;
vb[idx].c = normal.z;
}
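One variation I've been considering: every thread re-reads the waves array from global memory on every loop iteration, so 15 waves means 15 sets of uncached global reads per vertex on this hardware. Since the array is small and read-only for a frame, it could live in __constant__ memory instead. A hedged sketch of that idea follows; `c_waves` and `MAX_WAVES` are names I made up, and the host would fill the array once per frame with cudaMemcpyToSymbol:

```
#define MAX_WAVES 32
__constant__ Wave c_waves[MAX_WAVES];  // filled from the host each frame:
                                       // cudaMemcpyToSymbol(c_waves, waves, num_waves * sizeof(Wave));

__global__ void cuda_kernel_ocean_height_const(myfloat6 *vb, int width, int height,
                                               float time, float camera_offsetX,
                                               float camera_offsetZ, int num_waves)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;
    int idx = y * width + x;
    float2 p2;
    p2.x = vb[idx].x + camera_offsetX;
    p2.y = vb[idx].z + camera_offsetZ;
    float h = 0.0f;
    for (int i = 0; i < num_waves; i++)
    {
        // All threads read the same c_waves[i], which is the broadcast
        // pattern the constant cache is designed for
        float arg = (p2.x * c_waves[i].directionX + p2.y * c_waves[i].directionY) * c_waves[i].K0
                  - (time * c_waves[i].omega + c_waves[i].random);
        h += c_waves[i].amp * sinf(arg);
    }
    vb[idx].y = h;
}
```

I haven't benchmarked this version yet, so I can't say how much it actually buys back.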
// CUDA Kernel for wave calculation
extern "C"
void calculate_ocean(myfloat6 *vb, int width, int height, float time, float camera_offsetX, float camera_offsetZ, Wave *waves, int num_waves)
{
// block dimensions for 256 threads
dim3 Db = dim3(16, 16, 1);
// grid dimensions, rounds up to build a large enough grid
dim3 Dg(cuda_iDivUp(width, Db.x), cuda_iDivUp(height, Db.y), 1);
cuda_kernel_ocean_height<<<Dg, Db>>>(vb, width, height, time, camera_offsetX, camera_offsetZ, waves, num_waves);
}
For width and height I'm passing in the square root of the number of vertices in my ocean grid.
Does anyone have any ideas about what I'm doing wrong, or is there any information I've left out? I appreciate any help I can get.