I have just started messing around with CUDA and I have a few questions. I’m looking for some feedback to improve my skills.
I’ve modified the simpleGL example from the SDK to show a rough speed comparison between CUDA and sequential CPU computation (simpleGLtest.zip). I’ve created a runCPU function analogous to runCuda; it calls glMapBufferARB on the VBO and simply iterates through the entire array, calculating the sin*cos function. Pressing the ‘c’ key toggles between the two. I’ve also created a function to resize and reallocate the VBO, increasing (= key) or decreasing (- key) the size by powers of 2. I write the frame rate (actually the inverse of the calculation time) to the screen with glutBitmapCharacter, so performance can (hopefully) be observed.
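To make the comparison concrete, here is a minimal sketch of a sequential CPU loop of this kind. It is not the code from the zip: the name `fillMeshCPU` and the `Float4` struct are hypothetical stand-ins, and the wave formula is assumed to match the stock simpleGL kernel.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for CUDA's float4, so the sketch builds without CUDA headers.
struct Float4 { float x, y, z, w; };

// Fill a width x height grid of vertices one at a time.
// Assumed formula: wave height = sin(u*freq + time) * cos(v*freq + time) * 0.5
void fillMeshCPU(Float4* pos, unsigned int width, unsigned int height, float time)
{
    const float freq = 4.0f;
    for (unsigned int y = 0; y < height; ++y) {
        for (unsigned int x = 0; x < width; ++x) {
            // Map grid coordinates to the range [-1, 1).
            float u = x / static_cast<float>(width)  * 2.0f - 1.0f;
            float v = y / static_cast<float>(height) * 2.0f - 1.0f;
            float w = std::sin(u * freq + time) * std::cos(v * freq + time) * 0.5f;
            Float4 p = { u, w, v, 1.0f };
            // Row-major storage: one vertex per grid point.
            pos[static_cast<std::size_t>(y) * width + x] = p;
        }
    }
}
```

In a runCPU-style function, `pos` would be the pointer returned by glMapBufferARB, and the buffer would be unmapped again once the loop finishes.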
Pressing the ‘p’ key toggles drawing in GL_QUADS mode with an index array that I create in host memory. Would drawing performance be better if I created the index buffer on the device with cudaMalloc?
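For reference, a host-side index array for GL_QUADS over a row-major vertex grid could be built roughly as follows. This is a sketch under assumptions, not the code from the zip: `buildQuadIndices` is a hypothetical name, and the winding order is just one reasonable choice.

```cpp
#include <cstddef>
#include <vector>

// Build GL_QUADS indices for a width x height vertex grid stored row-major.
// Each grid cell contributes four indices, listed around the cell's perimeter.
std::vector<unsigned int> buildQuadIndices(unsigned int width, unsigned int height)
{
    std::vector<unsigned int> indices;
    if (width < 2 || height < 2) {
        return indices;  // no complete cell to draw
    }
    indices.reserve(static_cast<std::size_t>(width - 1) * (height - 1) * 4);
    for (unsigned int y = 0; y + 1 < height; ++y) {
        for (unsigned int x = 0; x + 1 < width; ++x) {
            unsigned int i = y * width + x;   // corner of the cell in the current row
            indices.push_back(i);             // current row, column x
            indices.push_back(i + 1);         // current row, column x + 1
            indices.push_back(i + width + 1); // next row, column x + 1
            indices.push_back(i + width);     // next row, column x
        }
    }
    return indices;
}
```

The result would be drawn with glDrawElements(GL_QUADS, count, GL_UNSIGNED_INT, pointer). The sketch uses 32-bit indices because a 512x512 grid has 262,144 vertices, more than a 16-bit index (maximum 65,535) can address.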
I’ve noticed that part of the mesh disappears when drawn in polygon mode at large mesh sizes. Why is only part of the surface drawn at 512x512, and even less at higher resolutions? Am I running out of memory on the device?
CUDA calculation appears roughly equivalent to CPU on my system at mesh size 256x256, and significantly faster at higher resolutions, which is to be expected. However, at lower resolutions, CPU calculation appears to be much faster, while CUDA remains locked at the vsync rate (75 Hz on my machine). Is CUDA somehow intrinsically tied to vsync?
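Since the number being compared is the inverse of the calculation time, here is a minimal sketch of how such a rate could be measured around the calculation step alone. It is not the timing code from the zip: `iterationsPerSecond` is a hypothetical helper, and it assumes a C++11 compiler for `<chrono>`, which is newer than Visual C++ 2005; on that toolchain a Windows timer such as QueryPerformanceCounter would be the substitute.

```cpp
#include <chrono>

// Run `work` `iterations` times and return completed iterations per second,
// measured with a monotonic wall clock. Returns 0 if nothing measurable ran.
template <typename Work>
double iterationsPerSecond(Work work, int iterations)
{
    if (iterations <= 0) {
        return 0.0;
    }
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();  // only the calculation step; no drawing or buffer swap
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (elapsed.count() <= 0.0) {
        return 0.0;  // too fast for the clock to resolve
    }
    return iterations / elapsed.count();
}
```

For the CUDA path, the callable would also need to wait for the kernel to finish (for example with cudaThreadSynchronize() in toolkit 2.0), because a kernel launch returns to the host before the GPU work completes.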
Any feedback whatsoever is appreciated. This is my first foray into CUDA programming, so please feel free to be brutally honest. Any style or technique pointers would also be eagerly received.
My system specs:
Athlon 64 CPU 3500+ 2.21GHz, 512MB RAM, Quadro FX 570 256 MB RAM
Driver 178.28, SDK_2.02.0811.0240, toolkit 2.0
Windows XP, Visual C++ 2005 Express
simpleGLtest.zip (11.3 KB)