Real-time Mandelbrot - My first CUDA program

UPDATED: I replaced the texture bind/unbind code with a call to cudaThreadSynchronize(). I also removed the software double precision math functions, which are not needed on the CPU, and converted the CPU version of the code to use native double precision.

UPDATED: There was a bug in the high precision multiply. Either the G80 has a precision issue in its multiply-add instruction or I did not translate the original Fortran code correctly. I used an alternate version of the dsmul code to solve the problem. You can now zoom in much further than before.

UPDATED: I implemented double precision math functions to increase the maximum useful zoom factor. You will notice a big drop in the frame rate when the double precision math kicks in.

UPDATED: I unrolled the Mandelbrot loop and got a 17% speed increase.

Hi all. I am just starting CUDA programming and wrote a simple Mandelbrot program as part of my learning process. I am submitting the project files for anyone to play around with. In a timing test it ran 85.5 times faster on an 8800 GTX GPU than on a 2 GHz AMD Opteron processor.

I have attached the project files. You can build them by unzipping the folder into the CUDA SDK projects folder. The program renders the Mandelbrot set at about 60 frames per second and uses adaptive sampling to anti-alias the image. When not animated, it performs 128 passes of full-frame anti-aliasing. You can randomize the color palette with the ‘c’ or ‘C’ keys and animate the colors with the ‘a’ or ‘A’ keys. Scrolling and zooming are performed by dragging with the left and right mouse buttons respectively. You can also use the ‘d’ and ‘D’ keys to increase or decrease detail.
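For anyone curious before downloading, the heart of the program is just the escape-time iteration. The sketch below is a simplified illustration rather than the exact kernel in the attachment; the kernel name, parameter list, and plain iteration-count output are placeholders, and the real code adds the color palette, the adaptive sampling, and the hand-unrolled inner loop mentioned above:

```
// Minimal escape-time Mandelbrot kernel (illustrative sketch, one thread per pixel).
// d_out receives the iteration count for each pixel; coloring happens elsewhere.
__global__ void mandelbrotKernel(int *d_out, int width, int height,
                                 float xMin, float yMin, float scale, int maxIter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height)
        return;

    // Map the pixel to a point c in the complex plane.
    float cx = xMin + px * scale;
    float cy = yMin + py * scale;

    // Iterate z = z*z + c until |z|^2 > 4 or the iteration cap is hit.
    float zx = 0.0f, zy = 0.0f;
    int i;
    for (i = 0; i < maxIter; i++) {
        float zx2 = zx * zx;
        float zy2 = zy * zy;
        if (zx2 + zy2 > 4.0f)
            break;
        zy = 2.0f * zx * zy + cy;
        zx = zx2 - zy2 + cx;
    }

    d_out[py * width + px] = i;
}
```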

Enjoy!

Mark Granger
New Tek

Very nice, thanks for posting the demo!

It would be cool to add multiple GPU support, with each card calculating a subset of the image.

So, it should be a small step from this to a fully CUDA-accelerated version of Lightwave? :)

I was thinking about what Tesla could do with this program, but I am more interested in what can be done on the G90 boards. If, as we suspect, the G90 has native support for double precision math and increases the number of multiprocessors per chip, it should be able to compute the Mandelbrot set a lot faster and with a lot more precision. There are also a lot of optimizations that could be implemented; the code currently takes a brute force approach rather than using smarter algorithms.

Mark Granger
New Tek

I had a similar issue porting over the dsfun90 implementation of double precision operations. Initially I used the code path that assumed a MAD stage with no intermediate round-off. However, the CUDA Programming Guide (pg 67) notes that a multiply followed by an add is often combined into a single multiply-add (FMAD) instruction, which truncates the intermediate result of the multiplication.

Then I switched to the slower code path that does not assume a MAD instruction, and everything was fine.
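For anyone who hits the same thing, the slower path is essentially the dsfun90 multiply rewritten around a Dekker-style split, which makes every partial product exactly representable, so a truncating MAD can no longer do any damage. A rough CUDA sketch follows; the float2 packing and the split constant are how I remember the double-single code, so treat it as illustrative rather than a verified drop-in:

```
// Double-single multiply: (a.x + a.y) * (b.x + b.y), returned as a high/low float pair.
// Written so that no step depends on an exact multiply-add: the Dekker split makes
// each partial product exact, so a truncated intermediate changes nothing.
__device__ float2 dsmul(float2 a, float2 b)
{
    // Split the high words of a and b into upper and lower halves.
    float cona = a.x * 8193.0f;
    float conb = b.x * 8193.0f;
    float sa1 = cona - (cona - a.x);
    float sb1 = conb - (conb - b.x);
    float sa2 = a.x - sa1;
    float sb2 = b.x - sb1;

    // Multiply a.x * b.x exactly using Dekker's method.
    float c11 = a.x * b.x;
    float c21 = (((sa1 * sb1 - c11) + sa1 * sb2) + sa2 * sb1) + sa2 * sb2;

    // Cross terms a.x * b.y + a.y * b.x (only the high-order word is needed).
    float c2 = a.x * b.y + a.y * b.x;

    // Combine (c11, c21) with c2 using Knuth's two-sum, folding in the low-order product.
    float t1 = c11 + c2;
    float e  = t1 - c11;
    float t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a.y * b.y;

    // Renormalize into high and low words.
    float2 c;
    c.x = t1 + t2;
    c.y = t2 - (c.x - t1);
    return c;
}
```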

Incidentally, if anyone is looking for a CUDA-related development project, a full port of all the functions in the dsfun90 library to CUDA would get you the unending gratitude of many current and future developers. :) (The transcendentals will be tricky, so this is more than just a weekend project…)

I’ve found on two projects now that a judicious mixture of single precision and software emulated double precision can boost the final accuracy of a calculation without a significant performance penalty. If you are already memory I/O bound, there might be no speed change at all!
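To make that mixture concrete, here is a hedged sketch; the names follow the dsfun90 conventions, but the accumulation routine is invented for the example. The per-element work stays in plain float, and only the running sum is carried as an emulated double-single pair built from the two-sum:

```
// Double-single add: (a.x + a.y) + (b.x + b.y), using Knuth's two-sum (dsfun90 dsadd).
__device__ float2 dsadd(float2 a, float2 b)
{
    float t1 = a.x + b.x;
    float e  = t1 - a.x;
    float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

    float2 c;
    c.x = t1 + t2;
    c.y = t2 - (c.x - t1);
    return c;
}

// Example of the single/double-single mixture: cheap float math per element,
// accurate accumulation of the running sum.
__device__ float2 sumOfSquares(const float *v, int n)
{
    float2 sum = make_float2(0.0f, 0.0f);
    for (int i = 0; i < n; i++) {
        float term = v[i] * v[i];                    // ordinary single precision work
        sum = dsadd(sum, make_float2(term, 0.0f));   // emulated double precision accumulation
    }
    return sum;
}
```

The extra cost is a handful of single precision adds per accumulation step, which is easy to hide if the kernel is already waiting on memory.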

I think it would be cool if NVIDIA could build double precision math support into their CUDA compiler. This would be very helpful for code portability when their next generation GPUs, which support double precision natively, hit the market.

I had contemplated putting in the effort to port the dsfun90 package to CUDA, but have held off because the NVCC compiler doesn’t have #pragma support to prevent the optimizer from rearranging sensitive sections of code. This can also be an issue with things like Kahan summation and other algorithms that require that certain critical sections of code not be optimized away by the compiler. Some months back I filed a feature request asking for #pragma constructs for this purpose, but I don’t know where that stands.

I believe one complication with implementing #pragmas in the CUDA toolchain is that they would have to be communicated all the way to the PTX assembler and back-end optimizer, and NVIDIA may not have the compiler infrastructure in place for doing such things yet. In any case, when the compiler matures a bit more and we have finer-grained control over the optimizer through #pragma directives and the like, I would be happy to put serious effort into dsfun90 and similarly sensitive floating point library code. The dsfun90 functions will be useful even when we have double-precision hardware, since some codes can still benefit from even better precision!! :-)
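To make the Kahan example concrete, here is the standard compensated summation written as a device function. The correction line is algebraically zero, which is exactly the kind of expression an aggressive optimizer is tempted to throw away, and why some #pragma-style fence around it would be valuable:

```
// Kahan compensated summation. The correction term c recovers the low-order bits
// lost in each addition; if the compiler "simplifies" (t - sum) - y to zero,
// the algorithm silently degrades to a plain sum.
__device__ float kahanSum(const float *v, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;           // running compensation
    for (int i = 0; i < n; i++) {
        float y = v[i] - c;
        float t = sum + y;
        c = (t - sum) - y;      // algebraically zero, numerically the lost low-order part
        sum = t;
    }
    return sum;
}
```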

Cheers,

John Stone

Mark - would you mind if we included this code in the next release of our SDK? We’ll give you full credit, of course.

BTW, the texture bind/unbind code you have is unnecessary since you aren’t actually doing any texturing in the kernel. If you take this out, you need to add a cudaThreadSynchronize() call so that the timing is correct.
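To spell out why: kernel launches are asynchronous, so without a synchronize the host timer only measures the launch overhead, not the kernel itself. A minimal host-side sketch, using the illustrative kernel from earlier in the thread as a stand-in (grid, block, and the kernel arguments are assumed to come from the surrounding setup code, and clock() just stands in for whatever timer the sample already uses):

```
clock_t start = clock();
mandelbrotKernel<<<grid, block>>>(d_out, width, height, xMin, yMin, scale, maxIter);
cudaThreadSynchronize();    // block until the GPU has actually finished the kernel
double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
printf("kernel time: %.2f ms\n", ms);
```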

Please do! I would be honored.

I suspected that I might not need the texture bind/unbind code. I started with one of the other SDK examples, which needed to do that. I have updated the code.

-Mark Granger