CUDA on 15" MacBook Pro (disappointing)

Double precision floating point (DPFP) operations are critical for scientific programming.

The Nvidia GT 750M card in the 15" MacBook Pro Retina has poor DPFP performance. It is difficult to run even a basic CUDA test case that uses DPFP operations on the GT 750M.

The Nvidia GT 750M card offers about 21 GFlops in double precision, while the host i7 in the 15" MacBook Pro offers about 120 GFlops. See http://tinyurl.com/cuda-on-mac for details.

It would be of great help if some forum members could suggest ways to boost its DPFP performance, just for development purposes (e.g. checking speedup, debugging a test case, etc.). Also, are there any recommendations for a mobile CUDA development unit for scientific programming?

Note that you get your CPU’s 120 GFlops in double precision only when fully exploiting SSE2 or AVX(2) on the CPU and doing multithreading properly.

A lot of scientific problems can also be adequately solved in single precision, sometimes requiring a bit of adaptation in the places where rounding errors have the most impact (Kahan summation, etc.).
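To make the Kahan summation idea concrete, here is a minimal host-side sketch in Python rather than CUDA. It assumes single precision can be emulated by rounding every intermediate result to binary32 with the standard struct module. The helper f32 and the function names are made up for this illustration and are not part of any CUDA library.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Plain running sum, rounded to single precision after every add."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def kahan_sum_f32(values):
    """Kahan (compensated) summation using only single precision operations."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)           # apply the correction to the next term
        t = f32(total + y)               # low-order bits of y may be lost here
        comp = f32(f32(t - total) - y)   # recover what was lost in that add
        total = t
    return total


# Hypothetical test data: the same small value added many times.
values = [0.1] * 100_000
reference = 100_000 * f32(0.1)  # reference computed in double precision

naive_err = abs(naive_sum_f32(values) - reference)
kahan_err = abs(kahan_sum_f32(values) - reference)
print("naive error:", naive_err)
print("kahan error:", kahan_err)
```

In a CUDA kernel the same pattern would be written with float variables. Aggressive floating point optimisation settings can reorder the arithmetic and cancel the compensation term, so that is worth checking for whatever compiler you use.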

Also consider using special math libraries (sometimes called double-single, or DSMath), which combine two single precision floats to get precision just a bit shy of a 64-bit (double) float. These functions have been posted on the forums previously, and I believe Nvidia is now also providing double-single precision code through their developer download channels.
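As a rough illustration of the double-single idea, the sketch below stores a value as a (hi, lo) pair of single precision numbers and adds two such pairs using the classic error-free two-sum trick. This is not the DSMath code from the forums or anything shipped by Nvidia. The names f32, two_sum, ds_from_float, ds_add and ds_to_float are hypothetical, and single precision is again emulated on the host by rounding through the struct module. Two 24-bit significands give roughly 48 bits, compared with 53 bits for a true double.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and err the rounding error of that add."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def ds_from_float(x):
    """Split a double into a (hi, lo) pair of single precision values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def ds_add(x, y):
    """Add two double-single numbers, each given as a (hi, lo) tuple."""
    s, e = two_sum(x[0], y[0])        # exact sum of the high parts
    e = f32(e + f32(x[1] + y[1]))     # fold in the low parts
    hi = f32(s + e)                   # renormalise so that |lo| is tiny
    lo = f32(e - f32(hi - s))
    return hi, lo


def ds_to_float(x):
    """Convert a (hi, lo) pair back to a double for inspection."""
    return x[0] + x[1]


# Adding 1e-10 to 1.0 is lost entirely in plain single precision...
plain = f32(1.0 + f32(1e-10))
# ...but survives in the low word of a double-single pair.
pair = ds_add(ds_from_float(1.0), ds_from_float(1e-10))
print("plain single :", plain)
print("double-single:", ds_to_float(pair))
```

Multiplication and division follow the same pattern but need a few more operations each, which is why double-single code only pays off when native double precision is much slower than single.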

One forum member once noted he was able to get a very decent speedup by first getting a coarse solution in single precision, and then doing a few more Newton-Raphson iterations with the (much slower) double precision on consumer cards to get to the exact solution.
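The coarse-then-refine approach can be sketched on a toy problem. The example below is only an illustration, not that forum member's code. It solves x*x - 2 = 0 with Newton-Raphson, first with every operation rounded to single precision (emulated on the host via the struct module) and then with two more iterations in double precision. The function names and the iteration counts are arbitrary choices for this example.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def newton_single(x0, iters):
    """Newton-Raphson for x*x - 2 = 0 with every operation rounded to single."""
    x = f32(x0)
    for _ in range(iters):
        fx = f32(f32(x * x) - 2.0)
        dfx = f32(2.0 * x)
        x = f32(x - f32(fx / dfx))
    return x


def newton_double(x0, iters):
    """The same iteration carried out in full double precision."""
    x = x0
    for _ in range(iters):
        x = x - (x * x - 2.0) / (2.0 * x)
    return x


# Cheap phase: get as close as single precision allows.
coarse = newton_single(1.0, 6)
# Expensive phase: only a couple of double precision steps are needed,
# because Newton-Raphson roughly squares the error at each step near the root.
refined = newton_double(coarse, 2)

coarse_err = abs(coarse - math.sqrt(2.0))
refined_err = abs(refined - math.sqrt(2.0))
print("error after single precision phase:", coarse_err)
print("error after double precision phase:", refined_err)
```

The same split applies to larger problems: do most of the work where single precision is fast, and spend the slow double precision throughput only on the last few iterations.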

…also perhaps focus on the scalability of your code

oftentimes it is not that difficult to write scalable code, and with scalable code you should have little problem developing on a ‘mobile platform’

it’s just hard to test scalability on a development system if you
a) don’t have a multi-GPU system
b) only have a built-in GPU with about 4 SMXes

unless you have a test rig to take your code to… developing for scalability is not going to work ;-)

i was thinking of scaling from a device with few(er) SMs to a (real, practical, non-mobile) device with more SMs…