advice needed by a PhD student


I am an Economics graduate student working on a computationally intensive project. Normally I have access to a cluster and use MPI, but for this project I am restricted to a single computer with no internet connection due to data privacy agreements.

My project requires likelihood estimation of a dynamic programming model. This involves an outer loop that searches for the maximum of a (numerical, not analytical) function, and an inner loop that constructs that function by solving a series of Bellman equations via backwards recursion over a very large set of values.

People normally use a cluster with MPI to handle the computational burden of this problem, but being stuck on a dual-core desktop, I am looking for a way out of waiting a month for a single run to finish. The problem is that I have to do this a number of times, so speed is of the essence.

Then I heard about CUDA and thought, “here is my ticket out of this.” It also seems like a great investment for the future. But there are some problems: (1) I don’t know C/C++, just Fortran, MATLAB, etc. (2) I am very poor and don’t have a grant to buy a $5,000 machine. I don’t think I can afford a Tesla; maybe a $300-400 GTX XXX. Nor can I purchase the PGI CUDA Fortran compiler. (3) I don’t have much time to learn and do things.

I (desperately) need some advice on the feasibility of doing this via CUDA. Here are my questions; anything you can chime in on, I’d be grateful:

1- The jump from Fortran to CUDA C/C++, and from CPU to GPU computing: how hard is this? I am especially thinking about the engineering of a hybrid code that is optimal on a multicore/GPU machine, for someone who has written only 2k-3k lines of code at most.
2- Is the PGI Fortran route better? I don’t have the money to buy it, but maybe I can beg them or my faculty for some help, if it is worth it.
3- What kind of GPU should I get? It seems not many GeForces have a good amount of memory. I eyed a dual card, the GTX 460 2Win, for $350 or so, with lots of cores and 2 GB of RAM. Is this a good idea? I might also need to buy a cheap card like the GT 420 for my home computer to test stuff. Are these good enough to learn CUDA on and run smaller versions of my code?
4- Any idea what performance gain I can expect? If I went through the whole thing and only got a 1.2x speedup, that would be a shame, but even 2x is very welcome.
5- Any other advice, sources to get started?

thanks for listening.


EDIT: I have also noticed there are cheaper cards, GT 420s and GT 520s, and GTX 460s with 2 GB of RAM on a single card. The GTX 460 2Win seems to have 1 GB per GPU; is this memory shared between the two, or does each GPU have access only to its own 1 GB? Any ideas?


I think you can get away with a powerful gaming card. They now come with up to 3 GB of RAM, and they have the same number of cores and the same speed as the Tesla cards. A two-card setup has additional issues, though it depends on the application. I would recommend a single-GPU setup; fewer headaches.

I recommend using CUDA C. You might find that PGI Fortran requires so much learning that you would spend the same amount of time anyway. CUDA C is quite simple and gives you more freedom. If you already have the code in Fortran, it will not be difficult to convert to CUDA C.

The speedup depends on the application. Post some Fortran code here that reflects your problem, and people can say more about the possible speedup and suggest ways to achieve it. I am no expert in CUDA, but I can help you get started in CUDA C.

I also usually write programs in Fortran, but for CUDA I decided to use C.

Thanks for your reply. I am now more inclined toward a single-GPU setup as well, because of the power requirements of dual cards; maybe I can get a 2 GB GTX 460. The 3 GB cards are around $500 and up, out of my range. People say more RAM on cheaper cards is a marketing gimmick, but it might just help me out.

My code will depend on the final model I will try to estimate, so I don’t have it in fortran right now. But just to give you an idea of something I will probably have to do:

  • Evaluate a non-analytical function that involves Monte Carlo integration (so I need to generate Gaussian random variables), let’s say a quintuple integral, tens of thousands of times with different arguments, then find the argmax of those function evaluations (like picking the maximum of a vector; I am so used to doing something simple like this in Fortran or MATLAB that I am freaking out at the prospect of writing this subroutine myself in a language I don’t know). This all sits inside a loop, and I will have to repeat it until I converge to the maximum of another function that comes from what happened in the loop.

Maybe it is better to hold off until I can at least write pseudocode for this. I have heard that porting CPU code to GPU is not efficient, so I want to build this from the ground up, as much as I can at least.

Meanwhile, any sources to read, examples to get started with (other than the results of an obvious Google search), or free numerical libraries that use CUDA to ease my coding burden would be much appreciated.


Hey, you should consider looking at the GTX 560 Ti 2GB by EVGA. It’s only a little more expensive than the 460 but has more cores and power.


Programming on 2 cards brings its own problems. C and Fortran are in almost one-to-one correspondence. If you know how to parallelize the program with MPI, you will be able to make it run on a GPU as well. In practice, the part that you split up in an MPI code will be split up on the GPU, but now with a much larger number of processes running at the same time. I recommend this book. It is easy to read and has many examples. The first chapter shows the basics and some quick optimizations, and at the end there is a chapter on how to program for multiple GPUs.

Thank you all for your answers. I already have a GTS 250 with an older compute capability, so maybe I can start working on simple stuff; the problem is that that card does not do double precision. I am going to apply for a $750 research fund; if I get it, maybe I can buy a fancier card. If not, I will have to stick with a 460/560 with 2 GB of RAM.

Is this board suitable for getting help on programming once I get going with the complicated programs, or is there a mailing list/forum I can get help from?

In C single precision is called “float” while double precision is called “double” (there is no “real” type as in Fortran). The literal 1.0, for example, is automatically treated as double precision, while 1.0f marks it as single precision. It does not change the program too much, and the examples were meant to run on all cards, since many laptop cards do not support double precision. Math functions like log and sqrt are double precision, while their single-precision versions are called logf and sqrtf.

Here is a list of all the supported GPUs. The compute capability is important: for double precision you need 1.3 or higher (CUDA GPUs - Compute Capability | NVIDIA Developer). The two cards support up to 2.1, so they are OK. I am not sure about the price/performance ratio, but in some cases 2 cheaper cards might cost less than 1 top-end card while giving more flops (theoretical peak).
You can see a comparison of cards here: List of Nvidia graphics processing units - Wikipedia.

The 250 is compute capability 1.1, with no double precision.

One more thing. If you run intensive CUDA work, you will not be able to use the same card for normal desktop operations, since it will be busy with calculations and the display will become unresponsive. I recommend having one light card for the display plus the card(s) for computations.

Here is a simple example of how to parallelize a double loop:

for (int i = 0; i < 128; i++)
    for (int j = 0; j < 1024; j++)
        C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];

(for is the C loop equivalent of do in Fortran.)

Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}


Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);


Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition in parallel. Note that if the input data size were not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives you enough insight to continue on your own.

(Ripped off from here c++ - CUDA how to get grid, block, thread size and parallalize non square matrix calculation - Stack Overflow)

Ripped off or not, thanks a lot for this. I am torn between this and building a cheap cluster (with the cheapest AMD quad-cores, no HDD, monitor, etc.) in addition to my current computers, which would cost almost the same as a GPU plus a power supply upgrade. There is some opposition to this from my thesis advisor; he thinks it is too good to be true, and the cluster route involves too much setup, but then I am back in the MPI/Fortran world. Decisions, decisions, decisions… :)

I wish you the best of luck. I’m an undergrad doing research for a professor using CUDA. One of the reasons he’s letting me do it instead of his Masters students could be because of a similar attitude he gets from his peers (I don’t pay much attention to departmental politics. Not that I’m privy anyway.) In fact, I wound up working 3 jobs over the summer to buy a 560 Ti card (along with the rest of a custom rig) just to do the research.

All I can say is, from my experience writing a “proof of concept” code (extremely basic, no atomics/streams/constant mem/shared mem), the results were promising even when extrapolating them conservatively.

You may want to first download the CUDA software and experiment with it using the emulator mode, just to get a feel.
As for MC simulation, you may start from the Mersenne Twister for pseudo random number generation, which was in the CUDA SDK.
Best of luck with your project.

Just FYI, there is no longer an “emulator mode” in CUDA. If you want to run CUDA code without a device, you have to use Ocelot (Google Code Archive - Long-term storage for Google Code Project Hosting.), which can be a little tricky. It is easier just to find a cheap GeForce card instead.

Also note that GeForce cards give you 1/8 the performance in double precision (compared to single precision). If you have to use double, I’m not sure how much good a GPU will do you unless you can spring for a Tesla. It really depends on your CPU, but using multi-core + SSE can get you almost as fast. If you can get away with single precision, you can start learning on the GTS 250 and upgrade from there (newer cards are faster and have caches and a few other differences, but the basics are the same).

On the Fermi platform, double-precision programs lose only half the performance compared to single precision. These are all the 4xx and 5xx desktop cards. There are enough differences between the cards to make a difference for optimization. Better to start with the proper card, I say, if you decide to go with CUDA.

This isn’t true. Only Tesla (and Quadro) cards can process double precision at half the single precision rate. GeForce cards have double precision speed-capped as laughingrice mentioned.

Even if you need double precision, I think it would be useful to first learn/test CUDA on a cheap card before paying for a Tesla.

And if you eventually want to see the speed improvement with Tesla, Amazon EC2 will let you run your code on a system with two Tesla M2050 cards for ~ US$0.75 per hour (spot price).

I am sorry, but I must say there is a 90% chance CUDA is not for you. GPU programming is for professional, skillful programmers with optimization experience. For example, you simply cannot debug your program the normal way; how would you develop then? And that is the smallest problem. It would be much better to try to optimize your CPU code. First, what compiler are you using? What data types do you need: float, double? Is your program properly multithreaded? And of course AMD is not an option here.

Hello. Thanks for correcting me; I had the wrong impression. I just read the note on Wikipedia which says that the consumer products have certain features turned off. I apologize for the bad information.

Well, for some applications, you want professional programmers, but “amateurs” can get useful work done with CUDA. I learned CUDA programming as a Ph.D. physics student, and I’ve taught CUDA programming to undergraduate and graduate physics students. We’re not implementing really novel parallel algorithms, but it still helps us get work done faster than a pure CPU implementation.

Actually, I find that CUDA makes it easier for inexperienced programmers to make efficient code. I seldom get anywhere near peak efficiency on a CPU, but I can often get within a factor of 2 or 3 of the maximum throughput (either memory or compute) on a CUDA device.

To be successful with CUDA as a non-professional, you do need to be selective about how you apply the technology. Not all problems have an easy solution with CUDA, and you have to do a little cost-benefit analysis to decide where your development effort is best allocated. In the 4 years I’ve been using CUDA, I think I have only written 5 non-trivial programs with it. These five programs are very useful to me, and I use at least one of them every day, but compared to all the code I’ve written over that time, CUDA is not the dominant component.