Can I use a GPU for a high-throughput MDS code via PGI?


I am writing because asking around locally has given me some very conflicting information. I don’t have any experience with GPU coding so far, and we are thinking of following this path, so I think this is the best place to get feedback from experienced users. I have written a Fortran 90 MDS code which I am using to simulate a thermodynamic system. The code took years to develop, and we have realised that it has more potential to be explored over the next year or two. We were initially running on a large CPU cluster with more than 3k cores, but this is no longer accessible. The code is not RAM heavy on a CPU (only a couple of MB) and the run time is relatively short (a couple of hours, and at most a couple of days for the most demanding cases).

I need a high-throughput platform (i.e. several tens of thousands of different runs) in order to collect enough statistics and extrapolate the macroscopic behavior of my simulated system. We need the runs to finish in a relatively short interval so that we can adjust the study according to the results obtained (hence running for months at a time before receiving any results is not realistic). I need to run the same small MDS code many times without any real parallelism inside the code, so we thought that running it on a GPU platform might be an easy and cost-effective way to do the job, given that we would not be getting into the specifics of real parallelism (timing the loops, distributing memory, etc.), which we do not currently have the time to explore. Each instance of the code needs to run self-contained and then return a file of the order of 7-10 MB in which all of the results are recorded. There is a possibility of using 6 x Tesla K80 GPU cards, the latest model, under the funding agreement that we have in place. The equivalent funds would only get us something of the order of 100 CPU cores, which would not do the job (we estimate that we need the equivalent of at least 3-4k CPU cores to get something useful out of the code, but we do not have the resources for that).

If you have any experience with these types of simulations, can you please advise whether this is something that could be done on a GPU platform using PGI to compile the Fortran 90 code? I understand that we would still need to make modifications to the current code, but these should be small for what we need it to do. Is it possible to run a self-contained code several times on each of the GPU cores (or the equivalent of a core) of the Tesla system and get these results? Memory-wise, I calculated that there will be enough memory to use at least half of the GPU cores available, which is still orders of magnitude more simultaneous runs than the 100 CPU cores that we could buy with the given funds.

I will be looking forward to receiving your responses.

Hi antss85,

If your code isn’t parallel at all, then using a GPU probably isn’t the right way to go. While a GPU has lots of cores, they are relatively lightweight compared to a CPU core. The performance comes from being able to parallelize the code across the cores. Also, you can’t run a binary on a single core. At best you could have a binary use one multiprocessor (192 cores per multiprocessor, 13 multiprocessors per GPU, with two GPUs per K80), but using only 1 out of 192 cores would waste a lot of resources.
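To put rough numbers on that, here is a small sketch of the arithmetic in C. It only multiplies out the K80 figures above; the count of 6 cards is taken from your post, and "one serial run per multiprocessor" is a best-case assumption for illustration, not a measured result. The function names are made up for this example.

```c
#include <stdio.h>

/* Figures quoted above for a Tesla K80. */
enum {
    CORES_PER_SM  = 192, /* CUDA cores per multiprocessor */
    SMS_PER_GPU   = 13,  /* multiprocessors per GPU */
    GPUS_PER_CARD = 2    /* GPUs on one K80 card */
};

/* Total CUDA cores across `cards` K80 cards. */
int total_cores(int cards) {
    return cards * GPUS_PER_CARD * SMS_PER_GPU * CORES_PER_SM;
}

/* Assumed best case for serial code: one independent run per
   multiprocessor, with the other 191 cores of each one idle. */
int serial_slots(int cards) {
    return cards * GPUS_PER_CARD * SMS_PER_GPU;
}

/* Print both counts side by side for a given number of cards. */
void print_summary(int cards) {
    printf("%d cards: %d cores, about %d serial run slots\n",
           cards, total_cores(cards), serial_slots(cards));
}
```

For 6 cards this gives 29952 cores in total but only 156 multiprocessors, so for purely serial runs the number of simultaneous jobs would be closer to the 100 CPU cores you mention than to the raw core count, and each of those jobs would run on a much slower core.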

What works well on a GPU is an algorithm with high compute intensity and data parallelism. Think of a loop with a high trip count, lots of floating-point computation, and iterations that can each be computed independently of the others. Granted, this is the ideal, but other algorithms can achieve good performance as well, just not to the same degree.
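As a rough illustration of that kind of loop, here is a minimal OpenACC sketch in C. The function `compute_all`, the constant `TERMS`, and the arithmetic in the loop body are all made up for this example and are only a stand-in for real floating-point work; they are not taken from your code. The point is the shape: the outer loop has a high trip count and no iteration reads anything written by another.

```c
#include <stddef.h>

#define TERMS 100 /* hypothetical amount of work per element */

/* Fill out[i] from in[i] only, so every outer iteration is independent
   and can be mapped to its own GPU thread. */
void compute_all(const double *in, double *out, int n) {
    /* Copy `in` to the device, run the loop there, copy `out` back. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        double acc = 0.0; /* private to this iteration */
        /* Placeholder for the real per-element computation. */
        #pragma acc loop seq
        for (int k = 1; k <= TERMS; ++k) {
            acc += in[i] * k;
        }
        out[i] = acc;
    }
}
```

A compiler without OpenACC support simply ignores the directives and runs the loop serially, which makes it easy to check results on the CPU first. With the PGI compilers, something like `pgcc -acc -Minfo=accel` would build it for the GPU and report what was offloaded. In Fortran the same loop would be wrapped in `!$acc parallel loop` and `!$acc end parallel loop`.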

If your program could exploit at least some degree of data parallelism, then I think you might have a chance of seeing good performance. Otherwise, you may want to look at other options.

Of course, I don’t know your program, so it’s difficult to give good advice. What I would advise is to try OpenACC and see whether it’s the right fit or not. You can download the OpenACC Toolkit from NVIDIA, which includes the PGI compilers, examples, and a free 90-day license for commercial use or a one-year renewable license for academics.

  • Mat