Matrix reduction using CUDA Fortran and GPU

It's Fedora Core 14.
I don't remember seeing errors, so should I install again?

What's the difference between PGI Accelerator Fortran and PGI Fortran? Which one is good for compiling CUDA Fortran on Linux?

Dolf

It's Fedora Core 14.
I don't remember seeing errors, so should I install again?

That should be fine. Did you run the install script? Are you running the compilers out of the installed directory, not the directory where you unpacked the distribution package?

If you continue to have trouble, please send a note to PGI Customer Service (trs@pgroup.com). They are much better at diagnosing install issues than I am.

What's the difference between PGI Accelerator Fortran and PGI Fortran? Which one is good for compiling CUDA Fortran on Linux?

The compilers are the same; the difference is that the “Accelerator” license allows you to use the PGI Accelerator features such as CUDA Fortran, CUDA-x86, OpenACC, and the PGI Accelerator Model. PGI Fortran can only target x86-based systems.

  • Mat

Are you running the compilers out of the installed directory, not the directory where you unpacked the distribution package?

I think I found the problem; I was able to compile successfully in the folder /opt/pgi.
Thanks, Mat.

So now, if I have a .f90 file with CUDA Fortran code I want to compile, what Linux switches do I need to use to make it work with maximum efficiency on a Tesla C1060?
What is the maximum block size I can use?
Right now I am using a block size of (32,16,1) on my GeForce 460 v2.
Thanks,
Dolf

So now, if I have a .f90 file with CUDA Fortran code I want to compile, what Linux switches do I need to use to make it work with maximum efficiency on a Tesla C1060?

-Mcuda=cc13, though you don’t really need the cc13. By default, we generate device code for multiple compute capabilities. So by adding cc13 you’re just minimizing a bit of code bloat.
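
For example, assuming the source file is named mykernel.f90 (a made-up name for illustration), the compile line on Linux would be:

pgfortran -Mcuda=cc13 mykernel.f90 -o mykernel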

What is the maximum block size I can use?

Run the command “pgaccelinfo” to see information about your device, including the max block size. For a C1060:

Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1

Right now I am using a block size of (32,16,1) on my GeForce 460 v2.

That will work, though I've found that 16x16 typically works better. Granted, the block size is problem dependent, so you will want to experiment to see what works best for you.
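
As an illustration, setting up that kind of launch in CUDA Fortran looks something like the sketch below; the kernel name and the array extents N and M are placeholders:

use cudafor
type(dim3) :: grid, block
! 16x16 threads per block (256 threads total)
block = dim3(16, 16, 1)
! enough blocks to cover an N x M array, rounding up
grid = dim3((N + block%x - 1) / block%x, (M + block%y - 1) / block%y, 1)
call mykernel<<<grid, block>>>(a_d, N, M)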

  • Mat

-Mcuda=cc13, though you don’t really need the cc13.

What does the cc13 switch do? How can I add that to the Visual Studio 2010 properties when I compile on Windows?

Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1

Does that mean that I can use a block of (512,512,1) on the Tesla C1060 instead of the (32,16,1) I am currently using on the GeForce?

That will work, though I've found that 16x16 typically works better.

What do you mean by “works better”? Does it mean faster? Speed is the most important factor for my code.

many thanks,
Dolf

“cc13” sets which compute capability (CC) to target. The Tesla C1060 is CC 1.3 (hence cc13), while your GeForce GTX 460 is CC 2.1. You can find which device supports which CC here: https://en.wikipedia.org/wiki/CUDA

Does that mean that I can use a block of (512,512,1) on the Tesla C1060 instead of the (32,16,1) I am currently using on the GeForce?

No, the product of the three dimensions cannot exceed the maximum threads per block (i.e., 512). So you can have (512,1,1), (1,512,1), (32,16,1), (8,1,64), etc., but not (512,512,1), which would be 512 x 512 = 262,144 threads.

What do you mean by “works better”? Does it mean faster? Speed is the most important factor for my code.

Using the maximum number of threads may not always produce the fastest code. It’s best to try a variety of schedules to see which one is optimal.
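
One way to compare schedules is to time each candidate block size with CUDA events. A minimal sketch, assuming a kernel and launch configuration like the earlier example (names are placeholders):

use cudafor
type(cudaEvent) :: start, finish
real :: ms
integer :: istat
istat = cudaEventCreate(start)
istat = cudaEventCreate(finish)
istat = cudaEventRecord(start, 0)
call mykernel<<<grid, block>>>(a_d, N, M)
istat = cudaEventRecord(finish, 0)
istat = cudaEventSynchronize(finish)
! elapsed time in milliseconds between the two events
istat = cudaEventElapsedTime(ms, start, finish)
print *, 'block (', block%x, ',', block%y, ') took ', ms, ' ms'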

  • Mat

Perfect, I will try a couple of schedules and see.

Also, Mat, I wanted to ask whether we can call a kernel from another kernel, and how we can do that.

So basically, I have two nested do loops; inside them I call a subroutine which has another two nested loops. I want to run all four loops on the GPU to increase speed.

Please advise.

Dolf

Also, Mat, I wanted to ask whether we can call a kernel from another kernel, and how we can do that.

In CUDA Fortran, kernels can call “device”-attributed routines. As for dynamic parallelism, i.e., when one kernel launches another global kernel using the chevron syntax, we are in the process of adding this to the 13.x compilers (though you will need a K20 device).
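
A minimal sketch of the device-routine case (module, routine, and variable names are all invented for the example):

module scale_mod
  use cudafor
contains
  ! device routine: callable only from other device code
  attributes(device) real function scale_val(x, s)
    real, value :: x, s
    scale_val = x * s
  end function scale_val

  ! global kernel that calls the device routine
  attributes(global) subroutine apply_scale(a, s, n)
    real :: a(*)
    real, value :: s
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = scale_val(a(i), s)
  end subroutine apply_scale
end module scale_mod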

For OpenACC, you currently must inline routines (either manually or via compiler inlining) into compute regions (i.e., there is no true calling support). However, the proposed OpenACC 2.0 specification does add the “routine” directive to help with this, and we are looking into ways for the compiler to do it automatically. However, these features won't be available until later this year.
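
Based on the proposed OpenACC 2.0 syntax (subject to change until the spec is final), a device-callable routine would be marked like this sketch; all names here are invented:

program test_routine
  implicit none
  integer, parameter :: n = 1024, m = 1024
  real :: a(n, m)
  integer :: j
  a = 1.0
  ! each loop iteration calls the routine from device code
  !$acc parallel loop
  do j = 1, m
    call scale_row(n, 2.0, a(:, j))
  end do
  print *, a(1,1), a(n,m)
contains
  subroutine scale_row(n, s, row)
  !$acc routine seq
    integer, value :: n
    real, value :: s
    real :: row(n)
    integer :: i
    do i = 1, n
      row(i) = s * row(i)
    end do
  end subroutine scale_row
end program test_routine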

  • Mat

Thanks, Mat. I will keep an eye out for it.

cheers,
Dolf

Hi Mat,

Not sure if I asked this before, but I have something strange that does not make sense to me.
I compiled a CUDA Fortran code on my machine, which has the PGI 12.3 compiler, CUDA Toolkit 4.0, and a GeForce GTX 460 installed. My code runs fine on that machine.
On the other hand, I have another machine with a Tesla C1060 and CUDA Toolkit 5.0 installed. The first time I ran the same code there, it complained about cudart64_40_17.dll, which I then copied over from my machine. Now it does not complain, but it hangs and shows nothing.

What do you think could be the problem? What exactly do I need to have installed on the machine with the Tesla card in order to run the CUDA Fortran code?

thanks,
Dolf

Hi Dolf,

What type of CPU and Windows version are in use on both systems? I've seen this type of behaviour on Win7 systems using a Sandy Bridge (AVX-enabled) CPU. Win7 didn't begin supporting AVX until SP1.

Beyond that, I’m not sure.

  • Mat

I think you have a good point.
My machine (which has the PGI compiler) has the following specs:

  1. Core i7-2600 processor
  2. Windows 7 Ultimate SP1
  3. CUDA Toolkit 4.0
  4. GeForce GTX 460 v2 card

The other machine, where I want to run the code:

  1. Core 2 Quad processor (Q9650)
  2. Windows 7 Professional SP1
  3. CUDA Toolkit 5.0
  4. Tesla C1060 GPU card

So, could it be that the code I am compiling targets a different processor than the testing machine has? How can I fix that?
Is there a command in Fortran to check the number of processors and divide the task between them?

What command can I use in CMD to get the specs of the Tesla card? I tried pgaccelinfo and it did not work, since I don't have the PGI compiler installed there.

thanks.
Dolf

So, could it be that the code I am compiling targets a different processor than the testing machine has? How can I fix that?

You can set the target processor flag (-tp) to target the lowest common CPU (-tp penryn-64), a generic CPU (-tp px-64), or a unified binary (-tp=sandybridge-64,penryn-64).
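
For example, to build one binary that runs well on both of the machines above (the file name is a placeholder):

pgfortran -tp=sandybridge-64,penryn-64 -Mcuda mycode.f90 -o mycode.exe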

Is there a command in Fortran to check the number of processors and divide the task between them?

For host code, there is the auto-parallelization flag (-Mconcur), which will parallelize loops (provided there are no dependencies).
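
As an illustration, a dependence-free loop like the one in this sketch is the kind of host code -Mconcur can split across CPU threads (build with something like pgfortran -Mconcur -Minfo=par):

program saxpy_host
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), y(n), a
  integer :: i
  a = 2.0
  x = 1.0
  y = 0.0
  ! each iteration is independent, so the compiler
  ! can parallelize this loop under -Mconcur
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
  print *, y(1), y(n)
end program saxpy_host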

What command can I use in CMD to get the specs of the Tesla card? I tried pgaccelinfo and it did not work, since I don't have the PGI compiler installed there.

While I haven’t used it on Windows, you may try NVIDIA’s smi utility: https://developer.nvidia.com/nvidia-system-management-interface
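
Assuming the utility was installed along with the driver (the path below is the typical default and may differ on your system), running it with the query flag prints the full device details:

"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q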

  • Mat