Problems with FORTRAN Accelerator and subroutines

Hi! I am modifying a simple SWE (Shallow Water Equations) 1D in order to use PGI Accelerator 11.6 (and Visual Studio 2008) on a nVidia Tesla C2050.

The code has a structure like this:

CALL sub1
CALL sub2

DO i = 1,N

CALL sub3
...
CALL subM

END DO

I have used “!$acc reflected” and “!$acc region do” in every subroutine inside the DO loop (since those subroutines consist essentially of one or more DO loops):

MODULE mod3

CONTAINS

SUBROUTINE sub3

...

!$ acc reflected(arr1,...)

!$acc region do

DO i = 1,N

...

END DO

!$acc end region

Then I created a “!$acc data region copy” outside the main program DO loop to avoid continuos data transfer between the host and the device:

CALL sub1
CALL sub2

!$acc data region copy(arr1,...)

DO i = 1,N

CALL sub3
...
CALL subM

END DO

!$acc end data region

The results I obtain with the accelerated code are different from the non-accelerated code. It looks like values in the arrays are not being updated correctly. I also tried “mirror” and “update device” (bedore entering the main program’s DO loop) combination, but i get an error (about copying datas from host to device).

I’ve uploaded the zipped VS 2008 project on MediaFire since it’s a little bit long and it would have appeared as a mess on this post:

http://www.mediafire.com/?pobzr88jeaglzct


Thanks in advance for the help,

Nicola

Update: it looks like I solved the “results’ problem”. It was necessary to “!$acc update device” the first two arrays (Up, DxU) in the DO loop inside the main program and to remove the Accelerator’s directives inside the Delta module.

Now it’s time to increase the poor (sigh) performances of the code.

I’ve another question: which is a good strategy to improve performances on a DO loop with several IF statements inside? For example, a structure like this:

DO i = 1,N

...

  IF(i==1) THEN
     ...
  ELSE IF(i==N) THEN
    ...
  ELSE
    ...
    IF(clause1) THEN
      ...
    ELSE IF(clause2) THEN
      ...
    ELSE
      ...
    END IF

  END IF

...

END DO

Is there any chance of improving or do I need to think about a totally different approach? Sorry, if the question is so “basical”, but this is the first time I get into parallel-programming world.

Hi Nicola,

I’m assuming you mean that the do loop is within an accelerator region and the if statements are part of your device kernel.

While branching is allowed, it should be avoided since it can degrade GPU performance due to thread divergence. Threads within the same warp execute the same instructions at the same time, just on different data (i.e. SIMD). Hence, if different threads within the same warp take different branches, then they need to take turns issuing their instructions, causing slow downs. (Note a good primer on the NVIDIA CUDA threading model can be found at: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm).

So for your code, the first “i==1” and “i==N” aren’t a big problem. Yes, you’ll see a slight slow down, but it only occurs in two warps so not a big deal. You could hoist these cases outside the do loop and then iterate from i+1 to N-1, but this requires them to be executed on the host or in a serial device kernel. Probably not worth it, but you can experiment.

The interior if statement is more problematic. If you can arrange the data and schedule so that all the threads in a warp execute the same branch, then it’s fine. Otherwise, you’ll lose performance.

  • Mat

Hi Mat, I took some time in order to try different ways with PGI Accelerator and with PGI CUDA FORTRAN. I encountered some problem with both alternatives:

  1. PGI Accelerator: I used the mirror/reflected clause in order to call the different subroutines. These routines are called inside a do loop which is located inside the main program. I want to speed-up the code by using an accelerator region which contains that do loop but the compiler gives me several errors due to the presence of some update clauses and calls to the routines:
DO i = 1,N

	Up	= U
	DxU	= 0.0d0
	
	!$acc update device(Up,DxU)

	Up	= U
	DxU	= 0.0d0

	! Calcolo del passo temporale

	CALL Deltat(Dry,U,W,ULR)
	
	CALL Slopes(eta,Dry,U,DxU)

	CALL Predictor(zc,DxZ,eta,Dry,U,DxU,Up)

	CALL Reconstruction(zf,eta,Dry,U,DxU,ULR)

	CALL Fluxes(Dry,F,ULR)

	CALL SourceTerms(zc,zf,Dry,U,S0,Sf)

	CALL ComputeDU(Dry,F,S0,Sf,DtU)

	CALL Update(zf,eta,W,U,Dry,DtU)

	IF(t>tsim) i = N

END DO

Note that i changed the do-while loop into a do loop with an if statement in order to run it with the Accelerator.

  1. PGI CUDA FORTRAN: in this version of the code I used a call to a global kernel subroutine (from the main program) which contains several calls to device kernel subroutines. The problem is that I get no computed values from those routines.

If you want to look at the second code, I have uploaded it on MediaFire:

http://www.mediafire.com/?ccl6fstjupr1hep



Many thanks for your help,

Nicola.

Hi nicolaprandi,

  1. PGI Accelerator: I used the mirror/reflected clause in order to call the different subroutines. These routines are called inside a do loop which is located inside the main program. I want to speed-up the code by using an accelerator region which contains that do loop but the compiler gives me several errors due to the presence of some update clauses and calls to the routines:

What errors are you getting?

  1. PGI CUDA FORTRAN: in this version of the code I used a call to a global kernel subroutine (from the main program) which contains several calls to device kernel subroutines. The problem is that I get no computed values from those routines.

I took a look at the code and see a few problems. First, if you add ‘implicit none’ to your Kernel you’ll see that you’re missing the Nc, Nf, and ndt variables. Second, I don’t see where you copy back any data back from the device.

  • Mat

Hi Mat, thanks for the answer.

I get the following errors:

     63, Accelerator restriction: function/procedure calls are not supported
     70, Accelerator restriction: function/procedure calls are not supported

Where did I go wrong?

One more question. I don’t need to use an update clause after these lines:

	Up	= U
	DxU	= 0.0d0

	!$acc update(Up,DxU)

do I? Since the code is inside a compute region, the sequential part should already be executed on the device.

I fixed the missing variables and uncommented the copy of the array Out (which should contain the value I need) from the device to the host. In particular, the values I need are the ons from the following arrays:

xc(i), zc(i), W(i,1), zc(i)+W(i,1), W(i,2)

Again, I uploaded the two VS projects on MediaFire:

Where did I go wrong?

You can’t make calls within compute region. You need to simply remove the “!$acc region/end region” from this section of code and just use the data region. This will also allow you to put the while loop back in.

The data region should be changed to use ‘local’ instead of ‘mirror’’. ‘mirror’ mirrors the allocation status of allocatable arrays. Since you allocated these arrays earlier in the program, you need use local instead.

Finally, instead of using the “update” clause which updates host/device memory, you should put “Up=U” and “DxU=0.0d0” in an accelerator region. These are device variables, so need to be executed on the device.

!$acc data region local(zc,zf,DxZ,eta,Dry,W,U,ULR,DxU,Up,F,S0,Sf,DtU)

N = 10000                                       ! This is just a start value for N (it should be much higher)

DO i = 1,N                                      !WHILE(t<tsim>tsim) i = N

END DO

!$acc end data region

For the CUDA Fortran version, your grid X dimension is in correct. You have it as “Nf/Nt”. However, since Nf is 501 and Nc is 200, X is 0. To correct, use “(Nf+Nt-1)/Nt”. However, in your kernel you will need to add some type of guard so that any extra threads don’t execute. Otherwise your kernel will get an out of bounds error.

You will get a launch error since your program uses a lot of registers. You will need to reduce the the number of threads per block from 512 to something like 64. Use “-Mcuda=ptxinfo” to see the number of registers being used.

Looks like there are other issues as well, but I’ll leave those to you.

Note that CUDA Fortrran has an emulation mode (-Mcuda=emu) which allows you to compile with “-g” and debug your code in the PGI debugger, PGDBG. This what I used to figure out the above error.

Also, I would add error checking. Add the following code after your kernel launch:

ierr = cudagetlasterror()
if( ierr .ne. 0 )then
  print *,'last error is ', ierr
  print *,  cudageterrorstring( ierr )
 endif

Hope this helps,
Mat

Hi Mat, thanks for your reply.

I tried to enclose the main do loop inside a compute region in order to improve the performances of the code. Before doing that, the code ran correctly but very slow (even slower than the code on the CPU). Should I remove the calls to the subroutine in order to accelerate the whole code (including the main loop) or is there a simpler approach (I mean simpler as “less confusionary”)?


I have encountered several problems when i tried to run PGDBG:

  1. On its manual I haven’t found the way to start it when debugging in VS Shell (I guess there is a flag other than -g, right?);
  2. I tried to launch it from the x64 PGI command shell (I’m on w7 x64) and it appears I miss a file: jpgdbg.jar. I looked in the PGI/…/bin directory: the file is missing (I tried to execute the profiler and it’s working, in fact there is the file jpgprof.jar). I re-installed the PGI suite but the file is still missing. What have I to do?
  3. I used the VS Shell debugger (F5) along with the -g flag and fixed some mistakes but it doesn’t show the correct values on device arrays. I guess I should use PGDBG, should I?


    Thanks again, Nicola.

Hi Nicola,

Should I remove the calls to the subroutine in order to accelerate the whole code (including the main loop) or is there a simpler approach (I mean simpler as “less confusionary”)?

Can the value of Nc get larger? The current value of 512 only uses a single block and is why your performance is so poor. If I change Nc to 32768, my GPU performance is about 4 times faster then the CPU.

Performance would get better if you can accelerator the outer time step loop. Though, to do this you would need to inline the called routines (either by hand or via the compiler flag -Mipa=inline). However, I do see several issues that would need to be fixed before the compiler can parallelize the outer time step loop. For example, the Up, ULR, and DxU variables need to be privatized since each iteration of the time step loop uses their own unique values. You need to remove the “Output” routine since it uses I/O. You need ot remove the “IF(t>tsim) i = N” since this causes a dependency.

Since your code is small enough, I would try commenting out all the calls, putting an ACC region around the time step loop, and then, one at a time, manually inline each routine. Use the compiler informational messages (-Minfo=accel) to help you understand where the problems are.

I have encountered several problems when i tried to run PGDBG:

Sorry, I forgot you are using PVF. When you run using the VS debugger (F5), your really using PGDBG, just with a VS GUI interface. No need to run PGDBG by itself.

  1. I used the VS Shell debugger (F5) along with the -g flag and fixed some mistakes but it doesn’t show the correct values on device arrays.

I’m not sure since it should.

One thing to think about is that you really want to have a large amount of parallelism. In your CUDA Fortran code, you simply moved the time step loop into a kernel and launch a single block of 64 threads. Instead, what you want to think about is how you can have each kernel be a single iteration of the time step loop. Then the 10000 loop iterations could be turned into 10000 threads.

A kernel is just a serial bit of code that gets replicated across many threads. Each thread executes the same code, just on different data.

  • Mat

Hi again! I worked a little bit more on the CUDA FORTRAN code and I obtained the same results as the one given by the original program (which uses the CPU) after one time-step: that’s great! But when I try to run more than a few time-steps in emulation mode, the program crashes (I used a do loop with a “c” index in place of the do-while loop in order to test the program). If I remove the emulation mode’s flag, the compiler gives me the following error:

...AppData\Local\Temp\pgcudafor2aBeDrVBDN9Jm.gpu(4): Error: Formal parameter space overflowed (256 bytes max) in function kernel

Any suggestion about it? As the usual, I’ve upped the latest version of my code on MediaFire:

http://www.mediafire.com/?f87o2dc2jkq6lx8


Thanks, Nicola.

Hi Nicola,

I obtained the same results as the one given by the original program (which uses the CPU) after one time-step: that’s great!

Unfortunately, I think this is just by luck. When I run your code I get an error in Fluxes_kernel when accessing the ULR arrays. You initialize this array in Reconstruction_kernel, but since in my case not every thread has executed, parts of ULR contain garbage by the time its accessed in Fluxes_kernel.

There isn’t any global synchronization in CUDA except between different kernel calls. There is limited kernel synchronization, but only between threads in the same block. You’ll need to revisit your algorithm and try and remove dependencies. Also remember that different blocks can be executing different time steps so you can’t reinitialize your global arrays. You’ll need to try and privatize them (i.e. each thread has it’s own copy or add an extra dimension for each time step). Does the values of ULR and the other global arrays change from time step to time step? If not, try pre-computing them.

Error: Formal parameter space overflowed (256 bytes max) in function kernel

CUDA limits the amount of data that can be passed in via arguments. Your over this limit. Instead of passing in all your arrays, declare them in your Main_Kernel module. They can then be accessed directly by all device routines in your module so no need to pass them in.

Hope this helps,
Mat

Thanks Mat, tomorrow I’ll check the code and try to fix the access to the ULR array. While going back home, I came up with some other questions:

  • About general synchronization. As you said, there is no general procedure for doing so except with different kernel calls. If I want to achieve it, do I need to move the subroutines outside the main kernel and create a kernel for each one of those? By doing so, would I lose some performances?

  • The do-loop (with index “it”) should complete the whole “Nt” operations within the single thread block. How this is supposed to work with the (Nf+Nt-1)/Nt = 9 threads? Let’s suppose the code would work as it is, will it execute the Nt operations for 9 times?

  • I noted a strange behavior by running the code in debug mode (with emulation flag). It executes the main loop with index “c” equal to 1 for about 13-14 threads (index “it”), then changes to c = 2 and it startes over from 1. Shouldn’t it reach it = 64 before having c = 2? Does this happen because there is no thread block synchronization?


    Thanks in advance for answering, have a good day.

Nicola

If I want to achieve it, do I need to move the subroutines outside the main kernel and create a kernel for each one of those?

If you can’t remove the dependencies, then yes, this is the only way to guarantee global synchronization.

By doing so, would I lose some performances?

Yes, but how much will depend upon the the size of Nc. Right now it’s relatively small so can’t take full advantage of the GPU. If Nc was very large, then you may not loose as much.

The do-loop (with index “it”) should complete the whole “Nt” operations within the single thread block. How this is supposed to work with the (Nf+Nt-1)/Nt = 9 threads? Let’s suppose the code would work as it is, will it execute the Nt operations for 9 times?

Sorry, I’m not sure what you’re asking here. “Nt” defines the number of threads. " (Nf+Nt-1)/Nt" defines the number of blocks. Given Nt is 64 and Nf is Nc (512) + 1, or 513, you will have 9 blocks of 64 threads each.

I noted a strange behavior by running the code in debug mode (with emulation flag). It executes the main loop with index “c” equal to 1 for about 13-14 threads (index “it”), then changes to c = 2 and it startes over from 1. Shouldn’t it reach it = 64 before having c = 2? Does this happen because there is no thread block synchronization?

It’s more likely an artifact of how emulation mode is performed. We use OpenMP tasks to simulate device threads. Upon launch of your code, N number of OpenMP threads (N defaults to the number cores on your system). When you launch a kernel, a OpenMP task is added for each device thread to a queue. Then each OpenMP thread will start to execute a device thread from this queue. When a “syncthreads”, is encountered, then the device thread is put back on the queue and a new device thread starts up. So what you’re actually looking at when debugging, are OpenMP threads which will be executing one or more device threads.

  • Mat

Hi again, during these days I progressed a little bit and finished to write a working CUDA FORTRAN code: it was necessary to remove the outer kernel in order to synchronize the datas between the inner kernels.

I have some question involving optimization:

  • If I have:
U_Dev = U

is the same thing as having a kernel which copies every value from U to U_Dev? What I’d like to know is if there is only one thread working or if the compiler spawns many threads to manage the process.

  • If I have blocks with less than 64 values, only one or two blocks are executed on each MultiProcessor? I have found no clear explanation about it, it looks like “Several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources” (from nVidia’s slides).


    Thanks, Nicola.

s the same thing as having a kernel which copies every value from U to U_Dev? What I’d like to know is if there is only one thread working or if the compiler spawns many threads to manage the process.

This is a host to device copy so there are no device threads in use.

Since the data is contiguous, the host to device memory copy is performed in a single DMA transfer. In other words, the data is moved in one big block of memory, not element by element. When copying sub-sections of arrays, then the compiler may need to create multiple DMA transfers.

If I have blocks with less than 64 values, only one or two blocks are executed on each MultiProcessor?

The number of blocks executed at one time on a mutilprocessor (SM) will depend on how many resources each block uses. Each SM has a fixed number of registers and shared memory space. The more registers used per thread and more shared memory used per block, the fewer number of the threads and blocks that can be executed on an SM.

The amount of registers and shared memory varies by device. To see what’s available on your device, see the output of ‘pgaccelinfo’. To see how many resources your program uses, add the flag “-Mcuda=ptxinfo”.

A useful exercise for you might be to try and calculate your program’s occupancy using NVIDIA’s occupancy calculator. (http://forums.nvidia.com/index.php?showtopic=31279)

  • Mat

Thanks for the explanation. Today I finished writing the SWE1D code and got a 26x speedup versus the unparallelized CPU version: tomorrow I’ll try to add some tweaks. For the moment, I have another small question regarding the exceeding threads spawning during kernel calls. Would it be of any convenience to create two different kernel calls for each kernel, the first one to manage the fully-occupied thread blocks and the second one to manage the partially occupied thread block (which obviously would be fully occupied by changing the block size)?

I think it would be no good when the thread blocks’ number is not a multiple of the SMs’ number, since we would require one more step to spawn the last thread block.


Thanks again, Nicola.

Hi Nicola,

Today I finished writing the SWE1D code and got a 26x speedup versus the unparallelized CPU version:

That’s great!

Would it be of any convenience to create two different kernel calls for each kernel, the first one to manage the fully-occupied thread blocks and the second one to manage the partially occupied thread block (which obviously would be fully occupied by changing the block size)?

Often it does take experimentation to determine the best schedule, so I encourage you to try different ones. In this case though, I doubt you’ll see much difference. So long as you have many blocks, having one with a few threads that don’t do work, shouldn’t matter much.

  • Mat

Hi Mat, thanks for the answer. I tweaked a little bit further the code by implementing reduction (as shown in Mark Harris’ slides for CUDA C language: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) and several other minor changes in order improve the MIN calculation of a very large array: by doing so, I reached a 40x speedup.

New day, new question: is it possible to export a value from a kernel to the host when (inside the kernel) we meet a certain condition?

In particular, what I would like to achieve is the following: I pass the variable “t” inside a kernel, update it by adding the variable “Dt” (calculated in a previous kernel) and change the result if “t” is greater than “tsave”. If that condition applies, I need to put a flag scalar equal to 1 in order to save the data that I need later on inside the do-loop calling the kernels. Is there a possibility to export that flag scalaronly if “t” is greather than “tsave”? If I export the variable on the host each time, the code’s performances are cut down by 20-30% since it makes a lot of memory access operations.

I tried to mix CUDA FORTRAN w/PGI Accelerator (by using the mirror-reflected directives) with no success. I also tried if the cudaEvent functions but it looks like they could only be executed from the host: did I do something wrong with them?


Have a good day, Nicola.

is it possible to export a value from a kernel to the host when (inside the kernel) we meet a certain condition?

There isn’t a direct way in CUDA Fortran to have a kernel ‘push’ data back to host. Data movement is controlled by the host.

I tried to mix CUDA FORTRAN w/PGI Accelerator (by using the mirror-reflected directives) with no success.

You can mix the two and the PGI Accelerator model can access CUDA Fortran device variables. However, support for CUDA Fortran accessing PGI Accelerator model device variables is still in process. If you have an example of what you are try to do, I can put in a feature request.

Thanks,
Mat