Operators both on host and device functions

I’m using vector class on the host system that looks like this (plus functions for the other operators):

module VectorClass
implicit none

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    interface operator ( + )
        module procedure VectorAdd
    end interface

contains

    type ( Vector ) Function VectorAdd( v1, v2 )
    implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2

        VectorAdd % x = v1 % x + v2 % x
        VectorAdd % y = v1 % y + v2 % y
        VectorAdd % z = v1 % z + v2 % z

    end function VectorAdd

end module VectorClass

Now I’d like to use both the vector type and the operators (+, * and so on) both on the host and the device. Is that possible at all?

When defining VectorAdd with “attributes (device,host)” is just stops working for the rest of the host program. I also tried writing a VectorAddGPU function but when passing a vector to it I get a type mismatch error.
Is there a way around it or do I have to define completely separate vector types for my host and device parts of the program?

Hi Pschmidt,

Unfortunately, the combined “device,host” and “global,host” attributes turned out to be very difficult to implement. While we haven’t given up yet, we may need to drop this from the CUDA Fortran specification. So for now and in the foreseeable future, you’ll need to write two separate functions, one for the device and one for the host.

Sorry,
Mat

module VectorClass
implicit none

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    interface operator ( + )
        module procedure VectorAdd
    end interface

contains

    type ( Vector ) Function VectorAdd( v1, v2 )
    implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2

        VectorAdd % x = v1 % x + v2 % x
        VectorAdd % y = v1 % y + v2 % y
        VectorAdd % z = v1 % z + v2 % z

    end function VectorAdd

    type ( Vector ) attributes(device) Function VectorAddGPU( v1, v2 )
    implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2

        VectorAddGPU % x = v1 % x + v2 % x
        VectorAddGPU % y = v1 % y + v2 % y
        VectorAddGPU % z = v1 % z + v2 % z

    end function VectorAddGPU

end module VectorClass

This is the way it would have to be implemented, right? That way I could use the “+” operator on the host code and I’d have to call the VectorAddGPU function manually.
This still produces a type mismatch every time I try to call the device function.
Do I also have to declare a new type on the device?

This is the way it would have to be implemented, right? That way I could use the “+” operator on the host code and I’d have to call the VectorAddGPU function manually.
This still produces a type mismatch every time I try to call the device function.

Device functions are only callable from other device routines. Host code can only call global subroutines via the chevron syntax. How are you trying to call VectorAddGPU?

For example:

% cat test.cuf
module VectorClass
implicit none

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    interface operator ( + )
        module procedure VectorAdd
    end interface

contains

    type ( Vector ) Function VectorAdd( v1, v2 )
    implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2

        VectorAdd % x = v1 % x + v2 % x
        VectorAdd % y = v1 % y + v2 % y
        VectorAdd % z = v1 % z + v2 % z

    end function VectorAdd

    attributes(device) subroutine VectorAddGPU( v1, v2, v3 )
        implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2
        type ( Vector ), intent( out ) :: v3

        v3 % x = v1 % x + v2 % x
        v3 % y = v1 % y + v2 % y
        v3 % z = v1 % z + v2 % z

    end subroutine VectorAddGPU

    attributes(global) subroutine foo (v1,v2,v3,N)
        type ( Vector ) :: v1(N),v2(N),v3(N)
        integer, value :: N,i

        i = (blockIdx%x-1)*blockDim%x + threadIdx%x
        call VectorAddGPU(v1(i),v2(i),v3(i))
    end subroutine foo

end module VectorClass

program testme

    use VectorClass
    integer,parameter :: N=256
    type(Vector), dimension(N) :: v1,v2,v3
    type(Vector), dimension(N),device :: d1,d2,d3

    do i=1,N
       v1(i)%x=1.0d0
       v1(i)%y=1.0d0
       v1(i)%z=1.0d0
       v2(i)%x=2.0d0
       v2(i)%y=2.0d0
       v2(i)%z=2.0d0
    end do
    d1=v1
    d2=v2
    call foo<<<N/32,32>>>(d1,d2,d3,N)
    v3=d3
    print *, v3(1)%x

end program testme
% pgf90 test.cuf -V10.9; a.out
    3.000000000000000

Hope this helps,
Mat

That fixes the function call but not the type mismatch.
I use a set of global variables which are declared in a global variable module using the attributes(device) tag.
When passing variables from the VectorClass module to the subroutine it works but when passing variables from the global variables module (which is used in the VectorClass) it tells me that there are type mismatches, it won’t accept those variables.
So “call VectorAddGPU(x, y, z)” works but “call VectorAddGPU(x, global_y, global_z)” will produce type mismatch errors for gobal_y and global_z (despite them being declared as device variables and the module being used).
How do I avoid that?

Hi Pschmidt,

Do you mind posting a reproducing example?

Thanks,
Mat

module calculation_on_gpu
use vector_module ! in here the vector type is defined
use global_variables_module
implicit none

contains

attributes(global) subroutine calculate_my_stuff
implicit none

type (Vector) :: a

call VectorAddGPU( a, global_variable_1, global_variable_2 )

end subroutine calculate_my_stuff

attributes(device) subroutine VectorAddGPU( res, v1, v2 )
implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2
        type ( Vector ), intent( out ) :: res

        res % x = v1 % x + v2 % x
        res % y = v1 % y + v2 % y
        res % z = v1 % z + v2 % z

end subroutine VectorAddGPU

end module calculation_on_gpu



module global_variables_module
use vector_module
implicit none

        type (Vector) :: several_host_variables
        type (Vector),device :: global_variable_1, global_variable_2

end module global_variables_module

This will produce a type mismatch for the two Vectors declared in the global variables file. The local Vector “a” seems to work fine, the global ones don’t.
Using host variables from the global file in a host routine works perfectly fine. I guess it’s a problem with device and host variables using the same type?

Hi Pschmidt,

Currently, module device data must be in the same module as where it’s used. The problem is that there isn’t a linker for GPU code, hence no way to associate GPU symbols from one object to another.

In the coming 11.0 release due this November, we will have a way to essentially do the module data linking at run time. (See http://www.pgroup.com/lit/articles/insider/v2n3a1.htm for details). The caveat being it will only be available on “Fermi”.

In the mean time, to fix the code simply move the declarations of global_variable_1 and global_variable_2 in the calculation_on_gpu module.

% cat psc.cuf
module vector_module

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

end module vector_module


module global_variables_module
    use vector_module

   type (Vector) :: several_host_variables

end module global_variables_module

module calculation_on_gpu
  use vector_module ! in here the vector type is defined
  use global_variables_module

implicit none

        type (Vector),device :: global_variable_1, global_variable_2

contains

attributes(global) subroutine calculate_my_stuff
implicit none

type (Vector) :: a

call VectorAddGPU( a, global_variable_1, global_variable_2 )

end subroutine calculate_my_stuff

attributes(device) subroutine VectorAddGPU( res, v1, v2 )
implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2
        type ( Vector ), intent( out ) :: res

        res % x = v1 % x + v2 % x
        res % y = v1 % y + v2 % y
        res % z = v1 % z + v2 % z

end subroutine VectorAddGPU

end module calculation_on_gpu

% pgf90 -c psc.cuf -V10.9
%
  • Mat

Ok I seem to be closing in on the problem ;)

When using this all-in-one-module approach (it’s pretty messy …) I also need a particle type which contains vectors:

module vector_module

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    type :: Particle
        type (Vector) :: position
        type (Vector) :: velocity
    end type Particle

end module vector_module

When calling the VectorAddGPU function with “some_particle % position” as argument I still get the type mismatch (“some_particle” is of course declared within the module as device variable).

Funny enough when I define the type like this:

    type :: Particle
        type (Vector),allocatable,device :: position
        type (Vector),allocatable,device :: velocity
    end type Particle

I don’t get the type mismatch. But I don’t really want them to be allocatable (they have fixed sizes after all) and I don’t really understand why I have to include the device attribute within the type declaration (it keeps me from using the same types on both the host and device code).

This actually just got worse:
After working out all other (fortran) compiler errors I now get:

/tmp/pgcudaforR7qbNWYM8Hi.gpu(100): error: argument of type "signed char" is incompatible with parameter of type "signed char *"

So obviously the Fortran code compiles well but the temp files then produce a C error message (it looks like C, right?) because the allocatable attribute in the type definition screws stuff up.
On the other hand I need the allocatable attribute or the gpu function will produce the type mismatches described above.

Is there any way out of this? Or is it simply impossible to use types on the GPU like that?

Hi Pschmidt,

We did have a known issue with character constants (See:Character Constant Error and GPU Machine Constant Question) that was fixed in the 10.9 release. However, the error you’re getting is different and I do not see any other previously reported problems that match it.

Is there any way out of this? Or is it simply impossible to use types on the GPU like that?

I’m not sure since this is the first report of this error. Sorry to have to ask for this again but we’ll need a reproducing example in order to determine the problem. If the source is too big to be posted here, please send it to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

Thanks,
Mat

It’s pretty much the same as the code posted above:

module vector_module

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    type :: Particle
        type (Vector) :: position
        type (Vector) :: velocity
    end type Particle

end module vector_module


module global_variables_module
    use vector_module

   type (Vector) :: several_host_variables

end module global_variables_module

module calculation_on_gpu
  use vector_module ! in here the vector type is defined
  use global_variables_module

implicit none

        type (Vector),device :: global_variable_1, global_variable_2
        type (Particle),device :: particle1, particle2

contains

attributes(global) subroutine calculate_my_stuff
implicit none

type (Vector) :: a

call VectorAddGPU( a, particle1 % velocity, global_variable_2 )

end subroutine calculate_my_stuff

attributes(device) subroutine VectorAddGPU( res, v1, v2 )
implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2
        type ( Vector ), intent( out ) :: res

        res % x = v1 % x + v2 % x
        res % y = v1 % y + v2 % y
        res % z = v1 % z + v2 % z

end subroutine VectorAddGPU

end module calculation_on_gpu

This will give a type mismatch.

I can change line 31 into

    type (Particle),allocatable,device :: particle1, particle2

(actually I’m using a particle array that needs to be allocatable since the length in determined in run-time but the problem can be reproduced even without the array)
and use this type:

    type :: Particle
    type (Vector),allocatable,device :: position
    type (Vector),allocatable,device :: velocity
    end type Particle

which generally doesn’t make much sense in my opinion but it at least gets rid of the type mismatch - but then it produces the error “argument of type “signed char” is incompatible with parameter of type “signed char *””.

Hi Pschmidt,

which generally doesn’t make much sense in my opinion but it at least gets rid of the type mismatch - but then it produces the error “argument of type “signed char” is incompatible with parameter of type “signed char *””.

I think we should ignore this problem since it has to do with the allocatable being applied to the vector type within the particle type. As you point out, it doesn’t make much sense to do this so let’s focus on the type mismatch.

In the current version of CUDA Fortran, all but simple derived type are known to be supported. For more complex types, such as types within types, there are known issues. The good news is that we are working on it and using our internal development compiler I can compile your code as is.

As a work-around until this support is available in November’s 11.0 release, try using a local vector type, copy “particle1%velocity” to this local vector, and then pass the local vector to VectorAddGPU.

 
% cat test2.cuf
module vector_module

    type :: Vector
        Real( 8 ) :: x
        Real( 8 ) :: y
        Real( 8 ) :: z
    end type Vector

    type :: Particle
        type (Vector) :: position
        type (Vector) :: velocity
    end type Particle


end module vector_module


module global_variables_module
    use vector_module

   type (Vector) :: several_host_variables

end module global_variables_module

module calculation_on_gpu
  use vector_module ! in here the vector type is defined
  use global_variables_module

implicit none

     type (Vector),device :: global_variable_1, global_variable_2
     type (Particle), device :: particle1, particle2

contains

attributes(global) subroutine calculate_my_stuff
implicit none

type (Vector) :: a,b

#ifdef WORK_AROUND
b=particle1%velocity
call VectorAddGPU( a, b, global_variable_2 )
#else
call VectorAddGPU( a, particle1%velocity, global_variable_2 )
#endif

end subroutine calculate_my_stuff

attributes(device) subroutine VectorAddGPU( res, v1, v2 )
implicit none
        type ( Vector ), intent( in ) :: v1
        type ( Vector ), intent( in ) :: v2
        type ( Vector ), intent( out ) :: res

        res % x = v1 % x + v2 % x
        res % y = v1 % y + v2 % y
        res % z = v1 % z + v2 % z

end subroutine VectorAddGPU

end module calculation_on_gpu
% pgf90 test2.cuf -c -Mpreprocess -V10.9
PGF90-S-0188-Argument number 2 to vectoraddgpu: type mismatch (test2.cuf: 45)
  0 inform,   0 warnings,   1 severes, 0 fatal for calculate_my_stuff
% pgf90 test2.cuf -c -Mpreprocess -V10.9 -DWORK_AROUND
%
  • Mat

This does not work for me. Executing exactly the code you posted I get:

PGF90-S-0000-Internal compiler error. unsupported terminal operator     264 (test2.cuf: 48)

In my code every line that tries what you suggested (local_vector = some_type_with_vector % the_vector) will produce an additional error message like that.

I tried both the 10.8 and 10.9 release and it doesn’t worth with either.

Hi Pschmidt,

Sorry about that. This appears to a 32-bit only issue (I was compiling in 64-bits) when copying “b=particle1%velocity”. Look like we can work around this by copying each element individually:

#ifdef WORK_AROUND
b%x=particle1%velocity%x
b%y=particle1%velocity%y
b%z=particle1%velocity%z
call VectorAddGPU( a, b, global_variable_2)
#else
call VectorAddGPU( a, particle1%velocity, global_variable_2 )
#endif
  • Mat

Thanks, that works. I got code that compiles correctly now. Sadly there seems to be a problem with the memory allocation (which only produces errors on runtime, not during compiling).
That’s why I got one more question about the vectors:
Assuming I have a dynamic array of vectors:

type ( Vector ), allocatable, dimension( : ),device :: device_vector

How do I allocate that properly? The manual says that both the normal Fortran allocate and the cudaMalloc command only work with the intrinsic data types. When using “normal” Fortran style allocation

allocate( device_vector ( length ) )

it compiles just fine but it tells me

0: ALLOCATE: 0 bytes requested; status = 0(no error)

and stops the program.
Is there a way around it? cudaMalloc also can’t handle the vector types.

The problem is actually worse than I thought. I followed the memory allocation problem and this is what I got:

program my_program
use gpu_calculation
implicit none

call some_init_stuff

end

module gpu_calculation
implicit none
save

Integer,device :: number

contains

subroutine some_init_stuff

    number = 5

end subroutine some_init_stuff

This will produce a segmentation fault. The line “number = 5” is what breaks. Any time I try to actually write to any of the device variables declared in the module there will be a segmentation fault, even for the statically declared variables.

Any ideas?

Hi Pschmidt,

I needed to move your module before the main program to get it compile, but the program seemed to work fine.

Most likely, there is something wrong with your GPU setup. Are you using the CUDA libraries that accompany the PGI compilers? What is the output of ‘pgaccelinfo’?

  • Mat
% cat test.cuf
module gpu_calculation
implicit none
save

Integer,device :: number

contains

subroutine some_init_stuff

    number = 5

end subroutine some_init_stuff

end module gpu_calculation

program my_program
use gpu_calculation
implicit none
integer hn
call some_init_stuff
hn=number
print *, hn
end

% pgf90 test.cuf -V10.9 ; a.out
            5
/opt/pgi/linux86/10.9/bin$ ./pgaccelinfo 
CUDA Driver Version:           3010

Device Number:                 0
Device Name:                   GeForce GTX 275
Device Revision Number:        1.3
Global Memory Size:            938803200
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1404 MHz
Initialization time:           1628 microseconds
Current free memory:           734973440
Upload time (4MB):             1120 microseconds (1310 ms pinned)
Download time:                 2842 microseconds ( 821 ms pinned)
Upload bandwidth:              3744 MB/sec (3201 MB/sec pinned)
Download bandwidth:            1475 MB/sec (5108 MB/sec pinned)
                                                                                                                                                                                    
Device Number:                 1                                                                                                                                                    
Device Name:                   GeForce GTX 275                                                                                                                                      
Device Revision Number:        1.3                                                                                                                                                  
Global Memory Size:            939327488                                                                                                                                            
Number of Multiprocessors:     30                                                                                                                                                   
Number of Cores:               240                                                                                                                                                  
Concurrent Copy and Execution: Yes                                                                                                                                                  
Total Constant Memory:         65536                                                                                                                                                
Total Shared Memory per Block: 16384                                                                                                                                                
Registers per Block:           16384                                                                                                                                                
Warp Size:                     32                                                                                                                                                   
Maximum Threads per Block:     512                                                                                                                                                  
Maximum Block Dimensions:      512, 512, 64                                                                                                                                         
Maximum Grid Dimensions:       65535 x 65535 x 1                                                                                                                                    
Maximum Memory Pitch:          2147483647B                                                                                                                                          
Texture Alignment:             256B                                                                                                                                                 
Clock Rate:                    1404 MHz                                                                                                                                             
Initialization time:           1628 microseconds                                                                                                                                    
Current free memory:           898699008                                                                                                                                            
Upload time (4MB):             1160 microseconds ( 995 ms pinned)                                                                                                                   
Download time:                 2487 microseconds ( 936 ms pinned)
Upload bandwidth:              3615 MB/sec (4215 MB/sec pinned)
Download bandwidth:            1686 MB/sec (4481 MB/sec pinned)

I’m compiling with

-L/opt/pgi/linux86/2010/cuda/3.1/lib -lcudart -lcublas -lcuda -lcudafor
-L/opt/pgi/linux86/2010/cuda/3.1/lib -lcudart -lcublas -lcuda -lcudafor

You shouldn’t need this, but it shouldn’t hurt either. Try compiling with just “pgfortran -Mcuda=cuda3.1 test.cuf” to see what happens. Most likely, it wont make a difference.

The pgaccelinfo output looks fine so the next thing to try is to compile a CUDA C program with nvcc. You’ll need to download it from NVIDIA if you haven’t already. This will tell us if it’s a problem with your device or a problem with the PGI configuration.

  • Mat