Character Constant Error and GPU Machine Constant Question

Hello,

I have some code that produces either an ICE or errors in the generated .gpu code (which I assume is something like object code for the GPU). It has to do with passing a character type as a parameter to another device routine. If I instead try to pass the character constant directly (see the commented-out call to sgemv in the code below), I get an ICE: "unexpected address of constant". Otherwise, I get errors in the .gpu file like this:

/tmp/pgcudafor25EgI8rjAavK.gpu(43): error: expected an expression
/tmp/pgcudafor25EgI8rjAavK.gpu(44): error: expected an expression
/tmp/pgcudafor25EgI8rjAavK.gpu(45): error: expected an expression
/tmp/pgcudafor25EgI8rjAavK.gpu(46): error: too many arguments in function call

My suspicion is that this is a compiler error. I'm using version 10.6.

Here’s the offending code:

MODULE imsim_module

use cudafor

contains

subroutine kernel_setup(pixxwidth, pixywidth)

real :: pixxwidth, pixywidth
real, device :: pixxwidthDev, pixywidthDev
type(dim3) :: dimGrid, dimBlock

pixxwidthDev = pixxwidth
pixywidthDev = pixywidth

dimGrid = dim3( detxsize/16, detysize/16, 1 )
dimBlock = dim3( 16, 16, 1 )
call kernel<<<dimGrid,dimBlock>>>(pixxwidthDev, pixywidthDev)

end subroutine kernel_setup



attributes(device) LOGICAL FUNCTION MYLSAME( CA, CB )
CHARACTER CA, CB
MYLSAME = CA.EQ.CB
RETURN
END



attributes(device) SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
REAL ALPHA, BETA
INTEGER INCX, INCY, LDA, M, N
CHARACTER*1 TRANS
REAL A( LDA, * ), X( * ), Y( * )
REAL ONE , ZERO
PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 )
REAL TEMP
INTEGER I, INFO, IX, IY, J, JX, JY, KX, KY, LENX, LENY
LOGICAL MYLSAME
INTRINSIC MAX
CHARACTER*1 NN, TT, CC
NN = 'N'
TT = 'T'
CC = 'C'

INFO = 0
IF ( .NOT.MYLSAME( TRANS, NN ).AND. .NOT.MYLSAME( TRANS, TT ).AND. .NOT.MYLSAME( TRANS, CC ) )THEN
INFO = 1
ELSE IF( M.LT.0 )THEN
INFO = 2
ELSE IF( N.LT.0 )THEN
INFO = 3
ELSE IF( LDA.LT.MAX( 1, M ) )THEN
INFO = 6
ELSE IF( INCX.EQ.0 )THEN
INFO = 8
ELSE IF( INCY.EQ.0 )THEN
INFO = 11
END IF
IF( INFO.NE.0 )THEN
RETURN
END IF
IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR.( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) RETURN
IF( MYLSAME( TRANS, NN ) )THEN
LENX = N
LENY = M
ELSE
LENX = M
LENY = N
END IF
IF( INCX.GT.0 )THEN
KX = 1
ELSE
KX = 1 - ( LENX - 1 )*INCX
END IF
IF( INCY.GT.0 )THEN
KY = 1
ELSE
KY = 1 - ( LENY - 1 )*INCY
END IF
IF( BETA.NE.ONE )THEN
IF( INCY.EQ.1 )THEN
IF( BETA.EQ.ZERO )THEN
DO 10, I = 1, LENY
Y( I ) = ZERO
10 CONTINUE
ELSE
DO 20, I = 1, LENY
Y( I ) = BETA*Y( I )
20 CONTINUE
END IF
ELSE
IY = KY
IF( BETA.EQ.ZERO )THEN
DO 30, I = 1, LENY
Y( IY ) = ZERO
IY = IY + INCY
30 CONTINUE
ELSE
DO 40, I = 1, LENY
Y( IY ) = BETA*Y( IY )
IY = IY + INCY
40 CONTINUE
END IF
END IF
END IF
IF( ALPHA.EQ.ZERO ) RETURN
IF( MYLSAME( TRANS, NN ) )THEN
JX = KX
IF( INCY.EQ.1 )THEN
DO 60, J = 1, N
IF( X( JX ).NE.ZERO )THEN
TEMP = ALPHA*X( JX )
DO 50, I = 1, M
Y( I ) = Y( I ) + TEMP*A( I, J )
50 CONTINUE
END IF
JX = JX + INCX
60 CONTINUE
ELSE
DO 80, J = 1, N
IF( X( JX ).NE.ZERO )THEN
TEMP = ALPHA*X( JX )
IY = KY
DO 70, I = 1, M
Y( IY ) = Y( IY ) + TEMP*A( I, J )
IY = IY + INCY
70 CONTINUE
END IF
JX = JX + INCX
80 CONTINUE
END IF
ELSE
JY = KY
IF( INCX.EQ.1 )THEN
DO 100, J = 1, N
TEMP = ZERO
DO 90, I = 1, M
TEMP = TEMP + A( I, J )*X( I )
90 CONTINUE
Y( JY ) = Y( JY ) + ALPHA*TEMP
JY = JY + INCY
100 CONTINUE
ELSE
DO 120, J = 1, N
TEMP = ZERO
IX = KX
DO 110, I = 1, M
TEMP = TEMP + A( I, J )*X( IX )
IX = IX + INCX
110 CONTINUE
Y( JY ) = Y( JY ) + ALPHA*TEMP
JY = JY + INCY
120 CONTINUE
END IF
END IF

RETURN
END


attributes(global) subroutine kernel(pixxwidth, pixywidth)
real :: pixxwidth, pixywidth
character, device :: TT
real, device :: A(4,2), X(2), Y(2)
real, device :: alpha, beta

alpha = 1.0
beta = 1.0
TT = 'T'

! call sgemv('T', 4, 2, alpha, A, 4, X, 1, beta, Y, 1 )
call sgemv(TT, 4, 2, alpha, A, 4, X, 1, beta, Y, 1 )

end subroutine kernel

END MODULE imsim_module



On another note, I've been working with some code that I'm porting from Fortran 77. The code uses BLAS/LAPACK/other netlib routines that rely on optimized machine constant code (e.g., r1mach.f, d1mach.f, etc. from netlib). Is there existing code for CUDA Fortran that allows me to optimize my routines for the GPU architecture with machine constants?

Thanks.

Thanks for the great example. We hadn't expected much use of character data on the GPU and didn't have a good suite of internal tests for it. Yes, these are compiler problems, which unfortunately are getting fixed too late for the August build. They should be ready for the September build, however.
Regarding your question about the machine constants (r1mach, etc.): we haven't created these for the GPUs, nor have we seen them elsewhere. However, since the NVIDIA cards implement IEEE floating point, the values should be the same as you would get from any host IEEE implementation.

Hi gv1579,

There are a number of compiler issues here.

The first one is caused by the declaration of "LOGICAL MYLSAME" within SGEMV. The compiler tries to resolve this to an external host routine instead of the module's device routine. I've reported this error as TPR#17168. The workaround is to remove the function declaration from SGEMV.

The second is the use of character constants. It looks like we're generating incorrect GPU code for these. I've added TPR#17166 for this issue. There is no workaround other than to avoid character constants.

Third, we'll generate argument type mismatches unless your device function variables are declared using F90 syntax (i.e., "CHARACTER :: CA, CB" instead of "CHARACTER CA, CB"). Reported as TPR#17167. The workaround is to use F90 syntax, as in the sketch below.
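With the first and third workarounds applied, the declarations would look something like this (just a sketch; since MYLSAME is a module procedure, SGEMV already sees it through host association, so no declaration for it is needed):

attributes(device) LOGICAL FUNCTION MYLSAME( CA, CB )
! F90-style declaration avoids the argument type mismatch (TPR#17167)
CHARACTER :: CA, CB
MYLSAME = CA.EQ.CB
END FUNCTION MYLSAME

attributes(device) SUBROUTINE SGEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
CHARACTER :: TRANS
INTEGER :: INCX, INCY, LDA, M, N
REAL :: ALPHA, BETA
REAL :: A( LDA, * ), X( * ), Y( * )
! Note: no "LOGICAL MYLSAME" declaration here (works around TPR#17168)
...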

We do appreciate it when users try unusual code, since it exposes problems. However, I'm wondering what your goal is with this code?

If you're looking for a GPU-enabled BLAS library, you should be able to use NVIDIA's CUBLAS library. It doesn't include SGEMV, but you should be able to use SGEMM instead. http://developer.nvidia.com/object/cuda_3_1_downloads.html. I don't think they have a mach file, but please ask NVIDIA to be sure.

Thanks,
Mat

Thank you both for your comments.

In regard to the internal declaration of LSAME in SGEMV: I apologize, I am porting some Fortran 77 simulation code to the GPU and missed removing the declaration. Ditto for the character declarations.

This code is trimmed out of a larger physical chemistry simulation (I'm porting it from Fortran 77 to CUDA Fortran as a summer project at my university). It is computationally intense and includes BLAS calls. I am aware of the CUBLAS library, but I cannot make use of it because the calls to BLAS in the simulation code are made from device subprograms rather than host subprograms. To the best of my understanding, CUBLAS provides an interface (i.e., Fortran bindings) for Fortran programs running on the host to launch a kernel that executes a BLAS routine on the device. The simulation code I am porting has many nested levels of subprogram calls (for which I have made much use of the attributes(device) qualifier), and I don't think I can call CUBLAS to launch another kernel from inside the currently executing kernel. So, unless there is another workaround, I am left with inserting BLAS code directly into my existing code.

And thanks for the information about the GPU using the IEEE standard. I could manually hack the machine constant code (i.e., r1mach.f, etc.) to return only the IEEE values. That seems to be the easiest workaround until someone creates a GPU version of the machine constant code.
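For instance, a device-side r1mach might reduce to something like this (my own sketch, not existing library code; it assumes the numeric inquiry intrinsics are usable in device code here, and the same IEEE single-precision values could be hard-coded as parameters if they are not):

attributes(device) REAL FUNCTION R1MACH( I )
INTEGER :: I
! IEEE single precision, following the standard r1mach definitions
SELECT CASE ( I )
CASE ( 1 ); R1MACH = TINY( 1.0 )           ! smallest positive normal number
CASE ( 2 ); R1MACH = HUGE( 1.0 )           ! largest finite number
CASE ( 3 ); R1MACH = EPSILON( 1.0 ) / 2.0  ! smallest relative spacing, 2**(-24)
CASE ( 4 ); R1MACH = EPSILON( 1.0 )        ! largest relative spacing, 2**(-23)
CASE ( 5 ); R1MACH = 0.3010299957          ! log10(2)
CASE DEFAULT; R1MACH = 0.0                 ! invalid index
END SELECT
END FUNCTION R1MACH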

In regard to the internal declaration of LSAME in SGEMV: I apologize, I am porting some Fortran 77 simulation code to the GPU and missed removing the declaration. Ditto for the character declarations.

No apology needed. The compiler should have handled this.

I am left with inserting BLAS code directly into my existing code

You are correct that CUBLAS can only be called from host code, so it won't work when called from a kernel, but I'm wondering if calling SGEMV from a kernel is necessary.

CUDA kernels are serial code that gets duplicated thousands (even millions) of times and run in parallel. A kernel will have some local variables and typically works on a small part of a global data set. If you're trying to run SGEMV over a global matrix and vector, you'll encounter performance, and possibly data synchronization, problems.

If you need to call it with small local arrays or small portions of global arrays, then you're most likely better off adding a simple matrix-vector multiply directly in your code, along the lines of the sketch below. This would remove SGEMV's overhead and may allow you to make better use of caching.
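As a rough illustration (the name and explicit-shape interface are my own, not from any library), each thread would just do its small multiply serially:

attributes(device) subroutine small_gemv( m, n, alpha, A, x, beta, y )
integer :: m, n, i, j
real :: alpha, beta, temp
real :: A(m,n), x(n), y(m)
! y = alpha*A*x + beta*y, computed serially by the calling thread
do i = 1, m
   temp = 0.0
   do j = 1, n
      temp = temp + A(i,j) * x(j)
   end do
   y(i) = alpha*temp + beta*y(i)
end do
end subroutine small_gemv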

Mat

gv1579,

TPR 17167 - "CUDA function arguments mismatch unless declared with '::'"
has been corrected in the current 11.0 release. Thanks for your report.


dave