CUDA Fortran compiling problem: type mismatch error when compiling CUDA Fortran code

Praveen_PVS · November 30, 2009, 3:59am

Hello everyone,

I have been using CUDA C until now. My task is to port Monte Carlo software to the GPU. Since the code is in Fortran, I thought it would be nice to use CUDA Fortran, developed by PGI and NVIDIA.
I read the CUDA Fortran programming guide from PGI.
I started with the sample programs to get used to the language, but I get the same errors for all the programs I have tried so far.
I cannot figure out what the error is.
Please help me fix this problem; otherwise I will not be able to proceed with my project.

Here is the vector addition example I wrote. Please look at the code and tell me what the problem is.

1 module vectoraddition_module
2 use cudafor
3 integer, PARAMETER :: BLOCKSIZE = 16
4 contains
5
6 attributes(global) subroutine vectoraddition_kernel( DeviceVector_A, DeviceVector_B, Device_ResultVector, VectorElements )
7 real :: DeviceVector_A( VectorElements ), DeviceVector_B( VectorElements ), Device_ResultVector( VectorElements )
8 integer, value :: VectorElements
9 integer :: tidx, tidy, threadindex, threadcolumnindex, maximumthread
10
11 tidx = threadidx%x
12 tidy = threadidx%y
13 threadindex = tidx * BLOCKSIZE + tidy
14 maximumthread = BLOCKSIZE * BLOCKSIZE
15
16
17 if (threadindex < VectorElements) Device_ResultVector(threadindex) = DeviceVector_A(threadindex) + DeviceVector_B(threadindex)
19
20
21
22 call syncthreads()
23 end subroutine vectoraddition_kernel
24
25 attributes(host) subroutine vectoraddition( VectorA, VectorB, ResultVector, VectorElements )
26 real :: VectorA( VectorElements ), VectorB( VectorElements ), ResultVector( VectorElements )
27 integer, value :: VectorElements
28 real, allocatable, device :: DeviceVectorA(:), DeviceVectorB(:), DeviceResultVector(:)
29 type(dim3) :: dimGrid, dimBlock
30 integer :: istat
31
32 istat = cudaMalloc( DeviceVectorA, VectorElements )
33 istat = cudaMalloc( DeviceVectorB, VectorElements )
34 istat = cudaMalloc( DeviceResultVector, VectorElements )
35
36 istat = cudaMemcpy( DeviceVectorA, VectorA, VectorElements, cudaMemcpyHostToDevice )
37 istat = cudaMemcpy( DeviceVectorB, VectorB, VectorElements, cudaMemcpyHostToDevice )
38
39 dimGrid = dim3( 1, 1, 1 )
40 dimblock = dim3( BLOCKSIZE, BLOCKSIZE, 1 )
41
42 call vectoraddition_kernel<<<dimGrid, dimBlock>>>( DeviceVectorA, DeviceVectorB, DeviceResultVector, VectorElements )
43
44 istat = cudaMemcpy(ResultVector, DeviceResultVector, VectorElements, cudaMemcpyDeviceToHost)
45
46 istat = cudaFree(DeviceVectorA)
47 istat = cudaFree(DeviceVectorB)
48 istat = cudaFree(DeviceResultVector)
49 end subroutine vectoraddition
50
51 end module vectoraddition_module
52
53 program vectoraddition_program
54 use vectoraddition_module
55 IMPLICIT NONE
56 real, allocatable :: VectorA(:), VectorB(:), ResultVector(:)
57 integer :: VectorElements, index
58
59 VectorElements = BLOCKSIZE * BLOCKSIZE
60
61 allocate(VectorA(VectorElements))
62 allocate(VectorB(VectorElements))
63 allocate(ResultVector(VectorElements))
64
65 do index = 1,VectorElements
66 VectorA(Index) = 1.0
67 VectorB(index) = 1.0
68 ResultVector(index) = 0.0
69 end do
70
71 call vectoraddition(VectorA, VectorB, ResultVector, VectorElements)
72
73 do index = 1,VectorElements
74 write(*)ResultVector(index)
75 end do
76
77 deallocate(VectorA)
78 deallocate(VectorB)
79 deallocate(ResultVector)
80
81 end program vectoraddition_program

When I compile, I get these errors:

PGF90-S-0188-Argument number 1 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
PGF90-S-0188-Argument number 2 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
PGF90-S-0188-Argument number 3 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
0 inform, 0 warnings, 3 severes, 0 fatal for vectoraddition

I get the same mismatch errors for the other programs I have written.

Please tell me where I have gone wrong. This is very important to me.
I would be very grateful if you could guide me in fixing this problem.
Thank you

With love and Regards
Praveen

Sarnath · November 30, 2009, 5:27am

I have no idea about cuf… But I will just use my common sense to answer this…

The subroutine vectoraddition declares the first 3 arguments of the kernel as "real, allocatable, device ::".

However, in the kernel you just use "real"… Maybe the "device" part is needed?

Praveen_PVS · November 30, 2009, 5:35am

Hello sir

Thank you very much for your kind help.

In the vectoraddition subroutine, I have to declare the arrays as device variables so that they are known to be on the device, and they are allocatable because they are dynamically allocated.

But the kernel is device code anyway, so I don't think I have to declare them as device again there.

Sir, can you tell me who can help me with this?

I need this very badly.

Thank you once again.

Sarnath · November 30, 2009, 6:26am

I am not sure what the problem is… By the way, when you call "cudaMalloc", should you not multiply the vector size by "sizeof(REAL)"? Or does CUDA Fortran take care of that automatically?

By the way, I hope you are using the "cuda" compiler to compile that code…

Good Luck,

Praveen_PVS · November 30, 2009, 7:19am

CUDA Fortran takes care of that. We only have to specify how many elements we want to allocate.

I am using the pgfortran compiler, which is the compiler the documentation says to use.

Thanks

avidday · November 30, 2009, 8:32am

You should ask PGI about this - I doubt anyone here can help you.

cgorac · November 30, 2009, 2:35pm

@Praveen PVS: I just tried to compile your code on my PGI installation (PGI Workstation 10.0, under Slackware64 13.0), and it built properly; so maybe you should double-check your installation?

On the other hand, your kernel code is almost all wrong; please compare it with the example on page 9 of the CUDA Fortran Programming Guide to see what you need to change. Further, you may wish to skip cudaMalloc()/cudaFree() and similar calls in the host code - a nicety of CUDA Fortran is that all such operations can be accomplished through pure Fortran syntax (for example allocate()/deallocate() in this particular case).
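One concrete problem with the posted kernel is the index arithmetic: threadidx%x and threadidx%y start at 1 in CUDA Fortran, so tidx * BLOCKSIZE + tidy never produces the indices 1 to 16 and runs past element 256. The following host-only sketch in standard Fortran (no GPU needed) emulates one 16x16 block and prints the index range of the posted formula next to an assumed correction. The corrected formula and all names in the program are illustrative assumptions, not taken from the programming guide.

```fortran
! Host-only sketch in standard Fortran (no GPU required).
! It emulates the 1-based threadidx%x / threadidx%y values of one
! BLOCKSIZE x BLOCKSIZE thread block and records the range of linear
! indices produced by two formulas.
program index_mapping_check
  implicit none
  integer, parameter :: BLOCKSIZE = 16
  integer :: tidx, tidy, idx_posted, idx_fixed
  integer :: posted_min, posted_max, fixed_min, fixed_max

  posted_min = huge(0)
  posted_max = -huge(0)
  fixed_min = huge(0)
  fixed_max = -huge(0)

  do tidy = 1, BLOCKSIZE
     do tidx = 1, BLOCKSIZE
        ! Formula used in the posted kernel.
        idx_posted = tidx * BLOCKSIZE + tidy
        ! Assumed correction (hypothetical): subtract 1 before scaling,
        ! so the first thread maps to element 1.
        idx_fixed = (tidy - 1) * BLOCKSIZE + tidx

        posted_min = min(posted_min, idx_posted)
        posted_max = max(posted_max, idx_posted)
        fixed_min = min(fixed_min, idx_fixed)
        fixed_max = max(fixed_max, idx_fixed)
     end do
  end do

  print *, 'posted formula range: ', posted_min, posted_max
  print *, 'fixed formula range:  ', fixed_min, fixed_max
end program index_mapping_check
```

With BLOCKSIZE = 16 the posted formula spans 17 to 272, while the assumed correction spans 1 to 256, which matches a 256-element vector. The bounds test in the kernel would then also need to be "<=" rather than "<" so that the last element is included.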

Praveen_PVS · November 30, 2009, 3:28pm

I am using the NCSA cluster, which has this installation. The NCSA cluster admins confirmed that the PGI compiler with CUDA support is installed on the cluster.

As for the kernel, I have changed it now. I had made some changes to check something and had not changed them back.

Can you take a look and tell me if anything is still wrong?

Since I was using CUDA C, I like to use cudaMalloc and the other runtime routines. I am aware that, instead of cudaMemcpy, we can use normal assignment statements between device and host variables.

So, what should I do to fix this problem?

Please tell me.

Thanks a lot.

Praveen_PVS · November 30, 2009, 3:29pm

Thanks for your reply.

How do I approach these people? Can you guide me on this?

avidday · November 30, 2009, 3:57pm

Their user forums, perhaps?

cgorac · November 30, 2009, 4:01pm

In that case, you should first check which version of the PGI compilers is installed there ("pgfortran -V") - as mentioned above, the code seems to compile with version 10.0, so maybe you're working with an earlier version. Regarding the type mismatch error, I'd try adding the "device" qualifier to the declaration in line 7. Regarding contacting PGI support, you should check with the admins of the installation you're using: is this installation under some kind of support contract, etc. Alternatively, you could just register and post to the PGI user forum.

Side note: you may wish to edit your initial post, reformat the code, and put it under the code tag - it's not much fun, when trying to help, to have to employ, say, sed to remove line numbers, and then again to change references to smiley images back to colon-parenthesis sequences… Here is an example of code that works on my machine:

! -*- f90 -*-
module vector_add_module
  use cudafor

  integer, parameter :: BLOCK_SIZE = 64
  integer, parameter :: NUM_BLOCKS = 1

contains

  attributes(global) subroutine vector_add_kernel(n, ad, bd, cd)
    integer, value :: n
    real, dimension(n), device :: ad, bd, cd
    integer :: i

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
       cd(i) = ad(i) + bd(i)
    end if
  end subroutine vector_add_kernel

  subroutine vector_add(n, a, b, c)
    integer :: n
    real, dimension(n) :: a, b, c
    real, dimension(:), allocatable, device :: ad, bd, cd
    integer :: error

    allocate(ad(n), bd(n), cd(n))
    ad = a
    bd = b

    call vector_add_kernel<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(n, ad, bd, cd)
    error = cudaGetLastError()
    if (error /= 0) stop

    c = cd
    deallocate(ad, bd, cd)
  end subroutine vector_add

end module vector_add_module

program vector_add_program
  use vector_add_module

  integer :: n
  real, dimension(:), allocatable :: a, b, c
  integer :: i

  n = NUM_BLOCKS * BLOCK_SIZE
  allocate(a(n), b(n), c(n))
  a = 1
  b = 1

  call vector_add(n, a, b, c)

  do i = 1, n
     write (*, *) c(i)
  end do

  deallocate(a, b, c)
end program vector_add_program

My build command is as follows (after putting the code above into a file named foo.cuf; also note that my machine only has a card of compute capability 1.1, so I have to specify the architecture explicitly):

/opt/pgi/linux86-64/10.0/bin/pgfortran -o foo -Mcuda=cc11 foo.cuf

NAMD · December 1, 2009, 1:40am

Hi

I am also trying to run my Fortran Monte Carlo codes on the GPU. I am using a Fortran wrapper for CUDA C, as the CUDA Fortran documentation is not sufficient. I am also using the Lincoln cluster at NCSA, so I think we can share our code problems.

Praveen_PVS · December 1, 2009, 4:50am

Hello sir

Thanks a lot for your kind help. It started working after I declared the variables as device inside the kernel code. But I did not find this in the programming guide, and even the example given in its last chapter does not do that.

Anyhow, thanks a lot.

I would like to know the logic behind the grid size and block size in your code.

In CUDA C we have to use dimGrid and dimBlock, which can be one-, two-, or three-dimensional.

In CUDA Fortran we can also use dimGrid and dimBlock, which are of type dim3, but I think the usage is slightly different from CUDA C.

In your code, you have used integers instead of the dim3 type.

Can you explain clearly how to specify n blocks and m threads per block?

Thanks a lot once again.

Note: I am new to the forum, which is why my presentation skills are not so good. I did not even know how those smileys appeared in place of the parentheses. I am very sorry for the poor presentation.

cgorac · December 1, 2009, 8:42am

@vkumar6:
I guess it depends a lot on one's previous experience with Fortran (and CUDA). As for myself: I have lots of Fortran experience, and like Fortran a lot, so I find what PGI built to be very nice, and easy and natural to use - at the moment, it seems to me that I actually prefer to code kernels (as well as host code) in Fortran rather than in C (and especially C++). But the PGI tools, and the documentation, are admittedly not as mature as their NVIDIA counterparts; let's hope this will improve quickly.

@Praveen PVS:
Hmm, the programming guide is indeed missing the "device" qualifier; but as I mentioned earlier, your code compiles on my machine, with the latest version of the PGI tools installed, so it could be that the version of the tools you are using is older, and that the need for this qualifier has been removed in the meantime.

As for the grid size and block size in my code: the problem in question is one-dimensional, so I see no need for using more than one dimension for the block size - the kernel code is simpler this way. When you put an integer, let's say "N", instead of a dim3 for the block size (or the grid size) in the kernel subroutine call statement, it gets interpreted as dim3(N, 1, 1) - this is explained in section 3.5 of the programming guide. I've put 64 as the block size there, and this choice is pretty much arbitrary - in production code, one would have to take into account the number of registers used, the amount of shared memory used, etc. (the occupancy calculator is built exactly for this kind of calculation). As for the grid size, I've used one dimension only to keep the code simpler, but in real circumstances one would have to take very large vectors into account, together with the grid size limits along each dimension, so the grid would probably use more dimensions, and the kernel might even have to be launched in a loop for very, very large input vectors. But I posted the code just as an example for you to start with.
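As a rough illustration of launching "n blocks of m threads", here is a minimal CUDA Fortran sketch that spells the launch configuration out with explicit dim3 values and rounds the block count up, so that a vector length which is not a multiple of the block size is still covered. The names scale_kernel, launch_sketch, and threads_per_block are hypothetical and chosen only for this example; it assumes a CUDA-enabled pgfortran and a CUDA-capable GPU.

```fortran
! Sketch only: kernel and variable names are illustrative assumptions.
module launch_sketch
  use cudafor
  implicit none
contains
  attributes(global) subroutine scale_kernel(n, xd)
    integer, value :: n
    real, device :: xd(n)
    integer :: i
    ! 1-based global index for a one-dimensional grid of 1-D blocks.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Guard: the last block may contain threads beyond element n.
    if (i <= n) xd(i) = 2.0 * xd(i)
  end subroutine scale_kernel
end module launch_sketch

program launch_demo
  use cudafor
  use launch_sketch
  implicit none
  integer, parameter :: n = 1000, threads_per_block = 64
  real :: x(n)
  real, device, allocatable :: xd(:)
  type(dim3) :: dimGrid, dimBlock
  integer :: istat

  x = 1.0
  allocate(xd(n))
  xd = x                      ! host-to-device copy by assignment

  ! m threads per block, and enough blocks to cover n (ceiling division).
  dimBlock = dim3(threads_per_block, 1, 1)
  dimGrid = dim3((n + threads_per_block - 1) / threads_per_block, 1, 1)

  call scale_kernel<<<dimGrid, dimBlock>>>(n, xd)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) stop 'kernel launch failed'

  x = xd                      ! device-to-host copy by assignment
  deallocate(xd)
  print *, x(1), x(n)         ! each element should have been doubled
end program launch_demo
```

With n = 1000 and 64 threads per block, the ceiling division yields 16 blocks (1024 threads), and the "i <= n" guard keeps the 24 surplus threads from touching memory. Passing plain integers in the chevrons, as in the code posted earlier in the thread, is the shorthand for the same dim3(N, 1, 1) values.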
