CUDA Fortran compiling problem: type mismatch error when compiling CUDA Fortran code

Hello everyone,

I was using CUDA C till now. My task is to port Monte Carlo software onto the GPU. Since the code is in Fortran, I thought it would be good to use CUDA Fortran, developed by PGI and NVIDIA.
I read the CUDA Fortran programming guide from the PGI group.
I started with the sample programs to get used to the language, but I am getting the same errors for all the programs I have tried so far.
I am not able to figure out what the error is.
Please help me fix this problem; otherwise I will not be able to proceed further in my project.

I am giving the vector addition example which I have written; please look into the code and tell me what the problem is.

1  module vectoraddition_module
2     use cudafor
3     integer, parameter :: BLOCKSIZE = 16
4  contains
6     attributes(global) subroutine vectoraddition_kernel( DeviceVector_A, DeviceVector_B, Device_ResultVector, VectorElements )
7        real :: DeviceVector_A( VectorElements ), DeviceVector_B( VectorElements ), Device_ResultVector( VectorElements )
8        integer, value :: VectorElements
9        integer :: tidx, tidy, threadindex, threadcolumnindex, maximumthread
11       tidx = threadidx%x
12       tidy = threadidx%y
13       threadindex = tidx * BLOCKSIZE + tidy
14       maximumthread = BLOCKSIZE * BLOCKSIZE
17       if (threadindex < VectorElements) Device_ResultVector(threadindex) = DeviceVector_A(threadindex) + DeviceVector_B(threadindex)
22       call syncthreads()
23    end subroutine vectoraddition_kernel
25    attributes(host) subroutine vectoraddition( VectorA, VectorB, ResultVector, VectorElements )
26       real :: VectorA( VectorElements ), VectorB( VectorElements ), ResultVector( VectorElements )
27       integer, value :: VectorElements
28       real, allocatable, device :: DeviceVectorA(:), DeviceVectorB(:), DeviceResultVector(:)
29       type(dim3) :: dimGrid, dimBlock
30       integer :: istat
32       istat = cudaMalloc( DeviceVectorA, VectorElements )
33       istat = cudaMalloc( DeviceVectorB, VectorElements )
34       istat = cudaMalloc( DeviceResultVector, VectorElements )
36       istat = cudaMemcpy( DeviceVectorA, VectorA, VectorElements, cudaMemcpyHostToDevice )
37       istat = cudaMemcpy( DeviceVectorB, VectorB, VectorElements, cudaMemcpyHostToDevice )
39       dimGrid = dim3( 1, 1, 1 )
40       dimBlock = dim3( BLOCKSIZE, BLOCKSIZE, 1 )
42       call vectoraddition_kernel<<<dimGrid, dimBlock>>>( DeviceVectorA, DeviceVectorB, DeviceResultVector, VectorElements )
44       istat = cudaMemcpy( ResultVector, DeviceResultVector, VectorElements, cudaMemcpyDeviceToHost )
46       istat = cudaFree( DeviceVectorA )
47       istat = cudaFree( DeviceVectorB )
48       istat = cudaFree( DeviceResultVector )
49    end subroutine vectoraddition
51 end module vectoraddition_module
53 program vectoraddition_program
54    use vectoraddition_module
56    real, allocatable :: VectorA(:), VectorB(:), ResultVector(:)
57    integer :: VectorElements, index
59    VectorElements = BLOCKSIZE * BLOCKSIZE
61    allocate( VectorA(VectorElements) )
62    allocate( VectorB(VectorElements) )
63    allocate( ResultVector(VectorElements) )
65    do index = 1, VectorElements
66       VectorA(index) = 1.0
67       VectorB(index) = 1.0
68       ResultVector(index) = 0.0
69    end do
71    call vectoraddition( VectorA, VectorB, ResultVector, VectorElements )
73    do index = 1, VectorElements
74       write(*,*) ResultVector(index)
75    end do
77    deallocate( VectorA )
78    deallocate( VectorB )
79    deallocate( ResultVector )
81 end program vectoraddition_program

When I compile, I get these errors:

PGF90-S-0188-Argument number 1 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
PGF90-S-0188-Argument number 2 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
PGF90-S-0188-Argument number 3 to vectoraddition_kernel: type mismatch (VectorAdditionCudaFortran.cuf: 42)
0 inform, 0 warnings, 3 severes, 0 fatal for vectoraddition

The same mismatch errors appear in the other programs I have written.

Please tell me where I have gone wrong. This is very important to me.
I will be very grateful if you guide me in fixing this problem.
Thank you

With love and Regards

I have no idea about CUDA Fortran… but I will just use my common sense to answer this…

The subroutine vectoraddition declares the first 3 arguments of the kernel as "real, allocatable, device".

However, in the kernel itself you use just "real"… Maybe the "device" part is needed???

Hello sir

Thank you very much for your kind help.

In the vector addition subroutine, I have to declare them as device variables so that they are known to be device variables, and the allocation is dynamic, which is why they are allocatable.

But the kernel definition is device code anyhow, so I don't think I have to declare them as device again there.

Sir, can you tell me who can help me out with this?

I need this very badly.

Thank you once again.

I am not sure what the problem is… By the way, when you call cudaMalloc, shouldn't you multiply the vector size by sizeof(real)? Or does CUDA Fortran automatically take care of that?

By the way, I hope you are using a CUDA-enabled compiler to compile that code…

Good Luck,

CUDA Fortran takes care of that; we only have to specify how many elements we want to allocate.

I am using the pgfortran compiler to do this.

The documentation says to use the pgfortran compiler.
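To illustrate the difference, here is a minimal sketch (assuming the cudafor module's cudaMalloc interface; d_a and n are placeholder names): in CUDA Fortran the runtime routines are overloaded to take element counts, unlike CUDA C, which takes byte counts.

```fortran
! Sketch: device allocation in CUDA Fortran takes ELEMENT counts,
! unlike CUDA C's cudaMalloc, which takes a byte count.
real, allocatable, device :: d_a(:)   ! placeholder device array
integer :: istat, n

n = 1024
istat = cudaMalloc( d_a, n )   ! n elements, NOT n * sizeof(real)
istat = cudaFree( d_a )

allocate( d_a(n) )             ! idiomatic alternative: plain Fortran allocate
deallocate( d_a )
```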


You should ask PGI about this - I doubt anyone here can help you.

@Praveen PVS: I just tried to compile your code on my PGI installation (PGI Workstation 10.0, under Slackware64 13.0), and it built properly; so maybe you should double-check your installation?

On the other hand, your kernel code is almost all wrong; please compare it with the example on page 9 of the CUDA Fortran Programming Guide to see what you need to change. Further, you may wish to skip cudaMalloc()/cudaFree() and similar calls in the host code - a nicety of CUDA Fortran is that all such operations can be accomplished through pure Fortran syntax (for example, allocate()/deallocate() in this particular case).

I am using the NCSA cluster, which has this installation. The NCSA cluster admins confirmed that a CUDA-enabled PGI compiler is installed on the cluster.

Coming to the kernel, I have made changes to it now. I made some changes while checking something, but I did not change them back afterwards.

Can you look once more and tell me if anything is still wrong?

Since I was using CUDA C, I like to use cudaMalloc and the other runtime routines. I am aware that instead of cudaMemcpy we can use normal assignment statements between device and host variables.

So, what should I do to fix this problem?

Please tell me.

Thanks a lot.

Thanks for your reply.

How do I approach that group of people? Can you guide me regarding this?

Their user forums, perhaps?

In that case, you should first check which version of the PGI compilers is installed there ("pgfortran -V") - as mentioned above, it seems the code compiles with version 10.0, so maybe you're working with some earlier version. Regarding the syntax error, I'd try adding the "device" qualifier to the declaration on line 7. Regarding contacting PGI support, you should check with the admins of the installation you're using: whether the installation is under some kind of support contract, etc. Alternatively, you could just register and post to the PGI user forum.

Side note: you may wish to edit your initial post, reformat the code, and put it under the code tag - it's not much fun, when trying to help, to have to employ, say, sed to remove line numbers, and then again to change references to smiley images back to colon-parenthesis sequences… Here is an example of code that works on my machine:

! -*- f90 -*-

module vector_add_module

  use cudafor

  integer, parameter :: BLOCK_SIZE = 64
  integer, parameter :: NUM_BLOCKS = 1

contains

  attributes(global) subroutine vector_add_kernel(n, ad, bd, cd)
    integer, value :: n
    real, dimension(n), device :: ad, bd, cd
    integer :: i

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
       cd(i) = ad(i) + bd(i)
    end if
  end subroutine vector_add_kernel

  subroutine vector_add(n, a, b, c)
    integer :: n
    real, dimension(n) :: a, b, c
    real, dimension(:), allocatable, device :: ad, bd, cd
    integer :: error

    allocate(ad(n), bd(n), cd(n))
    ad = a
    bd = b
    call vector_add_kernel<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(n, ad, bd, cd)
    error = cudaGetLastError()
    if (error /= 0) stop
    c = cd
    deallocate(ad, bd, cd)
  end subroutine vector_add

end module vector_add_module

program vector_add_program

  use vector_add_module

  integer :: n
  real, dimension(:), allocatable :: a, b, c
  integer :: i

  n = NUM_BLOCKS * BLOCK_SIZE   ! n must be set before the allocations below
  allocate(a(n), b(n), c(n))
  a = 1
  b = 1

  call vector_add(n, a, b, c)

  do i = 1, n
     write (*, *) c(i)
  end do

  deallocate(a, b, c)

end program vector_add_program

My build command is as follows (after putting the code above into a foo.cuf file; also notice that I'm on a machine with only a CUDA 1.1 capable card, so I have to specify the architecture explicitly):

/opt/pgi/linux86-64/10.0/bin/pgfortran -o foo -Mcuda=cc11 foo.cuf


I am also trying to run my Fortran Monte Carlo codes on the GPU. I am using a Fortran wrapper for CUDA C, as the CUDA Fortran documentation is not enough. I am also using the Lincoln cluster at NCSA, so I think we can share our code problems.

Hello sir

Thanks a lot for your kind help. It started working after declaring the variables as device inside the kernel code. But I did not find that in the programming guide, and even in the example given in the last chapter they did not do that.
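For reference, the fix described here amounts to adding the device attribute on line 7 of the code posted above (a sketch; the names are taken from that code):

```fortran
! line 7, with the device attribute added to the kernel's dummy arrays
real, device :: DeviceVector_A( VectorElements ), DeviceVector_B( VectorElements ), Device_ResultVector( VectorElements )
```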

Any how thanks a lot.

I would like to know the logic behind the grid size and block size in your code.

In CUDA C we have to use a grid dimension and a block dimension, which can be one-, two-, or three-dimensional.

In CUDA Fortran we can also use grid and block dimensions of type dim3, but I think the usage is slightly different from CUDA C.

In your code, you have used integers instead of the dim3 type.

Can you explain clearly how to specify n blocks and m threads per block?

Thanks a lot once again.

Note: I am new to the forum; that is why my presentation skills are not so good. I did not even know how those smileys appeared in place of the parentheses. I am very sorry for presenting it badly.

I guess it depends much on one’s previous experience with Fortran (and CUDA). As for myself: I have lots of Fortran experience, and like Fortran much, so I find what PGI built to be very nice, and easy and natural to use - at the moment, seems to me that I actually prefer to code kernels (as well as host code) in Fortran, than in C (and especially C++). But PGI tools, and the documentation, are admittedly not as mature as their NVIDIA counterparts; let’s hope this will improve quickly.

@Praveen PVS:
Hmm, indeed the programming guide is missing the "device" qualifier; but as I mentioned earlier, your code compiles on my machine with the latest version of the PGI tools installed, so it could be that the version of the tools you are using is older, and that the need for this qualifier was removed in the meantime.

As for the grid size and block size in my code: the problem in question is one-dimensional, so I see no need to use more than one dimension for the block size - the kernel code is simpler this way. When you put an integer, let's say "N", instead of a dim3 for the block size (or the grid size) in the kernel call statement, it gets interpreted as dim3(N, 1, 1) - this is explained in section 3.5 of the programming guide.

I put 64 as the block size there, and this choice is pretty much arbitrary - in production code, one would have to take into account the number of registers used, the amount of shared memory used, etc. (the occupancy calculator is built exactly for this kind of calculation).

As for the grid size, I used one dimension only to keep the code simple, but in real circumstances one would have to take very large vectors into account, together with the grid size limitations along each dimension, so the grid would probably use more dimensions, and the kernel might even have to be launched in a loop for very, very large input vectors. But I posted the code just as an example for you to start with.
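To make the integer-versus-dim3 point concrete, here is a sketch against the vector_add_kernel posted above (nblocks and nthreads are placeholder values): under the dim3(N, 1, 1) rule, the two launch statements below mean the same thing.

```fortran
type(dim3) :: grid, block
integer    :: nblocks, nthreads

nblocks  = 4
nthreads = 64

! Explicit dim3 form: nblocks blocks of nthreads threads, one-dimensional
grid  = dim3( nblocks, 1, 1 )
block = dim3( nthreads, 1, 1 )
call vector_add_kernel<<<grid, block>>>( n, ad, bd, cd )

! Shorthand: a plain integer N is interpreted as dim3(N, 1, 1)
call vector_add_kernel<<<nblocks, nthreads>>>( n, ad, bd, cd )
```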