call thrust::min_element function from cuda fortran

Hello there!

I’m trying to figure out how I should describe the Thrust function “min_element” in the interface section:

cmin.cu:

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

extern "C" {
//Min for float arrays
void min_float_wrapper( float *data, int N)
{

thrust::device_ptr <float> dev_ptr(data);
thrust::min_element(dev_ptr, dev_ptr+N);
}
}

kernel.cuf:

interface thrustmin

    subroutine min_float(input, N) bind(C, name="min_float_wrapper")
        use iso_c_binding
        real(c_float), device :: input(*)
        integer(c_int), value :: N
    end subroutine

end interface

But of course this is not enough, because thrust::min_element returns a pointer (an iterator) to the minimum element, like this:

#include <thrust/extrema.h>
...
int data[6] = {1, 0, 2, 2, 1, 3};
int *result = thrust::min_element(data, data + 6);
// result is data + 1
// *result is 0

I’m using http://cudamusing.blogspot.com/2011/06/calling-thrust-from-cuda-fortran.html as a reference, but there is no example of how to convert a C++ pointer to a Fortran pointer. I’ve read the PGI Fortran Reference manual, page 371, “Interoperability with C”.
It says:

c_f_pointer: A subroutine that assigns the C pointer target to the Fortran pointer, and specifies its shape.
Syntax
c_f_pointer (cptr, fptr [,shape] )

But I’m still confused about the syntax in the interface section. Could you show a simple example of how to make this work?
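To show what I mean, here is my rough understanding of the c_f_pointer pattern for ordinary host data (get_min here is a made-up C function that returns a float*; I don’t know yet how this carries over to device pointers):

program cfp_example
    use iso_c_binding
    implicit none
    interface
        ! hypothetical C function: float* get_min(float *arr, int n);
        function get_min(arr, n) bind(C, name="get_min") result(cp)
            use iso_c_binding
            type(c_ptr) :: cp
            real(c_float) :: arr(*)
            integer(c_int), value :: n
        end function
    end interface
    real(c_float), target :: a(6)
    real(c_float), pointer :: fmin
    type(c_ptr) :: cp

    a = (/ 1.0, 0.0, 2.0, 2.0, 1.0, 3.0 /)
    cp = get_min(a, 6)
    ! convert the C pointer to a Fortran pointer; scalar target, so no shape argument
    call c_f_pointer(cp, fmin)
    print *, 'min = ', fmin
end program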

My current sources are here:
http://paste.pocoo.org/show/579214/ (cmin.cu)
http://pastebin.com/PwWAYDWs (kernel.cuf)

I’ve tried to do some pointer arithmetic (in the wrapper function) to return just the index of the minimum element instead of a C pointer, but I can’t find a good explanation of the Thrust class hierarchy (or simply of how to convert a device_ptr to an integer).
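What I was aiming for in the wrapper is something like this (just a sketch of the idea; min_float_index_wrapper is my own made-up name, and I’m not sure it’s the idiomatic Thrust way):

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

extern "C" {
//Returns the zero-based index of the minimum element instead of a pointer
int min_float_index_wrapper( float *data, int N)
{
	thrust::device_ptr<float> dev_ptr(data);
	thrust::device_ptr<float> min_ptr = thrust::min_element(dev_ptr, dev_ptr + N);
	// device_ptr supports iterator arithmetic, so the difference is the index
	return (int)(min_ptr - dev_ptr);
}
}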

It should compile ok on Windows with:
nvcc -m32 -c -arch sm_13 cmin.cu --cl-version 2008 -o a.obj (Microsoft Visual Studio 2010 Command Prompt)
and then:
pgf90 -V -m32 kernel.cuf a.obj (PGI WorkStation 12.3)

But when I launch it, this error appears: http://i.imgur.com/NZoGF.png

It seems the error occurs after this line (in cmin.cu):
thrust::min_element(dev_ptr, dev_ptr+5);

Thank you for your hard work!

Okay, somehow I’ve managed to make it work.
http://pastebin.com/1zwzLREr - Fortran code
http://pastebin.com/XpRGTfhM - C++ wrap functions code

Though, not everything seems right. When I compile cmin.cu with

nvcc -m32 -c -arch sm_10 cmin.cu --cl-version 2008 -o a.obj

It works as expected (this is the output: http://pastebin.com/rHEBArZH). Then, if I want to run the program with doubles, I have to recompile as follows:

nvcc -m32 -c -arch sm_13 cmin.cu --cl-version 2008 -o a.obj

Then I get this problem when executing: http://i.imgur.com/NZoGF.png

I always compile the kernel with this line:

pgf90 -m32 kernel.cuf a.obj -Wl,/merge:.nv_ftab=.data

Is this expected behaviour for my system? I have an NVIDIA GeForce 9600GT and Windows XP SP3. If you have a graphics card with compute capability 1.3 or higher, I would like to hear whether this works, because I can’t test it right now.

PGI Workstation version is 12.3

Hi brute11k,

I’m glad you were able to get things working.

Is this expected behaviour for my system? I have an NVIDIA GeForce 9600GT and Windows XP SP3.

According to the all-knowing Wikipedia, the GeForce 9600GT is a CC 1.0 card (see: http://en.wikipedia.org/wiki/CUDA#Supported_GPUs), so it is not expected to work with double precision.

If you have a graphics card with compute capability 1.3 or higher, I would like to hear whether this works, because I can’t test it right now.

I haven’t gotten a chance to install thrust on my system yet, but will give your program a try once I do.

  • Mat

Hi brute11k,

I tested your code using a C2050 on 64-bit Linux using CUDA 4.1 and PGI 12.3. Seems to work fine. My only suggestion would be to adjust the return value by one. Right now it’s returning the zero-based index (C/C++) but should be one-based (Fortran).

% nvcc -m64 -c -arch sm_20 cmin.cu -o cmin.o
% pgfortran -V12.3 -Mcuda=4.1 kernel.cuf cmin.o
kernel.cuf:
% a.out
 host_array:
   1.8801760E-02   0.6125335       0.5604655       0.7881591     
   0.6007245       0.7127076       0.5583790       0.8871369     
   0.6943535       0.3582030       0.2493081       0.6643550     
   0.3236073       0.7110850       0.1861875       0.8814021     
   0.2928087       0.3424091       0.3236184       0.7466530    
 COPY!
 START CUDA! sort:
 START CUDA! min:
 DONE  CUDA!
 Min:            0
% a.out
 host_array:
   0.8871369       0.6943535       0.3582030       0.2493081     
   0.6643550       0.3236073       0.7110850       0.1861875     
   0.8814021       0.2928087       0.3424091       0.3236184     
   0.7466530       0.6695611       0.8935332       5.5116676E-02 
   0.8819974       0.6337899       0.3639146       0.2517362    
 COPY!
 START CUDA! sort:
 START CUDA! min:
 DONE  CUDA!
 Min:           15
  • Mat

Thank you very much, Mat! You’ve really helped me a lot throughout this thread.
Well, about zero-based indexes: I’m flattening (“unrolling”) my arrays into 1D to avoid writing separate functions for each dimensionality (I’m testing 1D and 2D cases now, and will soon test 3D).

For example:

real :: array3D(n,m,k), array1D(n*m*k)
array3D(i, j, l) = array1D((i-1)*m*k + (j-1)*k + l)

So, the function looks like this:

attributes(global) subroutine cuda_calculate_fluidhalf(in, innext, outhalf)
    type(Fluids), device :: in, outhalf, innext
    integer :: idx

    ! calculate q(n+1/2):
    idx = M * K * (threadIdx%z + (blockIdx%z - 1) * blockDim%z - 1) &
        +     K * (threadIdx%y + (blockIdx%y - 1) * blockDim%y - 1) &
        +         (threadIdx%x + (blockIdx%x - 1) * blockDim%x - 1) + 1

    outhalf%density(idx) = ( in%density(idx) + innext%density(idx) ) / 2.0
    outhalf%energy(idx)  = ( in%energy(idx)  + innext%energy(idx)  ) / 2.0
    outhalf%r(idx,:)     = ( in%r(idx,:)     + innext%r(idx,:)     ) / 2.0
    outhalf%U(idx,:)     = ( in%U(idx,:)     + innext%U(idx,:)     ) / 2.0

end subroutine cuda_calculate_fluidhalf

Fluids is a structure of arrays:

integer, parameter :: dimen = 1
integer, parameter :: grid  = 384
type Fluids
	real(kind=prk) :: density(grid**dimen), energy(grid**dimen) 
	real(kind=prk) :: U(grid**dimen,dimen) !U_x(:), U_y(:), U_z(:) ! speed components
	real(kind=prk) :: r(grid**dimen,dimen) !x(:), y(:), z(:)
end type

Well, if I try to rewrite my current “universal” code to something like this:

	type Fluids3D
		real(kind=prk) :: density(N, M, K), energy(N, M, K)
		real(kind=prk) :: U(N, M, K, dimen)
		real(kind=prk) :: r(N, M, K, dimen)
	end type

attributes(global) subroutine cuda_calculate_fluidhalf(in, innext, outhalf)
    type(Fluids3D), device :: in, outhalf, innext
    integer :: i, j, k

    ! calculate q(n+1/2):
    i = threadIdx%z + (blockIdx%z - 1) * blockDim%z
    j = threadIdx%y + (blockIdx%y - 1) * blockDim%y
    k = threadIdx%x + (blockIdx%x - 1) * blockDim%x

    outhalf%density(i, j, k) = ( in%density(i, j, k) + innext%density(i, j, k) ) / 2.0
    outhalf%energy(i, j, k)  = ( in%energy(i, j, k)  + innext%energy(i, j, k)  ) / 2.0
    outhalf%r(i, j, k, :)    = ( in%r(i, j, k, :)    + innext%r(i, j, k, :)    ) / 2.0
    outhalf%U(i, j, k, :)    = ( in%U(i, j, k, :)    + innext%U(i, j, k, :)    ) / 2.0

end subroutine cuda_calculate_fluidhalf

You’ll see that this is rather ugly code, because every time we need all three variables “i, j, k” to access a value. The “idx” version looks more compact, is faster (well, not really that much faster, but I can give you the sources so you can test it yourself: http://pastebin.com/JY9nUpHJ), and saves memory. Moreover, it lets me switch between test cases a lot faster.

Thank you for your time!

Hi brute11k,

Well, about zero-based indexes: I’m flattening (“unrolling”) my arrays into 1D to avoid writing separate functions for each…

My comment about the indices concerns the C-to-Fortran conversion. Without it, you’ll be off by one. You just need to either add one to the returned index, or declare your Fortran arrays with a zero lower bound.
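For example (just a sketch; “min_float_index” stands in for whatever name your wrapper interface uses):

! Option 1: keep default one-based arrays and adjust the returned index
real, device :: a_dev(n)
idx = min_float_index(a_dev, n) + 1   ! C returns zero-based, Fortran wants one-based

! Option 2: declare the array with a zero lower bound so the C index is directly usable
real, device :: b_dev(0:n-1)
idx = min_float_index(b_dev, n)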

You’ll see that this is rather ugly code, because every time we need all three variables “i, j, k” to access a value. The “idx” version looks more compact, is faster…

Either method will work so long as the threads in your blocks access memory in contiguous chunks. Personally, I prefer using explicit dimensions since it’s easier to keep track of the indexing. Fortran is column-major, meaning that the contiguous memory segment is the first dimension. So if “i” is your first dimension, you should be striding across it with your thread index, not across “k”. This is the opposite of C, which is row-major.
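For example, mapping threadIdx%x to the first array dimension keeps each thread block walking down contiguous memory (a sketch based on your kernel):

! threadIdx%x varies fastest within a block, so map it to the first
! (contiguous) dimension of a column-major Fortran array
i = threadIdx%x + (blockIdx%x - 1) * blockDim%x   ! stride-1 accesses
j = threadIdx%y + (blockIdx%y - 1) * blockDim%y
k = threadIdx%z + (blockIdx%z - 1) * blockDim%z

outhalf%density(i, j, k) = ( in%density(i, j, k) + innext%density(i, j, k) ) / 2.0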

Note, your program has a major issue that is preventing the kernels from launching. Take a look at your launch configuration. Can you tell what’s wrong? Also, especially during development, add error checking after your kernel launches to ensure your kernels are actually executing.

For example:

call calculate3D<<<blocks>>>(DIM3_dev3, DIM3_dev2, DIM3_dev1)
istat = cudaGetLastError()
if (istat .gt. 0) then
    print *, cudaGetErrorString(istat)
    STOP
end if
counter = counter+1
  • Mat