Upgrading CUDA Toolkit from 4 to 5 with PGI 12.3 and VS 2010

Hi all,

I just installed CUDA Toolkit 5.0 today, but the PGI properties window in Visual Studio 2010 does not seem to pick up the update; it still shows CUDA 4.

Please help.

Dolf

Hi Dolf,

Because we have to make adjustments to use some of the CUDA components, we ship everything needed for GPU programming. Hence, our products are independent from the CUDA Toolkit that you install from NVIDIA. CUDA 5 support will be available in the PGI 2013 release.

  • Mat

Thanks Mat.

Is the following kernel structurally correct?
ngrad = 1001
nx = 306

!============================================
attributes (global) subroutine interp1_kernel(nx,ngrad,ref,temp,dn,pn)
!============================================
implicit none

integer, value :: nx,ngrad
integer :: i,ix
integer :: ixn(1001)
real(8) :: ref(nx),temp(ngrad),cox(1001),pn(ngrad),dn(ngrad)

ixn(1) = 1
ixn(ngrad) = nx-1
cox(1) = 1.d0
cox(ngrad) = 0.d0
i = (blockidx%x - 1) * blockDim%x + threadidx%x
!do i=1, nxn
if( i <= ngrad ) then
!do i=2,nxn-1
if( i >= 2 .AND. i <= ngrad-1) then
ix=ixn(i-1)
do while(.not.(ref(ix).le.temp(i) .and. ref(ix+1).gt.temp(i)))
ix=ix+1
enddo
ixn(i)=ix
!c interpolation coefficient
cox(i)=(ref(ix+1)-temp(i))/(ref(ix+1)-ref(ix))
!enddo
endif
!do i=1,nxn
pn(i)=cox(i)*dn(ixn(i))+(1.d0-cox(i))*dn(ixn(i)+1)
!enddo
endif

return
end


thanks,
Dolf

Hi Dolf,

Structurally I think you’re ok, but I do see some potential errors. I’m not sure if they’re just because you’re in the middle of porting, but I’ll point them out.

ixn is a local fixed-size array. Since most of the array is uninitialized, the expression “ix=ixn(i-1)” will give you a garbage value. Is this array supposed to be global or initialized locally?

Do cox and ixn need to be arrays at all? You only use a few elements from them. Why not make them scalars and save the memory?
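
A minimal sketch of the scalar version, under a few assumptions that are not in the original post: ref is sorted in ascending order, dn holds one value per ref point (so it is declared dn(nx) here rather than dn(ngrad)), and each thread searches from the start of ref because it cannot rely on a result from a neighbouring thread. The module name interp_mod is hypothetical.

```fortran
module interp_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine interp1_kernel(nx, ngrad, ref, temp, dn, pn)
    integer, value :: nx, ngrad
    ! Assumption: dn has one entry per ref point (original declares dn(ngrad))
    real(8) :: ref(nx), temp(ngrad), dn(nx), pn(ngrad)
    integer :: i, ix      ! ix replaces the local array ixn
    real(8) :: cox        ! scalar replaces the local array cox

    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i > ngrad) return

    if (i == 1) then
      ! Same end-point values the original assigns to ixn(1), cox(1)
      ix  = 1
      cox = 1.d0
    else if (i == ngrad) then
      ! Same end-point values the original assigns to ixn(ngrad), cox(ngrad)
      ix  = nx - 1
      cox = 0.d0
    else
      ! Find the interval with ref(ix) <= temp(i) < ref(ix+1).
      ! The loop bound keeps ix+1 inside ref even if no interval matches.
      ix = 1
      do while (ix < nx - 1)
        if (ref(ix) <= temp(i) .and. ref(ix+1) > temp(i)) exit
        ix = ix + 1
      end do
      ! Interpolation coefficient
      cox = (ref(ix+1) - temp(i)) / (ref(ix+1) - ref(ix))
    end if

    pn(i) = cox * dn(ix) + (1.d0 - cox) * dn(ix+1)
  end subroutine interp1_kernel
end module interp_mod
```

With a one-dimensional launch, the host side would call it with enough blocks to cover ngrad threads, for example (ngrad + 255) / 256 blocks of 256 threads. Searching from ix = 1 in every thread costs more than the serial version's running index, so a binary search may be worth considering if nx grows.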

  • Mat

Hi Mat,

Thanks for the info, I was able to fix this problem.
I have a new problem: when I run the code right after compiling, it runs perfectly. But when I run it a second time (with no changes), I get a NaN error (“Not a Number” in Fortran), which usually happens when you divide by zero.
I suspect this is due to me not deallocating the device matrices before the end of the program; is that correct? If so, how can I deallocate or clear all values in the memory before the end?
For the allocation, I used a subroutine to allocate all device matrices with the proper sizes. I created another subroutine to deallocate them, but it gave me the error below:

0: DEALLOCATE: memory at 0000000000000000 not allocated

What’s the best way to solve this issue?

thanks,
Dolf

I suspect this is due to me not deallocating the device matrices before the end of the program; is that correct?

Probably not. The memory will be freed once your program exits. More likely you have some uninitialised memory. The memory may happen to be zero the first time you run the program, but some other value the next.

0: DEALLOCATE: memory at 0000000000000000 not allocated

This means that the memory has already been deallocated or was never allocated in the first place.
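
A minimal sketch of both points, assuming a single device array with the hypothetical name pn_d and a hypothetical size of 1001: give the device memory a known value right after allocating it, and guard the deallocate with allocated() so that it is skipped when the array was never allocated or has already been freed.

```fortran
program init_and_free
  use cudafor
  implicit none
  ! pn_d is a hypothetical name for one of the device arrays
  real(8), device, allocatable :: pn_d(:)
  integer :: istat

  allocate(pn_d(1001), stat=istat)
  if (istat /= 0) stop 'device allocation failed'

  ! Freshly allocated device memory holds whatever was there before,
  ! so set it explicitly instead of relying on it being zero.
  pn_d = 0.d0

  ! ... launch kernels that read and write pn_d ...

  ! Deallocate only if the array is currently allocated; this avoids
  ! the "memory at 0000000000000000 not allocated" runtime error.
  if (allocated(pn_d)) deallocate(pn_d)
end program init_and_free
```

The same applies to any local array inside a kernel: every element that is read must have been written first.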

  • Mat

Probably not. The memory will be freed once your program exits. More likely you have some uninitialised memory. The memory may happen to be zero the first time you run the program, but some other value the next.

How can I overcome this problem?

thanks,
Dolf

Probably the best way is to compile in emulation mode with floating point trapping enabled (-Mcuda=emu -g -Ktrap=fp). If you’re correct that it’s a divide by zero, this should pinpoint the place where it occurs. Otherwise, run the same binary in the debugger, PGDBG, and see if you can determine at what point the NaNs start occurring. From there, backtrack until you find the cause.

  • Mat

Probably the best way is to compile in emulation mode with floating point trapping enabled (-Mcuda=emu -g -Ktrap=fp). If you’re correct that it’s a divide by zero, this should pinpoint the place where it occurs.

I did compile the code in debug mode using the above flags (-Mcuda=emu -g -Ktrap=fp). It takes longer to compute the results, and partway through it gave me the following error (PGI debug engine):
Process 0: signalled FLT_INVALID_OPERATION at 0x4019ca1d, function _fmth_i_dlog, file 418, line 0

This does not make a lot of sense to me. I used to get past this point with no errors when compiling without emulation mode.

Can you please explain? I am running out of ideas now.
Dolf

Think of “FLT_INVALID_OPERATION” as a Windows catch all for a non-specific floating point exception. Can you track down which “log” call is throwing this exception?

It may be an acceptable exception (like an underflow) in your code that can be ignored. However, you should look at what value is being passed to log and how the results are being used.

  • Mat

Can you track down which “log” call is throwing this exception?

How can I look into the log?

you should look at what value is being passed to log and how the results are being used.

??

thanks
Dolf

_fmth_i_dlog

This is a call to the double-precision logarithm (log) function. You need to determine which call to the log function in your code is getting the floating point exception.
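
One way to narrow it down, sketched here with a hypothetical variable x standing in for whatever your code passes to log, is to check the argument just before each suspect call and print it when it is not positive. The value -1.0d0 is only a made-up bad input for illustration.

```fortran
program check_log_arg
  implicit none
  real(8) :: x, y

  x = -1.0d0   ! hypothetical bad input; in real code this is your own expression

  if (x <= 0.d0) then
    ! log is only defined for positive arguments, so report the value
    ! instead of calling log and trapping inside the runtime.
    print *, 'log argument is not positive: ', x
  else
    y = log(x)
    print *, 'log(x) = ', y
  end if
end program check_log_arg
```

Putting a distinct message at each log call in the suspect subroutine shows which call receives the bad value, and from there you can trace where that value was computed.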

  • Mat

Hi Mat,

For some reason I can’t set a breakpoint in one of the subroutines. I strongly suspect that something is going wrong there, since everything goes well if I comment it out.
Do you have any idea why the PGDBG engine can’t reach this subroutine? It just freezes before reaching it.

Dolf

Hi Dolf,

Since “_fmth_i_dlog” is a name internal to the PGI runtime, you can’t break on it. Instead, you need to break at the line in your code containing the call to “log”. If this is what you are trying to do, do you have optimization enabled? Optimization can move code around. Try building with only “-g” enabled.

  • Mat

Hi Mat.

A hardware team from NVIDIA invited me to test drive their latest Tesla K20m cards on one of their Windows 7 machines.
I need your help with how to tune the compiler options and/or the block and grid dimensions in my code to get the full speed improvement.
Remember that I have a GeForce 460 v2 card, and I am using (as you suggested) a block size of (16,16,1).

many thanks.
Dolf