GTX480 vs GTX295 Portability

When I use this makefile for my program, it runs perfectly on a GTX295 GPU, while on a GTX480 GPU (Fermi architecture) it compiles but fails at the first cudaMalloc when I try to run it. Does anybody have any idea why?
####### Compiler, tools and options
CC = gcc
LEX = flex
YACC = yacc
CFLAGS = -D__USE_FIXED_PROTOTYPES__ -O3 -Wall
LEXFLAGS =
YACCFLAGS= -d
LINK = gcc
LFLAGS =
LIBS = $(SUBLIBS) -lm
AR = ar cqs
RANLIB =
TAR = tar -cf
GZIP = gzip -9f
COPY = cp -f
COPY_FILE= $(COPY)
COPY_DIR = $(COPY) -r
INSTALL_FILE= $(COPY_FILE)
INSTALL_DIR = $(COPY_DIR)
DEL_FILE = rm -f
SYMLINK = ln -sf
DEL_DIR = rmdir
MOVE = mv -f
CHK_DIR_EXISTS= test -d
MKDIR = mkdir -p
####### CUDA options

# path of cuda

CUDAPATH = /usr/local/cuda

# path of cuda compiler

NVCC = nvcc

# nvcc flags

NVCC_FLAGS = -O3 -use_fast_math

CXX_FLAGS = -I$(CUDAPATH)/include/

# linking library

LD_FLAGS = -L$(CUDAPATH)/lib64/

# libraries necessary at the linking phase for device code

LD_LIBRARIES = -lcuda -lcudart

What CUDA version are you using?

On the Fermi machine (GTX480) I am using CUDA version 3.2, while on the GTX295 I am using CUDA version 2.3.

I have to add that when I pass --gpu-architecture sm_20 in the nvcc flags, it works. I know this option slows down the code because it enables double precision and other features, but my code does not need double precision. Without the --gpu-architecture sm_20 option my program fails at the first cudaMalloc. Is there any linking step missing?
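For reference, a minimal test that performs only that first cudaMalloc and prints the returned error string should at least show the exact error code; this is just a sketch (nothing project-specific is assumed, only the CUDA runtime):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *d_buf = NULL;

    /* the first CUDA runtime call - the point where the program fails */
    cudaError_t err = cudaMalloc(&d_buf, 1024 * 1024);

    if (err != cudaSuccess) {
        /* the message distinguishes e.g. a driver/runtime mismatch
           from a plain out-of-memory condition */
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("cudaMalloc succeeded\n");
    cudaFree(d_buf);
    return 0;
}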

Thank you in advance for your answers.

Hi,

as far as my experience goes, these types of “first malloc” crashes are usually driver/toolkit/linkage issues.
I would suggest:

  1. Make sure there are no old binaries in your obj/bin folder that you might accidentally be linking against.
  2. Make sure your program is loading the correct CUDA runtime library version (Modules window in VS; see the version-check sketch after this list).
  3. Install the latest GTX480 driver from the NVIDIA website (Official Drivers page).
  4. If your program uses textures, CUDA 3.2 throws a first-chance exception when loading cudart.lib (still no idea whether this is a bug or by design, but it also happens in the SDK examples). If your IDE (Visual Studio, for example) is set to break on handled exceptions, this will crash your program.
    You can work around this by disabling exception catching for cudaError (Debug -> Exceptions -> Add: cudaError).
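For points 2 and 3, a quick way to confirm which driver and runtime your program actually sees is to query both versions at startup. A small sketch, using only standard runtime API calls (nothing specific to your project is assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;

    /* version of the driver installed on the machine */
    cudaDriverGetVersion(&driverVersion);

    /* version of the cudart library the program actually loaded */
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("driver  : %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("runtime : %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    /* a runtime newer than the driver (e.g. cudart 3.2 on a pre-3.2 driver)
       makes every runtime call fail, including the first cudaMalloc */
    if (runtimeVersion > driverVersion)
        printf("runtime is newer than the driver - update the driver\n");

    return 0;
}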

good luck,
eldad.

[font=“Courier New”]--gpu-architecture sm_20[/font] or [font=“Courier New”]-arch sm_20[/font] does not slow down execution; it enables running the code at all (as you found out). Compute capability 2.x devices have a machine language that is completely different from that of compute capability 1.x devices. Trying to run code built for one on the other is like trying to run an x86-compiled program on an ARM CPU.

The difference gets hidden if you compile to PTX code using [font=“Courier New”]-arch compute_10[/font], because PTX for a virtual GPU architecture is compiled further into code for the real GPU at runtime. In the end, the code always has to be compiled for the specific GPU architecture it runs on.
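If you want to double-check what the runtime actually sees, the compute capability of the installed card can be queried directly. A minimal sketch using only cudaGetDeviceProperties (a GTX295 should report 1.3 and a GTX480 should report 2.0):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 */

    /* the binary (or embedded PTX) must match this compute capability,
       otherwise no code can be loaded for the device */
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    return 0;
}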

I thought there was a correlation with slowing down the code, because on the GTX295, when I compile with the nvcc flag --gpu-architecture sm_13, the program runs more slowly than when I do not use that flag.

But this is not the case with the GTX480: when I do not pass the --gpu-architecture sm_20 flag, it does not run at all. Does anybody know the reason?

My guess would be that you have a memory overrun in your code. The Fermi architecture is less forgiving of illegal memory accesses than the previous architectures were.

I would start by commenting out lines in the kernel to narrow down the bug, or install Nsight and debug the kernel with the memory checker option.
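While bisecting, it also helps to check the error status after every call and launch, otherwise an out-of-bounds access only surfaces at some later, unrelated API call. A rough sketch of the usual checking pattern (the kernel here is only a placeholder, not your code):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t e = (call);                                   \
        if (e != cudaSuccess) {                                   \
            printf("%s:%d: %s\n", __FILE__, __LINE__,             \
                   cudaGetErrorString(e));                        \
            exit(1);                                              \
        }                                                         \
    } while (0)

/* placeholder kernel standing in for the kernel being bisected */
__global__ void myKernel(float *data) { data[threadIdx.x] = 0.0f; }

int main(void)
{
    float *d_data;
    CUDA_CHECK(cudaMalloc((void **)&d_data, 256 * sizeof(float)));

    myKernel<<<1, 256>>>(d_data);
    CUDA_CHECK(cudaGetLastError());        /* errors at launch        */
    CUDA_CHECK(cudaThreadSynchronize());   /* errors during execution */

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}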

In this case it’s simpler than that: Ardisschool refuses to compile the code for compute capability 2.0 because that would enable double precision, which he does not want. So the code does not run at all.

Ardisschool, try using [font=“Courier New”]-arch compute_11[/font]. This should do the trick, as it avoids using any of the goodies introduced after compute capability 1.1 (including double precision), but gets just-in-time translated to code for the actual device you are running on.

In the longer term, you should however remove the unwanted uses of double precision from your kernel, so you can take advantage of the other new CUDA features introduced with compute capability 2.x.
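One thing to watch for once you do build with sm_20: unsuffixed constants such as 1.0 and math calls such as exp() are double precision in C. When compiling for compute capability 1.0/1.1 (the default without -arch) nvcc demotes them to single precision with a warning, but with sm_13 or sm_20 they really execute in double precision and cost performance, which may also be why your sm_13 build on the GTX295 ran more slowly than the default build. A sketch of the single-precision style (a hypothetical kernel, only to illustrate the suffixes):

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* 0.5f and expf() stay in single precision on sm_20;
           0.5 and exp() would promote the computation to double */
        out[i] = 0.5f * expf(in[i]);
    }
}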