sm_1x vs. compute_1x?


What’s the difference between these two options (the actual instances are compute_10 vs. sm_10 for my trusty GeForce 8800 GTX)? Is the sm bit reserved for the GeForces, and the compute bit for the Teslas?

Ultimately, I’m trying to build fatbins, and I’m getting errors that the nvcc documentation doesn’t really help track down.

For instance, this setting from my Makefile
CFLAGSCUDA := $(CFLAGSCUDA) -code sm_10,sm_11,sm_13,compute_10,compute_11,compute_13
should, according to my understanding, create a real fat fatbin and leave the choice to the driver at runtime, without any dynamic compilation.
However, I get:
nvcc fatal : Illegal code generation combination: real arch='sm_10', virtual code arch='compute_10'

Any ideas why this is happening? According to the docs, -arch defaults to sm_10, which is part of my list for fatbin creation. And indeed, if I add -arch sm_10 (or -arch compute_10) to the flags, everything runs smoothly, even on “better” devices. Which leads me to another question: I am currently running on devices with all of the compute capabilities above, in a heterogeneous environment. I could add the matching -arch flag to each build, but I’d rather not, since running MPI apps with a couple of different binaries is a nightmare. I could also add -arch sm_10 everywhere, but I’m unsure whether, combined with the above -code setting, that is the most efficient approach.
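For the record, this is the variant that does go through for me (same -code list as above, just with the baseline spelled out; the device list is simply what I happen to target, not a recommendation):

```shell
# Sketch of the working Makefile line: making the (supposed) default
# -arch explicit is what silences the "Illegal code generation
# combination" error for me.
CFLAGSCUDA := $(CFLAGSCUDA) -arch sm_10 \
              -code sm_10,sm_11,sm_13,compute_10,compute_11,compute_13
```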

I’m compiling with -cuda to have nvcc spit out .cu.c files that I pass to other host compilers. In other words, I’m using nvcc only for the kernels and the kernel launch code, because all of this is part of a larger app.

Thanks for input,


Wait, how are you making fatbins?? And you’re cutting nvcc’s compile process in half and then passing the result elsewhere? That’s unlikely to work; have you done enough reverse engineering to figure out the correct way?

Anyway, you should pass -arch compute_10 to specify the baseline feature set (i.e., no atomics, no doubles, etc.) and then -code […] to specify the actual versions of bytecode and machine code to bundle inside the executable. You could say -arch is used by the parser, -code by the emitter. Honestly, though, I don’t think you need to specify anything at all to get an executable that works everywhere.
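Schematically, the role split looks like this (the device list and file name are just examples):

```shell
# -arch: what the front end may assume while parsing your C
#        (the baseline feature set / virtual architecture).
# -code: what the emitter actually embeds in the fatbin
#        (any mix of PTX versions and real machine code).
nvcc -arch compute_10 \
     -code compute_10,sm_10,sm_11,sm_13 \
     -cuda kernels.cu
```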

EDIT: Oh, I guess the compiler thinks sm_10 > compute_10. I suppose the machine code could contain features not supported by the bytecode standard, but I doubt it. Probably a mistake on NVIDIA’s part (in any case, -arch should default to a bytecode, not a raw machine code). This whole area of nvcc is rarely exercised and isn’t well polished.

I don’t think I am cutting anything in half:

  1. top-level make: my MPI app with, e.g., the PGI compiler
  2. 2nd-level make: my CUDA lib (and other libs): host C compiler (e.g. PGI) for the C part plus <cuda_runtime.h>; nvcc -cuda for the kernels, host compiler for the generated .cu.c files, ar for everything.
  3. top-level make: link against the generated lib
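Spelled out as commands, the 2nd-level make does roughly this (compiler and file names here are placeholders for illustration, not my actual build):

```shell
# nvcc only translates the kernels to host C; the real host compiler
# does all the object-code generation, so every .o comes from the
# same gcc/PGI version.
nvcc -cuda -o kernels.cu.c kernels.cu      # .cu -> generated host C
pgcc -c -o kernels.o kernels.cu.c          # host compiler, e.g. PGI
pgcc -c -o host_part.o host_part.c         # plain C part of the lib
ar rcs libmycuda.a kernels.o host_part.o   # bundle into the static lib
```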

According to nvcc.pdf, -code must indeed be >= -arch, as I posted.

My questions:

  1. Why does specifying -code effectively disable the default for -arch, which is sm_10?
  2. What’s the difference between compute and sm?

You’re creating a CUDA lib, then using the runtime anyway in your calling code? You should just put a C wrapper around your kernels inside the lib, encapsulate it all in there, and not have the calling code depend on the cuda runtime at all.
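For instance, something like this (a minimal sketch; the kernel name, launch configuration, and error handling are all made up for illustration):

```cuda
// wrapper.cu -- compiled by nvcc; exposes only a plain C symbol.
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// The only symbol the rest of the app links against; no CUDA
// types or headers leak into the calling code.
extern "C" int scale_on_gpu(float *host_data, float factor, int n)
{
    float *dev = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return -1;
    cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev, factor, n);
    cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```

The caller just declares `int scale_on_gpu(float *, float, int);` and never includes a CUDA header.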

Regarding your two other questions, I’ve edited my previous post. Let me know if it explains anything better.

To be clearer: the compute models specify a virtual instruction set (like Java bytecode) that C kernels are compiled to, while the shader models specify the actual hardware instruction set (like x86) that PTX kernels are JITed to. This Java-like approach helps forward compatibility (and has other benefits), but it is augmented by the ability to embed the final machine code inside your executable for performance reasons.

If all you have is -arch compute_10 and -code compute_10, the driver will recompile your PTX into machine code every time the application is launched, and it will work on all current architectures and [probably] all future ones. This causes launch delays, but may also allow new optimizations as the driver/JITer is updated. If you specify -arch compute_10 and -code sm_10, it will only work on the G80 and not on the GT200. (I think. At the least, on some near-future binary-incompatible GPU it won’t work.) And the driver certainly won’t be able to re-optimize it.
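In flag form, the trade-off looks like this (the input file name is a placeholder):

```shell
# PTX only: runs everywhere, but the driver JITs it at each launch.
nvcc -arch compute_10 -code compute_10 -cuda kernels.cu

# Machine code only: no JIT delay, but tied to sm_10-class hardware.
nvcc -arch compute_10 -code sm_10 -cuda kernels.cu

# Both: the driver uses the embedded machine code when it matches
# the GPU, and falls back to JITing the PTX otherwise.
nvcc -arch compute_10 -code compute_10,sm_10 -cuda kernels.cu
```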

I am creating a mixed CUDA C lib, which encapsulates all cuda includes. The main app (top level make in my above post) does not use any cuda includes.

One of the reasons I am doing this is that object files created by different versions of gcc are apparently incompatible. My main app is Fortran, and due to some issues with gfortran prior to gcc 4.3.x that our group reported and that were confirmed and fixed, I need to use that gcc version, which is unsupported by CUDA. I could list a lot more issues I’ve had with other compilers, but that’s beyond the scope of this thread. Hence, nvcc -cuda is the way to go for me.

Thanks for the Java JIT analogy, that explains - for me - the difference between compute and sm.

This thread hence boils down to:
a) Why does -code disable the default -arch setting?
b) Is there a performance penalty when specifying -arch sm_10 on devices with a higher compute capability, as long as -code covers all the features I need? (The difference is probably in the noise; I’m just curious.)


Conceivably, -arch compute_11 or compute_13 could allow new PTX instructions that let nvcc automatically achieve better optimization (this happens in the x86 world all the time), but I’m not aware of any. I think it’s just a feature difference for now (features that have to be used manually). EDIT: No, -arch does not affect optimization; it just prevents you from calling certain intrinsics and using certain constructs in C code. All automatic optimizations are determined by -code.

Oh, actually, if you use doubles, or don’t end all your floating-point literals with ‘f’, turning on -arch compute_13 will cause a big performance drop.


nvcc -I/usr/local/cuda/2.0/include -g -DENABLE_PARAMETER_CHECK --host-compilation C -code compute_10,compute_11,compute_13,sm_10,sm_11,sm_13 -arch compute_13 -DENABLE_CUDA_DOUBLEPREC -cuda -o

nvcc fatal : Incompatible code generation requested: arch='compute_13', code='compute_10'


Well, no idea what I’m doing wrong. The above (with -arch compute_10) works fine for my G80. I don’t see why it shouldn’t work for the GT200 when I pass the corresponding -arch.

My app only works if I don’t create fatbins at all, so either this is broken or I am doing something terribly wrong. Please shed some light beyond “this is probably a bug”. I can boil this down into an official bug report for nvonline if needed.


I think I confused you.

I said -arch compute_10 will work on anything (the G80 feature set will be supported by every GPU). -code sm_10 won’t work on future GPUs (future chips won’t necessarily be binary compatible, i.e. able to execute the exact same opcodes). -code compute_10 will work on future GPUs (they’ll be able to recompile the PTX into compatible opcodes).

-arch must be <= -code: the feature-set version must be <= the generated-code version. It just so happens that the feature set is specified by an assembly-code version (and it doesn’t even matter whether you name a virtual or a machine code). It’s confusing, but in the big picture -arch and -code are fundamentally different things: supported features vs. actual code.
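Concretely, the constraint plays out like this (hypothetical file name):

```shell
# OK: the baseline (compute_10) is <= every -code entry.
nvcc -arch compute_10 -code compute_10,sm_10,sm_13 -cuda kernels.cu

# Fails: code='compute_10' is below arch='compute_13', which is
# exactly the "Incompatible code generation requested" error above.
nvcc -arch compute_13 -code compute_10,sm_13 -cuda kernels.cu
```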

Thanks for your implicit explanation of the difference between compute and sm. It works now, after dropping all the sm stuff from my infrastructure.