Programming CUDA C with different GPUs

I was wondering if I would have any issues if I wrote programs and tested them with a cheap graphics card, say a GTX 950, and then tried to run them for best performance on a good one, like a Tesla K80. I would like to do this so that I can work from home a bit using the cheap card and, when I need to do serious work, move my code to my work's K80. If there are any issues in general that you can think of, I would love to hear them.
Thank you,

No issues if you’re careful.

You mention two cards with different architectures and significantly different amounts of registers and shared memory per multiprocessor.

These differences are large enough that I would probably want to exploit them!

If I’m developing a kernel that needs to run across a wide range of NVIDIA GPU architectures then my .cu file usually has a preamble that declares or calculates all the architecture-specific magic numbers that might be required.

For example, you might tune how much shared memory each warp or block uses on a particular architecture. You can then compute launch bounds from how many blocks “fit” on a multiprocessor, which depends on the available shared memory and on the architectural limit on how many warps can be resident at once.
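To make that fit calculation concrete, here is a minimal sketch. The helper names `blocks_per_sm` and `print_sm_limits` are hypothetical (they are not from this thread or from any library), and the limits passed to `blocks_per_sm` are whatever you look up for your target device. `print_sm_limits` only prints what the runtime reports so that hand-entered numbers can be checked.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks that can be resident on one
// multiprocessor, taking the smallest of three limits:
//   shared memory, warp slots, and block slots.
int blocks_per_sm(size_t smem_per_sm, size_t smem_per_block,
                  int max_blocks_per_sm, int max_warps_per_sm,
                  int warps_per_block)
{
    int by_smem  = (int)(smem_per_sm / smem_per_block);
    int by_warps = max_warps_per_sm / warps_per_block;
    int n = (by_smem < by_warps) ? by_smem : by_warps;
    return (n < max_blocks_per_sm) ? n : max_blocks_per_sm;
}

// Hypothetical helper: print the shared memory and warp limits that the
// runtime reports for a device. Returns 0 on success, -1 on error.
int print_sm_limits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("%s (compute capability %d.%d): %zu bytes shared per SM, %d warps per SM\n",
           prop.name, prop.major, prop.minor,
           prop.sharedMemPerMultiprocessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```

For instance, with 96 KB of shared memory, 32 block slots, and 64 warp slots per multiprocessor, a one-warp block that uses 3072 bytes gives `blocks_per_sm(98304, 3072, 32, 64, 1) == 32`.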

Finally, you’re going to want to build for a variety of architectures but realize this can make your build times longer.

-gencode=arch=compute_20,code=sm_21     \
-gencode=arch=compute_30,code=sm_30     \
-gencode=arch=compute_50,code=sm_50     \
-gencode=arch=compute_52,code=sm_52     \
-gencode=arch=compute_61,code=sm_61     \

The snippet above is overkill… probably just stick with the architectures you plan to run on and maybe include a virtual architecture (compute_xx) or two if you think you’ll need to run on an unexpected device.
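If you want to check which compiled variant a device actually picked up, one option is to ask the runtime. The sketch below is only an illustration: `probe_kernel` and `report_kernel_arch` are made-up names, and the interpretation in the comments is my understanding of `cudaFuncGetAttributes` rather than anything stated in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; any __global__ function in your build works here.
__global__ void probe_kernel(int *out)
{
    if (out) *out = 1;
}

// Hypothetical helper: print the current device's compute capability next to
// the versions the runtime reports for probe_kernel.
// Returns 0 on success, -1 on error.
int report_kernel_arch(void)
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaFuncAttributes attr;

    cudaError_t err = cudaGetDevice(&dev);
    if (err == cudaSuccess) err = cudaGetDeviceProperties(&prop, dev);
    if (err == cudaSuccess) err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        // Assumption: an error at this point often means the build holds no
        // code (binary or PTX) that this device can run.
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // Both versions are encoded as 10 * major + minor (for example 52).
    printf("device : %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("kernel : binaryVersion %d, ptxVersion %d\n",
           attr.binaryVersion, attr.ptxVersion);
    return 0;
}
```

Calling `report_kernel_arch()` once at startup on each machine is a cheap way to confirm that the `-gencode` list covers the cards you care about.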

And here’s a skeleton for a warp-centric kernel preamble:


#if   (__CUDA_ARCH__ >= 520)

// Maxwell v2 -- 64K registers and 3072 bytes per warp
#define WARPS_PER_BLOCK               1
#define LAUNCH_BOUNDS_BLOCKS          __min((98304/sizeof(wss)), __min(32, (64 / WARPS_PER_BLOCK)))

#elif (__CUDA_ARCH__ >= 500)

// Maxwell v1 -- 64K registers and 2048 bytes per warp
#define WARPS_PER_BLOCK               1
#define LAUNCH_BOUNDS_BLOCKS          __min((65536/sizeof(wss)), __min(32, (64 / WARPS_PER_BLOCK)))

#elif ((__CUDA_ARCH__ >= 350) || (__CUDA_ARCH__ == 300))

// Kepler (discrete) -- 64K registers and 1536 bytes per warp
#define WARPS_PER_BLOCK               2
#define LAUNCH_BOUNDS_BLOCKS          __min((49152/sizeof(wss)), __min(16, (64 / WARPS_PER_BLOCK)))

#elif (__CUDA_ARCH__ == 320)

// GK20A 'K1' SoC -- 32K registers and 3072 bytes per warp
#define WARPS_PER_BLOCK               1
#define LAUNCH_BOUNDS_BLOCKS          __min((49152/sizeof(wss)), __min(16, (64 / WARPS_PER_BLOCK)))

#elif (__CUDA_ARCH__ >= 200)

// Fermi -- 32K registers and 3072 bytes per warp
#define WARPS_PER_BLOCK               2
#define LAUNCH_BOUNDS_BLOCKS          __min((49152/sizeof(wss)), __min(8,  (48 / WARPS_PER_BLOCK)))


#else

// Fallback: host pass (__CUDA_ARCH__ undefined) or an unlisted architecture
#define WARPS_PER_BLOCK               999
#define LAUNCH_BOUNDS_BLOCKS          999

#endif

static __shared__ struct warp_state wss[WARPS_PER_BLOCK];


Trim to fit your requirements and multiprocessor capabilities.
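One thing the skeleton does not show is how the macros get used. Here is a hedged sketch, cut down to two branches with made-up numbers. `warp_state`, `warp_kernel`, `warps_per_block_for_arch`, and `launch_warp_kernel` are all hypothetical. The point is that `__CUDA_ARCH__` is not defined in host code, so the host has to pick a matching block size at run time. This sketch assumes that `ptxVersion` from `cudaFuncGetAttributes` tracks the `__CUDA_ARCH__` value the kernel was compiled with.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-warp state (256 bytes).
struct warp_state { unsigned int scratch[64]; };

// Two-branch version of the preamble with illustrative values only.
// The host pass takes the #else branch because __CUDA_ARCH__ is undefined.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 500)
#define WARPS_PER_BLOCK       1
#define LAUNCH_BOUNDS_BLOCKS  32
#else
#define WARPS_PER_BLOCK       2
#define LAUNCH_BOUNDS_BLOCKS  16
#endif

// __launch_bounds__(max threads per block, min resident blocks per SM)
__global__ void
__launch_bounds__(WARPS_PER_BLOCK * 32, LAUNCH_BOUNDS_BLOCKS)
warp_kernel(unsigned int *out)
{
    __shared__ warp_state wss[WARPS_PER_BLOCK];

    const unsigned int warp = threadIdx.x / 32;
    const unsigned int lane = threadIdx.x % 32;

    wss[warp].scratch[lane] = lane;   // each warp touches only its own slot
    __syncthreads();

    if (lane == 0)                    // one result per warp
        out[blockIdx.x * WARPS_PER_BLOCK + warp] = wss[warp].scratch[31];
}

// Host-side mirror of the #if ladder above.
// arch is encoded as 10 * major + minor, for example 52.
int warps_per_block_for_arch(int arch)
{
    return (arch >= 50) ? 1 : 2;
}

// d_out must hold at least 2 * blocks elements, since 2 is the largest
// WARPS_PER_BLOCK in this sketch.
cudaError_t launch_warp_kernel(unsigned int *d_out, int blocks)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, warp_kernel);
    if (err != cudaSuccess) return err;

    // Assumption: ptxVersion matches the __CUDA_ARCH__ branch that was compiled.
    const int warps = warps_per_block_for_arch(attr.ptxVersion);
    warp_kernel<<<blocks, 32 * warps>>>(d_out);
    return cudaGetLastError();
}
```

Keeping the host-side function next to the preamble makes it harder for the two to drift apart when you add a branch for a new architecture.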

allanmac’s macro-based approach is actually what’s used in the CUB library, which I thought was pretty neat. Apparently, they have a wicked fast prefix sum implementation.

Thank you! This was very helpful. Would this mean that I would be better off getting a GTX 760, since it also has the Kepler architecture?

No, I think these days you’re better off developing on an sm_52 or newer device (like your GTX 950) unless you’re really really really trying to tune for the Kepler architecture in some special way… but also realize the K80 is kind of an odd uber-compute GPU with its huge register files and extra shared memory.

IIRC the GTX 950 has better profiling and debugging support than any previous architecture.

I prefer to debug on whatever is my newest GPU but profile on my smallest GPU, since improvements and regressions are more evident there.
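For quick before/after comparisons on the small card, a simple event-based timer is often enough before reaching for the profiler. This is a minimal sketch with a placeholder workload. `busy_kernel` and `time_busy_kernel` are hypothetical names, and you would swap in the kernel you are actually tuning.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder workload; substitute the kernel under study.
__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: average milliseconds per launch over `reps` launches,
// or -1.0f on bad arguments or any CUDA error.
float time_busy_kernel(int n, int reps)
{
    if (n <= 0 || reps <= 0) return -1.0f;

    float *d = NULL;
    cudaEvent_t start, stop;
    float total_ms = 0.0f;
    const size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&start) != cudaSuccess) { cudaFree(d); return -1.0f; }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaFree(d);
        return -1.0f;
    }
    cudaMemset(d, 0, bytes);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    busy_kernel<<<blocks, threads>>>(d, n);   // warm-up launch, not timed

    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        busy_kernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop, 0);

    // Wait for the timed launches to finish, then read the elapsed time.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);

    return (err == cudaSuccess) ? total_ms / (float)reps : -1.0f;
}
```

Running the same call, for example `time_busy_kernel(1 << 20, 100)`, before and after a change gives a rough number to compare; the profiler is still the better tool once you need to know why it moved.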

That’s good to know. You will have to forgive my ignorance; I am just starting to look at programming in CUDA C. I have an application that involves optimizing surfaces using thousands of control points, so I am trying to decide what the best way to go about that is. I know C for the most part, and as it was described to me, CUDA C is essentially just add-on functions to C that take advantage of the highly parallel nature of GPUs. Do you think something like the K80 is a good choice, or am I looking at this all wrong?
Thanks again!

The K80 is an expensive dual-GPU card that is probably only useful if you already have access to one.

You might be surprised at the performance of your GTX 950. It has a lot of GFLOPS and the performance you obtain from your kernels should hopefully scale up with larger devices in the Maxwell family or Pascal 10-series family.

I should mention that I don’t actually have a 950 yet; that was just my first thought when I thought of cheap Nvidia GPUs. I should also mention that this is for a university dissertation, so if it turns out that a certain card will be best, we can probably get it.

Ah, the new $260 GTX 1060 Pascal GPU is an embarrassing amount of compute power for the money.

And, assuming you don’t need FP64 support, a $1200 TITAN X Pascal GPU is arguably the world’s fastest PCIe GPU.

Great! Thank you very much. I am sure I will be hitting the forums again soon, but this should get me started.