CUDA Portability and Shared Memory vs. Cache

Hello.

First of all, this is NOT about portability between CPU and GPU or between ATI and NVIDIA. This is ONLY about portability between different CUDA versions and different NVIDIA architectures.

My CUDA-based code runs on the GTX200 architecture. I would like to run the same source code (but possibly a different executable) on Fermi.

I would also like the same source code to compile on both CUDA 2.x and CUDA 3.x, at least for the time being.

I have a few questions:

  1. Do I need to compile the code with different flags so that it utilizes Fermi more efficiently? I don’t need any Fermi features that are not available on GTX200, and I don’t need double precision. Of course, I’d prefer to use the same build flags for both GTX200 and Fermi, unless performance on Fermi is worse under generic flags.

  2. I would like to configure the device to offer more shared memory at the expense of cache. I found the function cudaThreadSetCacheConfig in CUDA 3.2, which seems relevant to my task. As far as I understand, that function is not available in earlier versions of CUDA.

Since cudaThreadSetCacheConfig(…) is not in CUDA 2.x, and I need to compile within that framework, I imagine adding something like

#if CUDA_VERSION > 3.0
MY_OWN_ERROR_CHECKING_MACRO(cudaThreadSetCacheConfig(cudaFuncCachePreferShared));
#endif

to the initialization section of my code, e.g. where I invoke cudaSetDeviceFlags(cudaDeviceBlockingSync). [BTW, does the order of these two calls matter?]
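
Concretely, I picture the whole guarded block looking roughly like this (the version macro name and its numeric encoding, the device-property check, and myDeviceId are my own guesses/placeholders, hence the questions below):

// Not sure whether this has to come before or after the cache-config call:
MY_OWN_ERROR_CHECKING_MACRO(cudaSetDeviceFlags(cudaDeviceBlockingSync));
#if CUDA_VERSION >= 3020  // guess at both the macro name and the encoding of "3.2"
cudaDeviceProp prop;
MY_OWN_ERROR_CHECKING_MACRO(cudaGetDeviceProperties(&prop, myDeviceId));
// Guess: only request the larger shared-memory split on Fermi-class (compute 2.x) devices
if (prop.major >= 2) {
    MY_OWN_ERROR_CHECKING_MACRO(cudaThreadSetCacheConfig(cudaFuncCachePreferShared));
}
#endif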

  • What macro should I use to determine that the CUDA version is above 3.0?

  • Is the call to cudaThreadSetCacheConfig valid on pre-Fermi architectures in CUDA 3.x environments? Should I check the device properties before making this call?

  • The doc for cudaThreadSetCacheConfig states that the GPU does not guarantee that the requested shared memory/cache policy is honored, even on Fermi. Is there a stricter function that actually guarantees the device’s shared memory/cache split is configured the way I want, or that returns an error otherwise?

Thanks!

  1. No. Use the “-arch=compute_10 -code=compute_10” options (or compute_13 if you need compute_13-specific features). In that case the actual code generation is done by the device driver.

  2. On Fermi, the default cache configuration is 48 KB of shared memory and 16 KB of L1 cache. You could leave it as is.

You could check the resulting cache config with cudaThreadGetCacheConfig.
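
For example, a minimal sketch with the CUDA 3.2 per-thread calls (untested; error checking omitted; this reads back the recorded preference, not a hardware guarantee):

enum cudaFuncCache cfg;
// request the 48 KB shared / 16 KB L1 split
cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
// read back the preference the runtime has recorded
cudaThreadGetCacheConfig(&cfg);
if (cfg != cudaFuncCachePreferShared) {
    // the preference was not accepted as requested
}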

Thanks, Alexander.

Doing nothing sounds like the way to go.

I suppose you need to compile with -arch=compute_20 -code=compute_20, or something like it (see the manual), to generate an optimized PTX file for compute 2.0. You can specify several different targets, so several PTX files will be generated, and the driver will use the appropriate one. Fermi, by the way, needs tuning of block size, register count, etc.; otherwise it can end up even slower than GT200.
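
For example, something like the following embeds PTX for both architectures in one binary and lets the driver JIT the appropriate one at load time (myapp.cu is just a placeholder; check the nvcc manual for the exact option spelling):

nvcc -gencode arch=compute_13,code=compute_13 \
     -gencode arch=compute_20,code=compute_20 \
     -o myapp myapp.cu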

Is there a way to call cudaThreadGetCacheConfig() from cuda-gdb, having attached the debugger to a running CUDA app?

Thanks!
