CUDA Portability and Shared Memory vs. Cache

Hello.

First of all, this is NOT about portability between CPU and GPU or between ATI and NVIDIA. This is ONLY about portability between different CUDA versions and different NVIDIA architectures.

My CUDA-based code runs on the GTX200 architecture. I would like to run the same source code (but possibly a different executable) on Fermi.

I would also like the same source code to compile on both CUDA 2.x and CUDA 3.x, at least for the time being.

I have a few questions:

  1. Do I need to compile the code with different flags so that it utilizes Fermi more efficiently? I don’t need any Fermi features that are not available on GTX200, and I don’t need double precision. Of course, I’d prefer to use the same build flags for both GTX200 and Fermi, unless performance on Fermi is worse under generic flags.

  2. I would like to configure the device to offer more shared memory at the expense of cache. I found the function cudaThreadSetCacheConfig in CUDA 3.2, which seems relevant to my task. As far as I understand, that function is not available in earlier versions of CUDA.

Since cudaThreadSetCacheConfig(…) is not in CUDA 2.x, and I need to compile within that framework, I imagine adding something like

#if CUDA_VERSION > 3.0
MY_OWN_ERROR_CHECKING_MACRO(cudaThreadSetCacheConfig(cudaFuncCachePreferShared));
#endif

to the initialization section of my code, e.g. where I invoke cudaSetDeviceFlags(cudaDeviceBlockingSync). [BTW, does the order of these two calls matter?]
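
Concretely, I picture the whole guarded block looking roughly like this (the version macro name and its numeric encoding, the device-property check, and myDeviceId are my own guesses/placeholders, hence the questions below):

// Not sure whether this has to come before or after the cache-config call:
MY_OWN_ERROR_CHECKING_MACRO(cudaSetDeviceFlags(cudaDeviceBlockingSync));
#if CUDA_VERSION >= 3020  // guess at both the macro name and the encoding of "3.2"
cudaDeviceProp prop;
MY_OWN_ERROR_CHECKING_MACRO(cudaGetDeviceProperties(&prop, myDeviceId));
// Guess: only request the larger shared-memory split on Fermi-class (compute 2.x) devices
if (prop.major >= 2) {
    MY_OWN_ERROR_CHECKING_MACRO(cudaThreadSetCacheConfig(cudaFuncCachePreferShared));
}
#endif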

  • What macro should I use to determine that the CUDA version is above 3.0?

  • Is the call to cudaThreadSetCacheConfig valid on pre-Fermi architectures in CUDA 3.x environments? Should I check the device properties before making this call?

  • The doc for cudaThreadSetCacheConfig states that the GPU does not guarantee that the requested shared memory/cache policy is honored, even on Fermi. Is there a stricter function that actually guarantees the device’s shared memory/cache split is configured the way I want, or that returns an error otherwise?

Thanks!

  1. No. Use the “-arch=compute_10 -code=compute_10” options (or compute_13 if you need compute_13-specific features). In that case the actual code generation is done by the device driver.

  2. On Fermi, the default cache configuration is 48 KB of shared memory and 16 KB of L1 cache. You could leave it as is.

You could check the resulting cache config with cudaThreadGetCacheConfig.
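
For example, a minimal sketch with the CUDA 3.2 per-thread calls (untested; error checking omitted; this reads back the recorded preference, not a hardware guarantee):

enum cudaFuncCache cfg;
// request the 48 KB shared / 16 KB L1 split
cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
// read back the preference the runtime has recorded
cudaThreadGetCacheConfig(&cfg);
if (cfg != cudaFuncCachePreferShared) {
    // the preference was not accepted as requested
}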

Thanks, Alexander.

Doing nothing sounds like the way to go.

I suppose you need to compile with -arch=compute_20 -code=compute_20, or something like it (see the manual), to generate an optimized PTX file for compute 2.0. You can specify several different targets, so several PTX files will be generated, and the driver will use the appropriate one. Fermi, by the way, needs tuning of block size, register count, etc.; otherwise it can end up even slower than GT200.
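
For example, something like the following embeds PTX for both architectures in one binary and lets the driver JIT the appropriate one at load time (myapp.cu is just a placeholder; check the nvcc manual for the exact option spelling):

nvcc -gencode arch=compute_13,code=compute_13 \
     -gencode arch=compute_20,code=compute_20 \
     -o myapp myapp.cu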

Is there a way to call cudaThreadGetCacheConfig() from cuda-gdb, having attached the debugger to a running CUDA app?

Thanks!
