Copy to constant memory fails

I have a block of code which fails in two different ways on two different machines:

__constant__ unsigned char colourPalette[MAX_COLOUR_INDICES];

__host__ void InitColours( void )
{
    unsigned char *tempColours = new unsigned char[MAX_COLOUR_INDICES];

    // Fill a temporary host buffer with random colour values.
    for( int i = 0; i < MAX_COLOUR_INDICES; ++i )
    {
        tempColours[i] = rand() % 256;
    }

    // Copy the host buffer into the __constant__ array on the device.
    cout << "Memcpy returned " << cudaMemcpyToSymbol( "colourPalette", tempColours, MAX_COLOUR_INDICES, 0, cudaMemcpyHostToDevice ) << endl;

    delete [] tempColours;
}

On Machine 1 (a GeForce 9600M), this compiles, but cudaMemcpyToSymbol returns error code 13 (cudaErrorInvalidSymbol).

On Machine 2 (a GTX 460 and a GT 240), this does not compile at all, producing cryptic macro errors such as “not enough actual parameters” and __shadow_var failures.

I even tried changing the memcpy call to pass the symbol directly:

cout << "Memcpy returned " << cudaMemcpyToSymbol( colourPalette, tempColours, MAX_COLOUR_INDICES, 0, cudaMemcpyHostToDevice ) << endl;

Any help would be appreciated!

This most likely indicates that you don’t have device code compiled in that supports your device’s compute capability. __constant__ variables are compiled in with the device code, so if there is no device code that matches your device, the symbol’s address cannot be retrieved for uploading the data (no matching device code was loaded onto the device, so the address doesn’t exist).
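As a quick sanity check (a minimal sketch, assuming only the standard cudaGetDeviceProperties call), you can print the compute capability of the current device at runtime and compare it against the architectures you compiled for:

#include <cuda_runtime.h>
#include <iostream>

// Sketch: print the compute capability of the current device so it can be
// compared against the -arch/-gencode targets the binary was built with.
void PrintComputeCapability( void )
{
    int device = 0;
    cudaGetDevice( &device );

    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, device );

    std::cout << "Device " << device << " (" << prop.name << ") is compute capability "
              << prop.major << "." << prop.minor << std::endl;
}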

Not sure right off on this one without seeing the actual error messages.

Can you paste in the exact error messages as well as the compiler flags used to build this code?

Thanks,

Cliff

Hi

I updated to the latest CUDA SDK (June 2010) and it magically compiled on the desktop machine.

I think you might be right about the CUDA compute level. I have two different code paths using single and double precision.

Is there any way to have one binary which can target compute 1.3 and 1.0 devices?

Okay, good to know.

Yep. You can specify multiple versions simultaneously using several -gencode options to nvcc. Section 1.3 of the Fermi Compatibility Guide (linked from http://developer.nvidia.com/object/gpucomputing.html ) shows how to do this; that example combines sm_10 and sm_20, but the same concept applies to compiling sm_10 and sm_13 together. It would end up looking something like this:

nvcc -gencode=arch=compute_10,code=sm_10 -gencode=arch=compute_10,code=compute_10 -gencode=arch=compute_13,code=sm_13 -gencode=arch=compute_13,code=compute_13 ...

The “code=compute_xx” portions build PTX; the “code=sm_xx” portions build CUBIN-style machine code. You can combine these for brevity:

nvcc -gencode=arch=compute_10,code=\"compute_10,sm_10\" -gencode=arch=compute_13,code=\"compute_13,sm_13\" ...
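As an aside (a minimal sketch under my own assumptions, not something the fat binary needs, since the runtime picks the matching code automatically), you could also select between your single- and double-precision host code paths by checking the device’s compute capability; double precision requires compute 1.3 or higher:

#include <cuda_runtime.h>

// Sketch: decide between single- and double-precision code paths based on
// the device's compute capability (double precision requires >= 1.3).
bool DeviceSupportsDouble( int device )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, device );
    return ( prop.major > 1 ) || ( prop.major == 1 && prop.minor >= 3 );
}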

–Cliff

Thanks! That’s sorted things out for me.

One other thing. When a kernel takes too long to run, Windows Vista/7 kills the display driver. Is there any way to prevent that?

Great! Glad to hear it.

This is due to the mechanism in WinVista/Win7’s WDDM display driver model called “Timeout Detection and Recovery” (TDR for short). See http://www.microsoft.com/whdc/device/displ…dm_timeout.mspx for details.

Starting in Vista, Microsoft added this mechanism to detect display drivers that have not responded for some period (two seconds by default) and to recover them (reset and reload) as seamlessly for the user as possible. There was a similar timeout mechanism in XP, but the timeout period was longer (~7 seconds, if I recall correctly), and the only “recovery” mechanism was to bugcheck (bluescreen). Adding the recovery feature and shortening the timeout period improved the user experience with broken display drivers, since an unresponsive display driver looks to the user like a hung system.

Unfortunately this means that compute kernels have to complete within this timeout period as well,* or Windows will assume that the driver has hung and will reset it. After such a reset occurs, any allocated device memory or CUDA contexts are gone, and you have to start from scratch.

* Unless you’re using our TCC (Tesla Compute Cluster) non-display driver, in which case the timeout period does not apply.

If you’re not using TCC, the solution is either to increase the timeout period to something you’re unlikely to hit with a compute kernel (I usually pick 30 seconds or more) or to disable the timeout entirely (although if you do that and then have a kernel that goes into an infinite loop, your machine will appear to be more or less hung). The Microsoft webpage linked above shows the registry keys you can use to change these settings. Note that you have to reboot for changes to the registry keys to take effect.
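As a side note (a minimal sketch, assuming only the standard cudaDeviceProp query), you can check at runtime whether the watchdog applies to a given device via the kernelExecTimeoutEnabled field:

#include <cuda_runtime.h>
#include <iostream>

// Sketch: report whether the OS run-time limit (TDR watchdog) applies to
// each device, using the kernelExecTimeoutEnabled device property.
void ReportWatchdogStatus( void )
{
    int count = 0;
    cudaGetDeviceCount( &count );

    for( int device = 0; device < count; ++device )
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, device );
        std::cout << prop.name << ": run time limit on kernels "
                  << ( prop.kernelExecTimeoutEnabled ? "enabled" : "disabled" )
                  << std::endl;
    }
}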

–Cliff

The errors I got were very similar to the ones in the following thread:

http://forums.nvidia.com/index.php?showtopic=107517

warning C4003: not enough actual parameters for macro ‘loc_sc’

warning C4003: not enough actual parameters for macro ‘__shadow_var’

I just compiled the code without any options.
