Here’s the email I got from nVidia. I’ve spilt the input so I worked around the bug.
" Bug action changed from “Dev - Open - To fix” to “QA - Open - Verify to close”
Please do not reply to this message
-------------------------------------------------------------------
The following items have been modified for this Bug:
- Bug disposition changed from "Open issue" to "Bug - Fixed"
- Bug action changed from "Dev - Open - To fix" to "QA - Open - Verify to close"
-------------------------------------------------------------------
Bug Information
-------------------------------------------------------------------
Requester: Eyal
Customer Bug ID:
NVIDIA Bug ID: 582591
Date: 8/3/2009 7:49:55 AM
Company/Division: GPU Computing
Severity: High
Priority: 1 - High
Synopsis: Problem accessing textures bound to huge arrays (>2GB) (C1060)
Description: Hi,
My input is usually very big and I use textures to access it from the kernel. When the input is ~2GB of data (I use C1060) the texture access gives me back faulty results when I try to access high positions in the texture.
Below is a test code and 4 output scenarios.
Any ideas whats going on?
global void textureKernel( unsigned int iPos, float *fData_d, float *fOutputData1, float *fOutputData2 ) {
for ( unsigned int i = 0; i < PARAM_COUNT; i++ )
{
unsigned ii = iPos + i;
float2 fValue = tex1Dfetch( tex_LargeFloat2, ii );
fOutputData1[ i ] = fValue.x;
fOutputData2[ i ] = fValue.y;
cudaSetDevice(1); // Use the C1060.
unsigned int iDeviceFreeMem = 0, iDeviceTotalMem = 0;
unsigned int iTextureOffset = 69872990; //57,935,200;
unsigned int iSamples = 2001;
unsigned int iSize = 2 * 139676;
iSize *= iSamples; // iSize = 558,983,352
float *fDummy;
float *fData_h = new float[ iSize ];
float *fData_d = NULL;
float *fOutput_d1 = NULL;
float *fOutput_d2 = NULL;
float *fOutput_h1 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );
float *fOutput_h2 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );
printf( "Size: [%u]\n", iSize );
for ( unsigned int i = 0; i < iSize; i++ )
fData_h[ i ] = i * 1.f;
cudaMalloc( ( void ** )&( fDummy ), sizeof( float ) );
GPU_CHECK_ERR();
cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );
printf( "There should be [%u] pairs of float2.\n", iSize / 2 );
printf( "Preparing data for the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );
cudaMalloc( ( void ** )&( fData_d ), iSize * sizeof( float ) );
GPU_CHECK_ERR();
// Copy the actual data from the host pointer to the device pointer (cudaMemcpyHostToDevice).
cudaMemcpy( fData_d, fData_h, iSize * sizeof( float ), cudaMemcpyHostToDevice );
cudaMalloc( ( void ** )&( fOutput_d1 ), PARAM_COUNT * sizeof( float ) );
cudaMemset( fOutput_d1, 0, PARAM_COUNT * sizeof( float ) );
GPU_CHECK_ERR();
cudaMalloc( ( void ** )&( fOutput_d2 ), PARAM_COUNT * sizeof( float ) );
cudaMemset( fOutput_d2, 0, PARAM_COUNT * sizeof( float ) );
GPU_CHECK_ERR();
CUDA_SAFE_CALL( cudaBindTexture( 0, tex_LargeFloat2, fData_d, iSize ) );
cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );
printf( "Calling the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );
textureKernel<<< 1, 1 >>>( iTextureOffset, fData_d, fOutput_d1, fOutput_d2 );
GPU_CHECK_ERR();
printf( "Copying output from the kernel\n" );
cudaMemcpy( fOutput_h1, &( fOutput_d1[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );
cudaMemcpy( fOutput_h2, &( fOutput_d2[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );
GPU_CHECK_ERR();
printf( "Kernel results - Offset [%u]:\n", iTextureOffset );
for( int i = 0; i < PARAM_COUNT; i++ )
printf( "[%d]: [%.3f, %.3f] vs [%.3f, %.3f]\n", i, fOutput_h1[ i ], fOutput_h2[ i ], fData_h[ 2 * ( iTextureOffset + i ) ], fData_h[ 2 * ( iTextureOffset + i ) + 1 ] );
// Now clean up everything ….
cudaFree( fData_d ); cudaFree( fOutput_d1 ); cudaFree( fOutput_d2 );
free( fData_h ); free( fOutput_h1 ); free( fOutput_h2 );
And here are 4 outputs using different values for the <b>iTextureOffset</b> parameter (first column is data from the kernel and second column is from the host):
// All great…
Size: [558983352]
There should be [279491676] pairs of float2.
Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [0]:
[0]: [0.000, 1.000] vs [0.000, 1.000]
[1]: [2.000, 3.000] vs [2.000, 3.000]
[2]: [4.000, 5.000] vs [4.000, 5.000]
[3]: [6.000, 7.000] vs [6.000, 7.000]
[4]: [8.000, 9.000] vs [8.000, 9.000]
[5]: [10.000, 11.000] vs [10.000, 11.000]
[6]: [12.000, 13.000] vs [12.000, 13.000]
[7]: [14.000, 15.000] vs [14.000, 15.000]
[8]: [16.000, 17.000] vs [16.000, 17.000]
[9]: [18.000, 19.000] vs [18.000, 19.000]
// This one is still ok…
Size: [558983352]
There should be [279491676] pairs of float2.
Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69872900]:
[0]: [139745792.000, 139745808.000] vs [139745792.000, 139745808.000]
[1]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[2]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[3]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[4]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[5]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[6]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[7]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]
[8]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[9]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
// Last item is faulty…
Size: [558983352]
There should be [279491676] pairs of float2.
Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69872910]:
[0]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[1]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[2]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[3]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[4]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[5]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]
[6]: [139745824.000, 139745840.000] vs [139745824.000, 139745840.000]
[7]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]
[8]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]
[9]: [0.000, 0.000] vs [139745840.000, 139745840.000]
// All items are faulty…
Size: [558983352]
There should be [279,491,676] pairs of float2.
Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69,872,990]:
[0]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[1]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[2]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[3]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[4]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[5]: [0.000, 0.000] vs [139745984.000, 139745984.000]
[6]: [0.000, 0.000] vs [139745984.000, 139746000.000]
[7]: [0.000, 0.000] vs [139746000.000, 139746000.000]
[8]: [0.000, 0.000] vs [139746000.000, 139746000.000]
[9]: [0.000, 0.000] vs [139746000.000, 139746000.000]
-------------------- Additional Information ------------------------ Computer Type: PC System Model Type:
System Model Number:
CPU Type:
Video Memory Type:
Chipset Mfg:
Chipset Type:
Sound Card:
CPU Speed:
Network:
Modem:
North Bridge:
South Bridge:
TV Encoder:
Bus Type: AGP
OS Language:
Application:
Driver Version: 190
System BIOS Version:
Video BIOS Mfg:
Video BIOS Version:
Direct X Version:
Monitor Type:
Monitor 1:
Monitor 2:
Monitor 3:
Video 1:
Video 2:
Video 3:
Resolution:
Color Depth:
Products: other
Application Version:
Application Setting:
Multithreaded Application:
Other open applications:
Release:
OS Details:
Problem Category:
How often does problem occur:
Video Memory Size:
CPUs (single or multi):
RAM (amount & type):
AGP Aperture Size:
-------------------------------------------------------------------
Latest Comment update from NVIDIA (8/3/2009 5:06:04 PM):
Will you attach a self contained example and information on the development environment you're using, in particular any special build options or compiler flags? Where is GPU_CHECK_ERR defined? We'll need to be able to reproduce this locally for further investigation.