Weird behaviour with large data and textures

Hi,

My input is usually very big and I use textures to access it from the kernel. When the input is ~2GB of data (I use C1060) the texture access gives me back faulty results when

I try to access high positions in the texture.

Below is a test code and 4 output scenarios.

Any ideas whats going on?

Thanks…

#define PARAM_COUNT 10

texture<float2, 1, cudaReadModeElementType> tex_LargeFloat2;

__global__ void textureKernel( unsigned int iPos, float *fData_d, float *fOutputData1, float *fOutputData2 )

{

	for ( unsigned int i = 0; i < PARAM_COUNT; i++ )

	{

		unsigned ii = iPos + i;

		float2 fValue = tex1Dfetch( tex_LargeFloat2, ii );

		fOutputData1[ i ] = fValue.x;

		fOutputData2[ i ] = fValue.y;

//		fOutputData1[ i ] = fData_d[ ii * 2 ];

//		fOutputData2[ i ] = fData_d[ ii * 2 + 1 ];

	}

}

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int

main(int argc, char** argv) 

{

	cudaSetDevice(1); // Use the C1060.

	unsigned int iDeviceFreeMem = 0, iDeviceTotalMem = 0;

	unsigned int iTextureOffset = 69872990; //57,935,200;

	unsigned int iSamples = 2001;  

	unsigned int iSize = 2 * 139676; 

	iSize *= iSamples;		// iSize = 558,983,352

	float *fDummy;

	float *fData_h = new float[ iSize ];

	float *fData_d = NULL;

	float *fOutput_d1 = NULL;

	float *fOutput_d2 = NULL;

	float *fOutput_h1 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );

	float *fOutput_h2 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );

	printf( "Size: [%u]\n", iSize );

	for ( unsigned int i = 0; i < iSize; i++ )

		fData_h[ i ] = i * 1.f;

	cudaMalloc( ( void ** )&( fDummy ), sizeof( float ) );

	GPU_CHECK_ERR();

	cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );

	printf( "There should be [%u] pairs of float2.\n", iSize / 2 );

	printf( "Preparing data for the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );

	cudaMalloc( ( void ** )&( fData_d ), iSize * sizeof( float ) );

	GPU_CHECK_ERR();

	// Copy the actual data from the host pointer to the device pointer (cudaMemcpyHostToDevice).

	cudaMemcpy( fData_d, fData_h, iSize * sizeof( float ), cudaMemcpyHostToDevice );

	cudaMalloc( ( void ** )&( fOutput_d1 ), PARAM_COUNT * sizeof( float ) );

	cudaMemset( fOutput_d1, 0, PARAM_COUNT * sizeof( float ) );

	GPU_CHECK_ERR();

	cudaMalloc( ( void ** )&( fOutput_d2 ), PARAM_COUNT * sizeof( float ) );

	cudaMemset( fOutput_d2, 0, PARAM_COUNT * sizeof( float ) );

	GPU_CHECK_ERR();

	CUDA_SAFE_CALL( cudaBindTexture( 0, tex_LargeFloat2, fData_d, iSize ) );

	

	cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );

	printf( "Calling the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );

	textureKernel<<< 1, 1 >>>( iTextureOffset, fData_d, fOutput_d1, fOutput_d2 );

	GPU_CHECK_ERR();

	

	printf( "Copying output from the kernel\n" );

	cudaMemcpy( fOutput_h1, &( fOutput_d1[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );

	cudaMemcpy( fOutput_h2, &( fOutput_d2[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );

	GPU_CHECK_ERR();

	printf( "Kernel results - Offset [%u]:\n", iTextureOffset );

	for( int i = 0; i < PARAM_COUNT; i++ )

		printf( "[%d]: [%.3f, %.3f] vs [%.3f, %.3f]\n", i, fOutput_h1[ i ], fOutput_h2[ i ], fData_h[ 2 * ( iTextureOffset + i ) ], fData_h[ 2 * ( iTextureOffset + i ) + 1 ] );

	// Now clean up everything ….

	cudaFree( fData_d ); cudaFree( fOutput_d1 ); cudaFree( fOutput_d2 );

	free( fData_h ); free( fOutput_h1 ); free( fOutput_h2 );

And here are 4 outputs using different values for the iTextureOffset parameter

(first column is data from the kernel and second column is from the host):

// All great...

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152]

Calling the kernel, memory available: [2010206208/4294705152]

Copying output from the kernel

Kernel results - Offset [0]:

[0]: [0.000, 1.000] vs [0.000, 1.000]

[1]: [2.000, 3.000] vs [2.000, 3.000]

[2]: [4.000, 5.000] vs [4.000, 5.000]

[3]: [6.000, 7.000] vs [6.000, 7.000]

[4]: [8.000, 9.000] vs [8.000, 9.000]

[5]: [10.000, 11.000] vs [10.000, 11.000]

[6]: [12.000, 13.000] vs [12.000, 13.000]

[7]: [14.000, 15.000] vs [14.000, 15.000]

[8]: [16.000, 17.000] vs [16.000, 17.000]

[9]: [18.000, 19.000] vs [18.000, 19.000]

// This one is still ok...

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152]

Calling the kernel, memory available: [2010206208/4294705152]

Copying output from the kernel

Kernel results - Offset [69872900]:

[0]: [139745792.000, 139745808.000] vs [139745792.000, 139745808.000]

[1]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[2]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]	

[3]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[4]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[5]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[6]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[7]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[8]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[9]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

// Last item is faulty...

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152]

Calling the kernel, memory available: [2010206208/4294705152]

Copying output from the kernel

Kernel results - Offset [69872910]:

[0]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[1]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[2]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[3]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[4]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[5]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[6]: [139745824.000, 139745840.000] vs [139745824.000, 139745840.000]

[7]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]

[8]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]

[9]: [0.000, 0.000] vs [139745840.000, 139745840.000]

// All items are faulty...

Size: [558983352]

There should be [279,491,676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152]

Calling the kernel, memory available: [2010206208/4294705152]

Copying output from the kernel

Kernel results - Offset [69,872,990]:

[0]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[1]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[2]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[3]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[4]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[5]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[6]: [0.000, 0.000] vs [139745984.000, 139746000.000]

[7]: [0.000, 0.000] vs [139746000.000, 139746000.000]

[8]: [0.000, 0.000] vs [139746000.000, 139746000.000]

[9]: [0.000, 0.000] vs [139746000.000, 139746000.000]