the memory size limit for binding to a texture

Based on testing this limit seems to be (51210241024 + 3) bytes. Here is the testing code

#include <stdio.h>

#define N (512*1024*1024+ 4)

#define CUERR  do{ cudaError_t err;					\

	cudaThreadSynchronize();						\

	if ((err = cudaGetLastError()) != cudaSuccess) {			\

	  printf("ERROR: CUDA error: %s, line %d,file=%s\n", cudaGetErrorString(err), __LINE__, __FILE__); \

	  exit(err);							\

	}}while(0)

texture<float, 1, cudaReadModeElementType> mytex;

__global__ void dummyKernel()

{ 

}

int main()

{

int device = 1;

  cudaSetDevice(device);CUERR;

float* A = NULL;

cudaMalloc((void**)&A, N); CUERR;

dummyKernel<<<1,1>>>(); CUERR;

cudaBindTexture(0, mytex, A, N); CUERR;	

dummyKernel<<<1,1>>>(); CUERR;

}

I did not find any document about this limitation. The worst part is there is no error message

when the size is exceeded. Only the next kernel launch will die with “invalid arguments”

and you have no idea what’s happening especially when the next kernel is complicated.

I’m surprised you’re even able to bind a texture that large.

From appendix G.1 of the programming guide:

edit: 1 float = 4 bytes, so the limit seems to be correct

N.

Nico, thanks for the information.

So it is documented! An error message would be nice though when the limit is exceeded.

On the other hand, for int2 texture

texture<int2, 1, cudaReadModeElementType> mytex

I would expect the size limit to be 1GB but it is still 512MB. Is this the expected behavior?

Look at bug: #582591

If I remember correctly CUDA 3.0 should have fixed those limitations.

Obviously it is not fixed since I am using cuda 3.0. I really need the large memory binding in order to run my program with

large input data.

PS. I cannot find your bug report in nvidia developer web site.

Can you check if you can still find it?

Here’s the email I got from nVidia. I’ve spilt the input so I worked around the bug.

" Bug action changed from “Dev - Open - To fix” to “QA - Open - Verify to close”

Please do not reply to this message

-------------------------------------------------------------------

The following items have been modified for this Bug:

 - Bug disposition changed from "Open issue" to "Bug - Fixed"

 - Bug action changed from "Dev - Open - To fix" to "QA - Open - Verify to close"

-------------------------------------------------------------------

Bug Information

-------------------------------------------------------------------

	   Requester: Eyal 

 Customer Bug ID: 

   NVIDIA Bug ID: 582591

			Date: 8/3/2009 7:49:55 AM

Company/Division: GPU Computing

		Severity: High

		Priority: 1 - High

		Synopsis: Problem accessing textures bound to huge arrays (>2GB) (C1060)

	 Description: Hi,

  My input is usually very big and I use textures to access it from the kernel. When the input is ~2GB of data (I use C1060) the texture access gives me back faulty results when  I try to access high positions in the texture.

Below is a test code and 4 output scenarios. 

  Any ideas whats going on? 

#define PARAM_COUNT 10

texture<float2, 1, cudaReadModeElementType> tex_LargeFloat2;

global void textureKernel( unsigned int iPos, float *fData_d, float *fOutputData1, float *fOutputData2 ) {

for ( unsigned int i = 0; i < PARAM_COUNT; i++ )

{

	unsigned ii = iPos + i;

	float2 fValue = tex1Dfetch( tex_LargeFloat2, ii );

	fOutputData1[ i ] = fValue.x;

	fOutputData2[ i ] = fValue.y;

// fOutputData1[ i ] = fData_d[ ii * 2 ];

// fOutputData2[ i ] = fData_d[ ii * 2 + 1 ];

}

}

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int

main(int argc, char** argv)

{

cudaSetDevice(1); // Use the C1060.

unsigned int iDeviceFreeMem = 0, iDeviceTotalMem = 0;

unsigned int iTextureOffset = 69872990; //57,935,200;

unsigned int iSamples = 2001;  

unsigned int iSize = 2 * 139676; 

iSize *= iSamples;		// iSize = 558,983,352

float *fDummy;

float *fData_h = new float[ iSize ];

float *fData_d = NULL;

float *fOutput_d1 = NULL;

float *fOutput_d2 = NULL;

float *fOutput_h1 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );

float *fOutput_h2 = ( float * )malloc( PARAM_COUNT * sizeof( float ) );

printf( "Size: [%u]\n", iSize );

for ( unsigned int i = 0; i < iSize; i++ )

	fData_h[ i ] = i * 1.f;

cudaMalloc( ( void ** )&( fDummy ), sizeof( float ) );

GPU_CHECK_ERR();

cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );

printf( "There should be [%u] pairs of float2.\n", iSize / 2 );

printf( "Preparing data for the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );

cudaMalloc( ( void ** )&( fData_d ), iSize * sizeof( float ) );

GPU_CHECK_ERR();

// Copy the actual data from the host pointer to the device pointer (cudaMemcpyHostToDevice).

cudaMemcpy( fData_d, fData_h, iSize * sizeof( float ), cudaMemcpyHostToDevice );

cudaMalloc( ( void ** )&( fOutput_d1 ), PARAM_COUNT * sizeof( float ) );

cudaMemset( fOutput_d1, 0, PARAM_COUNT * sizeof( float ) );

GPU_CHECK_ERR();

cudaMalloc( ( void ** )&( fOutput_d2 ), PARAM_COUNT * sizeof( float ) );

cudaMemset( fOutput_d2, 0, PARAM_COUNT * sizeof( float ) );

GPU_CHECK_ERR();

CUDA_SAFE_CALL( cudaBindTexture( 0, tex_LargeFloat2, fData_d, iSize ) );



cuMemGetInfo( &iDeviceFreeMem, &iDeviceTotalMem );

printf( "Calling the kernel, memory available: [%u/%u]\n", iDeviceFreeMem, iDeviceTotalMem );

textureKernel<<< 1, 1 >>>( iTextureOffset, fData_d, fOutput_d1, fOutput_d2 );

GPU_CHECK_ERR();



printf( "Copying output from the kernel\n" );

cudaMemcpy( fOutput_h1, &( fOutput_d1[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );

cudaMemcpy( fOutput_h2, &( fOutput_d2[ 0 ] ), PARAM_COUNT * sizeof( float ), cudaMemcpyDeviceToHost );

GPU_CHECK_ERR();

printf( "Kernel results - Offset [%u]:\n", iTextureOffset );

for( int i = 0; i < PARAM_COUNT; i++ )

	printf( "[%d]: [%.3f, %.3f] vs [%.3f, %.3f]\n", i, fOutput_h1[ i ], fOutput_h2[ i ], fData_h[ 2 * ( iTextureOffset + i ) ], fData_h[ 2 * ( iTextureOffset + i ) + 1 ] );

// Now clean up everything ….

cudaFree( fData_d ); cudaFree( fOutput_d1 ); cudaFree( fOutput_d2 );

free( fData_h ); free( fOutput_h1 ); free( fOutput_h2 );

And here are 4 outputs using different values for the <b>iTextureOffset</b> parameter (first column is data from the kernel and second column is from the host):

// All great…

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [0]:

[0]: [0.000, 1.000] vs [0.000, 1.000]

[1]: [2.000, 3.000] vs [2.000, 3.000]

[2]: [4.000, 5.000] vs [4.000, 5.000]

[3]: [6.000, 7.000] vs [6.000, 7.000]

[4]: [8.000, 9.000] vs [8.000, 9.000]

[5]: [10.000, 11.000] vs [10.000, 11.000]

[6]: [12.000, 13.000] vs [12.000, 13.000]

[7]: [14.000, 15.000] vs [14.000, 15.000]

[8]: [16.000, 17.000] vs [16.000, 17.000]

[9]: [18.000, 19.000] vs [18.000, 19.000]

// This one is still ok…

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69872900]:

[0]: [139745792.000, 139745808.000] vs [139745792.000, 139745808.000]

[1]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[2]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[3]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[4]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[5]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[6]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[7]: [139745808.000, 139745808.000] vs [139745808.000, 139745808.000]

[8]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[9]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

// Last item is faulty…

Size: [558983352]

There should be [279491676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69872910]:

[0]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[1]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[2]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[3]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[4]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[5]: [139745824.000, 139745824.000] vs [139745824.000, 139745824.000]

[6]: [139745824.000, 139745840.000] vs [139745824.000, 139745840.000]

[7]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]

[8]: [139745840.000, 139745840.000] vs [139745840.000, 139745840.000]

[9]: [0.000, 0.000] vs [139745840.000, 139745840.000]

// All items are faulty…

Size: [558983352]

There should be [279,491,676] pairs of float2.

Preparing data for the kernel, memory available: [4246142976/4294705152] Calling the kernel, memory available: [2010206208/4294705152] Copying output from the kernel Kernel results - Offset [69,872,990]:

[0]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[1]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[2]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[3]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[4]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[5]: [0.000, 0.000] vs [139745984.000, 139745984.000]

[6]: [0.000, 0.000] vs [139745984.000, 139746000.000]

[7]: [0.000, 0.000] vs [139746000.000, 139746000.000]

[8]: [0.000, 0.000] vs [139746000.000, 139746000.000]

[9]: [0.000, 0.000] vs [139746000.000, 139746000.000]


-------------------- Additional Information ------------------------ Computer Type: PC System Model Type: 

System Model Number: 

CPU Type: 

Video Memory Type: 

Chipset Mfg: 

Chipset Type: 

Sound Card: 

CPU Speed: 

Network: 

Modem: 

North Bridge: 

South Bridge: 

TV Encoder: 

Bus Type: AGP

OS Language: 

Application: 

Driver Version: 190

System BIOS Version: 

Video BIOS Mfg: 

Video BIOS Version: 

Direct X Version: 

Monitor Type: 

Monitor 1: 

Monitor 2: 

Monitor 3: 

Video 1: 

Video 2: 

Video 3: 

Resolution: 

Color Depth: 

Products: other

Application Version: 

Application Setting: 

Multithreaded Application: 

Other open applications: 

Release: 

OS Details: 

Problem Category: 

How often does problem occur: 

Video Memory Size: 

CPUs (single or multi): 

RAM (amount & type): 

AGP Aperture Size: 

-------------------------------------------------------------------

Latest Comment update from NVIDIA (8/3/2009 5:06:04 PM):

Will you attach a self contained example and information on the development environment you're using, in particular any special build options or compiler flags? Where is GPU_CHECK_ERR defined? We'll need to be able to reproduce this locally for further investigation.

Some update on this one:

It used to work with old driver/cuda version to bind large memory (> 512 MB) to texture. Here is some data points I collected

driver_version	 cuda_version	 status

190.32				 3.0 beta		   work

195.17				 3.0 beta		   work

195.36.15			 3.0				  fail

265.25				 3.1				   fail

For all version I ran on C1060/S1070.

Can someone from Nvidia please look into this issue? I really need to bind to > 512 MB memory in order to run large input data size

EDIT: I used

texture<float4, 1, cudaReadModeElementType> mytex;

so it should be able to bind >512 MB data