Actual max texture dimension size of Tesla C2050

Hi everyone:

I have a question on the maximum texture dimension size of Tesla C2050.
“deviceQuery” program gives the following information:

Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)

Does this size apply only to CUDA arrays, or does it apply to both linear memory and CUDA arrays?

As far as I know, the maximum width for a 1D texture reference bound to linear memory is 2^27 elements, so the numbers above should apply only to CUDA arrays. On the Tesla C2050, however, if I bind a 1D texture to linear memory whose width is larger than 65536, the program sometimes runs incorrectly, depending on the combination of other texture memory usage.
Of course, the error may not be related to texture memory at all, but I would like to know the actual texture limits so that I can debug this correctly.

Thanks.

The texture dimension limits shown look correct. Note that the limit of 2^27 elements applies to tex1Dfetch(). This is not a true texture access: the data is delivered through the texture cache, but it does not go through the texture filtering hardware, so it differs from tex1D(). The limits returned by deviceQuery appear to apply to tex1D(), tex2D(), etc.
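To illustrate the distinction (a hedged sketch; the texture names and kernel are hypothetical, not from the thread):

```cuda
// Sketch contrasting the two 1D access paths discussed above.
// Names are hypothetical.
texture<float, 1, cudaReadModeElementType> tex_linear; // bound to linear memory
texture<float, 1, cudaReadModeElementType> tex_array;  // bound to a CUDA array

__global__ void demo(float *out)
{
    // tex1Dfetch(): integer index, linear memory, up to 2^27 elements,
    // no filtering.
    out[0] = tex1Dfetch(tex_linear, 42);

    // tex1D(): floating-point coordinate, CUDA array, subject to the
    // dimension limits reported by deviceQuery, optionally filtered.
    out[1] = tex1D(tex_array, 42.5f);
}
```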

Note that the size of the underlying storage may well exceed the maximum texture dimension: a large vector or matrix can be covered with multiple textures (at appropriate offsets), each of which is individually restricted to the stated dimensions.
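For example, a 1D vector larger than the per-texture limit could be covered along these lines (a sketch under the assumptions above; all names are hypothetical):

```cuda
// Sketch: covering one large device allocation with two 1D textures,
// each bound at a different offset and individually within the limit.
texture<float, 1, cudaReadModeElementType> tex_lo; // elements [0, HALF)
texture<float, 1, cudaReadModeElementType> tex_hi; // elements [HALF, N)

#define N    (1 << 28)  // total elements, larger than one texture allows
#define HALF (N / 2)

__device__ float fetch(int i)
{
    // Route each index to the texture that covers it.
    return (i < HALF) ? tex1Dfetch(tex_lo, i)
                      : tex1Dfetch(tex_hi, i - HALF);
}

/* Host side:
   float *d_data;
   cudaMalloc((void **)&d_data, N * sizeof(float));
   cudaBindTexture(0, tex_lo, d_data,        HALF * sizeof(float));
   cudaBindTexture(0, tex_hi, d_data + HALF, HALF * sizeof(float));
*/
```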

[Later]

On checking appendix F of the Programming Guide, I see that it distinguishes between 1D textures bound to linear memory and 1D textures bound to CUDA arrays, and the limit stated there is not 65536. I'll seek clarification. From personal use of the C2050 I know that tex1Dfetch() is limited to 2^27 elements, and that 2D textures are limited to 65536x65535 regardless of whether they are mapped to (pitch-)linear memory or CUDA arrays.

Then, is there any limit on the number of texture fetches using tex1Dfetch() per kernel?
The reason for the question is the following program: it runs correctly on a Quadro FX 5600 (Driver API v2.0) and on ION (Driver API v3.2). On a Tesla C2050 (Driver API v4.0), however, depending on the number of texture-fetched variables, it may generate incorrect output.
(In the main_kernel0() function in the code below, four variables (rowptr, colind, values, and x) are fetched through tex1Dfetch(). On the Tesla, if at least one of these variables is not fetched through the texture function, the program runs correctly, but if all four are fetched together, it produces incorrect results. This problem occurs only on the Tesla C2050.)
Can you find any bug in the program, or explain why this weird behavior occurs?

Thanks.

==================

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <cutil.h>

static unsigned int gpuBytes = 0;

texture<int, 1, cudaReadModeElementType> texture__colind;
texture<int, 1, cudaReadModeElementType> texture__rowptr;
texture<float, 1, cudaReadModeElementType> texture__values;
texture<float, 1, cudaReadModeElementType> texture__x;
texture<float, 1, cudaReadModeElementType> texture__y;
int * gpu__colind__main;
int * gpu__rowptr__main;
float * gpu__values__main;
float * gpu__x__main;
float * gpu__y__main;

extern double timer_();

int colind[1853104];
int rowptr[(14000+1)];
float values[1853104];
float x[14000];
float y[14000];

__global__ void main_kernel0(int * colind, int * rowptr, float * values, float * x, float * y)
{
int i;
int j;
float temp;
float tx;
int ii;
int _bid = (blockIdx.x+(blockIdx.y*gridDim.x));
int _gtid = (threadIdx.x+(_bid*blockDim.x));
i=_gtid;
if (i<14000)
{
temp=0.0F;
//for (j=rowptr[i]; j<rowptr[(1+i)]; j ++ )
for (j=tex1Dfetch(texture__rowptr, i); j<tex1Dfetch(texture__rowptr, (1+i)); j ++ )
{
ii = tex1Dfetch(texture__colind, (j-1))-1;
//ii = colind[(j-1)]-1;
tx = tex1Dfetch(texture__x, ii);
//tx = x[ii];
temp=(temp+(tex1Dfetch(texture__values, (j-1))*tx));
//temp=(temp+(values[(j-1)]*tx));
}
y[i]=temp;
}
}

__global__ void main_kernel1(float * x, float * y)
{
int exp0;
int i;
int j;
int _bid = (blockIdx.x+(blockIdx.y*gridDim.x));
int _gtid = (threadIdx.x+(_bid*blockDim.x));
i=_gtid;
if (i<14000)
{
exp0=((int)log10f(fabsf(tex1Dfetch(texture__y, i))));
x[i]=tex1Dfetch(texture__y, i);
if ((( - exp0)<=0))
{
for (j=1; j<=(1+exp0); j ++ )
{
x[i]=(x[i]/10.0F);
}
}
else
{
if (((1+exp0)<=0))
{
j=( - 1);
for (j=1; j<=( - exp0); j ++ )
{
x[i]=(10.0F*x[i]);
}
}
}
}
}

int main()
{
FILE * fp10;
char filename1[96] = "/home/f6l/CUDAInput/";
char filename2[32] = "appu.rbC";
float temp;
float x_sum;
double s_time1;
double e_time1;
double s_time2;
double e_time2;
double s_time3;
double e_time3;
int exp0;
int i;
int j;
int k;
int r_ncol;
int r_nnzero;
int r_nrow;
int _ret_val_0;

	////////////////////////////////
	// CUDA Device Initialization //
	////////////////////////////////
	int deviceCount;
	CUDA_SAFE_CALL_NO_SYNC(cudaGetDeviceCount(&deviceCount));
	if (deviceCount == 0) {
			fprintf(stderr, "cutil error: no devices supporting CUDA.\n");
			exit(EXIT_FAILURE);
	}
	int dev = 0;
	cudaDeviceProp deviceProp;
	CUDA_SAFE_CALL_NO_SYNC(cudaGetDeviceProperties(&deviceProp, dev));
	fprintf(stderr, "Using device %d: %s\n", dev, deviceProp.name);
	CUDA_SAFE_CALL(cudaSetDevice(dev));

	gpuBytes=(1853104*sizeof (int));
	CUDA_SAFE_CALL(cudaMalloc(((void *  * )( & gpu__colind__main)), gpuBytes));
	cudaBindTexture(0, texture__colind, gpu__colind__main, gpuBytes);
	gpuBytes=((14000+1)*sizeof (int));
	CUDA_SAFE_CALL(cudaMalloc(((void *  * )( & gpu__rowptr__main)), gpuBytes));
	cudaBindTexture(0, texture__rowptr, gpu__rowptr__main, gpuBytes);
	gpuBytes=(1853104*sizeof (float));
	CUDA_SAFE_CALL(cudaMalloc(((void *  * )( & gpu__values__main)), gpuBytes));
	cudaBindTexture(0, texture__values, gpu__values__main, gpuBytes);
	gpuBytes=(14000*sizeof (float));
	CUDA_SAFE_CALL(cudaMalloc(((void *  * )( & gpu__x__main)), gpuBytes));
	cudaBindTexture(0, texture__x, gpu__x__main, gpuBytes);
	gpuBytes=(14000*sizeof (float));
	CUDA_SAFE_CALL(cudaMalloc(((void *  * )( & gpu__y__main)), gpuBytes));
	gpuBytes=(14000*sizeof (float));
	cudaBindTexture(0, texture__y, gpu__y__main, gpuBytes);
	printf("**** SerialSpmul starts! ****\n");
	strcat(filename1, filename2);
	printf("Input file: %s\n", filename2);
	s_time1=timer_();
	s_time2=timer_();
	if (((fp10=fopen(filename1, "r"))==((void * )0)))
	{
			printf("FILE %s DOES NOT EXIST; STOP\n", filename1);
			exit(1);
	}
	printf("FILE open done\n");
	fscanf(fp10, "%d %d %d", ( & r_nrow), ( & r_ncol), ( & r_nnzero));
	if ((r_nrow!=14000))
	{
			printf("alarm: incorrect row\n");
			exit(1);
	}
	if ((r_ncol!=14000))
	{
			printf("alarm: incorrect col\n");
			exit(1);
	}
	if ((r_nnzero!=1853104))
	{
			printf("alarm: incorrect nzero\n");
			exit(1);
	}
	for (i=0; i<=14000; i ++ )
	{
			fscanf(fp10, "%d", (rowptr+i));
	}
	for (i=0; i<1853104; i ++ )
	{
			fscanf(fp10, "%d", (colind+i));
	}
	for (i=0; i<1853104; i ++ )
	{
			fscanf(fp10, "%E", (values+i));
			/* for float variables */
	}
	fclose(fp10);
	j=0;
	for (i=0; i<14000; i ++ )
	{
			{
LB99:
			temp=values[j];
			}
			if (((( - 0.1F)<temp)&&(temp<0.1F)))
			{
					j+=1;
					/* goto LB99; */
					/* Added by SYLee */
					if ((temp==0.0F))
					{
							goto LB99;
					}
					x[i]=temp;
					continue;
			}
			exp0=((int)log10f(fabsf(temp)));
			x[i]=temp;
			if ((( - exp0)<=0))
			{
					for (k=1; k<=(1+exp0); k ++ )
					{
							x[i]=(x[i]/10.0F);
					}
			}
			else
			{
					if (((1+exp0)<=0))
					{
							k=( - 1);
							for (k=1; k<=( - exp0); k ++ )
							{
									x[i]=(10.0F*x[i]);
							}
					}
			}
			if (((1.0F<x[i])||(x[i]<( - 1.0F))))
			{
					printf("alarm initial i = %d\n", i);
					printf("x = %E\n", x[i]);
					printf("value = %E\n", values[(1000+i)]);
					printf("exp = %d\n", exp0);
					exit(1);
			}
			j+=1;
	}
	printf("initialization done\n");
e_time2=timer_();
s_time3=timer_();

	dim3 dimBlock0(384, 1, 1);
	dim3 dimGrid0(37, 1, 1);
	gpuBytes=(1853104*sizeof (int));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__colind__main, colind, gpuBytes, cudaMemcpyHostToDevice));
	gpuBytes=((14000+1)*sizeof (int));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__rowptr__main, rowptr, gpuBytes, cudaMemcpyHostToDevice));
	gpuBytes=(1853104*sizeof (float));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__values__main, values, gpuBytes, cudaMemcpyHostToDevice));
	gpuBytes=(14000*sizeof (float));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__x__main, x, gpuBytes, cudaMemcpyHostToDevice));
	dim3 dimBlock1(384, 1, 1);
	dim3 dimGrid1(37, 1, 1);
	//cudaThreadSynchronize();
	for (k=0; k<500; k ++ )
	{
			main_kernel0<<<dimGrid0, dimBlock0, 0, 0>>>(gpu__colind__main, gpu__rowptr__main, gpu__values__main, gpu__x__main, gpu__y__main);
			//cudaThreadSynchronize();
			main_kernel1<<<dimGrid1, dimBlock1, 0, 0>>>(gpu__x__main, gpu__y__main);
			//cudaThreadSynchronize();
	}
	//cudaThreadSynchronize();
	gpuBytes=(14000*sizeof (float));
	CUDA_SAFE_CALL(cudaMemcpy(x, gpu__x__main, gpuBytes, cudaMemcpyDeviceToHost));
	e_time3=timer_();
	e_time1=timer_();
	printf("Total elapsed time = %f seconds\n", (e_time1-s_time1));
	printf("Initialize time = %f seconds\n", (e_time2-s_time2));
	printf("Main Comp time = %f seconds\n", (e_time3-s_time3));
	x_sum=0.0F;
	for (i=0; i<14000; i ++ )
	{
			x_sum+=x[i];
	}
	printf("%d: x_sum = %.12E\n", (k+1), x_sum);
	_ret_val_0=0;
	CUDA_SAFE_CALL(cudaFree(gpu__colind__main));
	CUDA_SAFE_CALL(cudaFree(gpu__rowptr__main));
	CUDA_SAFE_CALL(cudaFree(gpu__values__main));
	CUDA_SAFE_CALL(cudaFree(gpu__x__main));
	CUDA_SAFE_CALL(cudaFree(gpu__y__main));
	fflush(stdout);
	fflush(stderr);
	return _ret_val_0;

}

tex1Dfetch() calls map to texture instructions at the machine level (you can use cuobjdump if you want to see details at that level), so there is no limit on how many instances of tex1Dfetch() there can be in a kernel.

I don’t spot anything obviously wrong with the code. One thing to watch out for is writes to the storage underlying a texture, followed by reading the texture. Since the texture cache is not coherent, such a texture access may or may not reflect the update to the underlying storage, i.e. the behavior becomes undefined. I don’t spot any instance of that, but you may want to double check.
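The hazard in question would look something like this (hypothetical sketch, not code from the program above):

```cuda
// Sketch of the incoherent-read hazard described above (names hypothetical).
texture<float, 1, cudaReadModeElementType> tex_a; // bound to the array 'a'

__global__ void hazard(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = 2.0f * a[i];              // write to the underlying storage
        float v = tex1Dfetch(tex_a, i);  // UNDEFINED: the texture cache is
                                         // not coherent with the write above
        a[i] = v;
    }
}
// The safe pattern is to put the write and the texture read in separate
// kernel launches, since the texture cache is invalidated between launches.
```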

You ran code compiled with the CUDA 3.2 toolchain on ION and it worked, and ran the same code compiled with the CUDA 4.0 toolchain on C2050 and it failed. Have you tried running the code compiled with the CUDA 3.2 compiler on the C2050 for a more tightly controlled experiment?

I would recommend adding error checking to every API call, to ensure that nothing unexpected happens, such as a failing cudaBindTexture() call, or cudaBindTexture() returning a non-zero offset.
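A minimal sketch of that style of checking (the CHECK macro is a hypothetical helper, not part of the original program):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical macro: abort with a message if a runtime API call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Usage: capture the offset instead of passing 0, and verify it.
   size_t offset = 0;
   CHECK(cudaBindTexture(&offset, texture__colind, gpu__colind__main, gpuBytes));
   if (offset != 0) {
       fprintf(stderr, "non-zero texture offset %zu\n", offset);
       exit(EXIT_FAILURE);
   }
*/
```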

I tested the code with error checking, but no error was caught; I also tried cuda-gdb and cuda-memcheck, and no error was found. The returned offset was zero in all binding calls.

About the texture cache coherency issue: the underlying storage is read in one kernel and written in the other kernel. As far as I know, separate kernel launches always ensure global synchronization, so texture cache coherency should not be a problem either, as long as the writes and the reads are done in separate kernels. Is there something wrong with this assumption?

Thanks.

[This is added later]

I also tested it on the C2050 with Driver V4.0 and Runtime V3.2, but the same problem occurred, and I verified that the problem is reproduced on another Tesla C2050 GPU with Driver V4.0 and Runtime V4.0.

One interesting thing is that if I change the for-loop in the main_kernel0() function from form1 to form2, as shown below,
no error occurs, which may mean that the texture memory access pattern in the for-loop triggers the bug.

[Form1]
for (j=tex1Dfetch(texture__rowptr, i); j<tex1Dfetch(texture__rowptr, (1+i)); j ++ ) {
ii = tex1Dfetch(texture__colind, (j-1))-1;
tx = tex1Dfetch(texture__x, ii);
temp=(temp+(tex1Dfetch(texture__values, (j-1))*tx));
}

[Form2]
for (j=0; j < (tex1Dfetch(texture__rowptr, (i+1)) - tex1Dfetch(texture__rowptr, i)); j ++ ) {
ii = tex1Dfetch(texture__colind, (tex1Dfetch(texture__rowptr, i)+j-1))-1;
tx = tex1Dfetch(texture__x, ii);
temp=(temp+(tex1Dfetch(texture__values, (tex1Dfetch(texture__rowptr, i)+j-1))*tx));
}

Can anyone explain this behavior?

I encountered the same problem.
