Texture contains wrong values Using tex1Dfetch

Hi. I’m storing sphere data (center co-ordinates and radius) with material data in a texture using the following code:

[codebox]hSpheres = (Sphere*)malloc(sizeof(Sphere)*hNumSpheres[0]);

for(int i = 0; i < hNumSpheres[0]; i++)

{

	temp =  1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp2 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp3 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	hSpheres[i].setCenter(temp, temp2, temp3);

	hSpheres[i].setRadius(0.5f);

	hSpheres[i].setMaterial(0, 0);

}

CUDA_SAFE_CALL(cudaMalloc(&device_spheres, hNumSpheres[0] * sizeof(float4)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, &hSpheres[0].sphereData, hNumSpheres[0] * sizeof(float4), cudaMemcpyHostToDevice));

sphere_texture_bind(device_spheres, 100 * sizeof(float4));

CUDA_SAFE_CALL(cudaMalloc(&device_spheresMaterials, hNumSpheres[0] * sizeof(uint2)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheresMaterials, &hSpheres[0].material, hNumSpheres[0] * sizeof(uint2), cudaMemcpyHostToDevice));

sphere_texture_bind(device_spheresMaterials, 100 * sizeof(uint2));

Sphere sphere;

sphere.sphereData = tex1Dfetch(spheres, j);

sphere.material = tex1Dfetch(spheresMaterials, j);

[/codebox]

Please help if you can because I’ve checked again and again and can’t see where I’m going wrong. Please check I am allocating the arrays properly and accessing them correctly. The spheres render, randomly as they should but there seem to be two (consistently) massive spheres in the center of the screen, even though the radius’ are all set at a constant 0.5f.

Anyone? It seems if I comment out the texture that stores the spheres materials then the spheres get rendered correctly, with all the same radius values, as they should be.

Have you tried debugging in emulation mode?

Yes, but for some reason when I set a breakpoint, i always = 0 and I can’t see what the values of the radius’ for the other spheres are. I doubt this would help though, because I know all the radius aren’t what they should be. I’ve checked at the point when the radius’ are set and copied to the texture memory and this seems ok, which leads me to believe it’s something to do with tex1Dfetch.

That’s because every thread will execute in an “emulated parallel,” so they will alternate execution. Just set the condition threadIdx.x == 0 in the breakpoint condition; that should clear it up.

Ok, I’ve debugged it now and it seems that both my sphereData and spheresMaterial textures seem to be interleaved. So if I do tex2Dfetch([name of texture], 0) the output is fine. If I do tex2Dfetch([name of texture], 1), then the first two elements of the float4 sphereData are actually the spheresMaterials values and the second two are correct. If I do tex2Dfetch([name of texture], 2), then the last two elements of the float4 sphereData are the spheresMaterials values and the first two are correct. when fetching from address 3, the values are correct again. What’s going on?! I’ve tried adding an offset to the spheresMateriall texture, but not sure what this does and didn’t seem to fix it. Should I be using the same address value to be fetching from both textures? Please check out the code above again if in doubt.

You haven’t posted the definition of your Sphere structure, so it is impossible to know what the problem is.

It appears that you have both center, radius and material in the Sphere struct. So when you

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, &hSpheres[0].sphereData, hNumSpheres[0] * sizeof(float4), cudaMemcpyHostToDevice));

it is expected that the data will be interleaved in the texture.

You probably want to allocate simple float4 staging arrays on the host that you can un-interleave the data into before copying to the device.

[quote name=‘MisterAnderson42’ post=‘506069’ date=‘Feb 15 2009, 04:02 PM’]

You haven’t posted the definition of your Sphere structure, so it is impossible to know what the problem is.

It appears that you have both center, radius and material in the Sphere struct. So when you

[codebox]texture spheres;

sphere.sphereData = tex1Dfetch(spheres, j);


float4 sphereData2[2000];

float2 materials1[2000];

for(int i = 0; i < hNumSpheres[0]; i++)

{

	temp =  1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp2 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp3 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	sphereData2[i].x = temp;sphereData2[i].y = temp2;sphereData2[i].z = temp3;sphereData2[i].w = 0.2f;

}

CUDA_SAFE_CALL(cudaMalloc( &device_spheres,  hNumSpheres[0] * sizeof(float4)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, sphereData2,hNumSpheres[0] * sizeof(float4), cudaMemcpyHostToDevice) );

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc();

CUDA_SAFE_CALL(cudaBindTexture(0, spheres, device_spheres, channelDesc) );

---------------[/codebox]

It seems now, when I step through the code and debug, that the values of the four floats are separated by two zeros each time. Here is a picture describing texture memory, where X is useful spehereData and - is crappy zeros:

XXXX–XXXX–XXXX. Obviously using tex1Dfetch, picks up these zeros and messes up the positions of my sphere. So my question is how to get rid of those zeros and why are they appearing?

Any NVIDIA employees? I’m beginning to think it’s a bug as there aren’t many examples around on the net (I’ve checked extensively) and I’ve no idea what the offset value is when doing cudaBindTexture and the text in the cuda guide is a bit vague on how to reference textures properly etc. Someone please help :(

The amount of code you posted is insufficient to make any accurate statements as to the cause of the problem. Nvidia employees will deal with a problem only if it reveals a bug in the CUDA libraries, but that most likely isn’t the case here.

Without the definition of Sphere, and a kernel, there’s not much I can say about your code.
And the declaration of hNumSpheres would help a bit in trying to understand what you’re trying to do.

Ok, here is all the code you ask for… please work your magic, guys!

[codebox]

int* hNumSpheres;

void setupRaytracer(int pbo_in, int pbo_out, int width, int height, float cameraEye[3], float rotate[3], int rcMoveX, int rcMoveY, int prev_rcMoveX, int prev_rcMoveY, int move)

{

int* out_data;

dim3 block(8, 8, 1);

dim3 grid(width / block.x, height / block.y, 1);

CUDA_SAFE_CALL(cudaGLMapBufferObject( (void**)&out_data, pbo_out));

build(width, height, cameraEye, rotate, rcMoveX, rcMoveY, prev_rcMoveX, prev_rcMoveY, move);

render<<< grid, block>>>(out_data);

CUDA_SAFE_CALL(cudaGLUnmapBufferObject(pbo_out));

}

void initCuda(int argc, char **argv, int width, int height, float cameraEye[3], float rotate[3])

{

CUT_DEVICE_INIT(argc, argv);

float temp, temp2, temp3;

//----------------Define number of spheres and pLights---------------//

hNumSpheres = (int*)malloc(sizeof(int));

hNumSpheres[0] = 2000;

hNumPlanes = (int*)malloc(sizeof(int));

hNumPlanes[0] = 1;

hNumLights = (int*)malloc(sizeof(int));

hNumLights[0] = 1;

//-------------------------Set up viewplane-------------------------//

hVP = (ViewPlane*)malloc(sizeof(ViewPlane));

hVP->setRes(width, height);

hVP->set_samples(1);

hVP->maxDepth = 1;

//--------------------------Set up camera---------------------------//

hCamera = (Pinhole*)malloc(sizeof(Pinhole));

hCamera->setEye(cameraEye[0], cameraEye[1], cameraEye[2]);

hCamera->setD(500);

hCamera->u = make_float3(-1, 0, 0);

hCamera->v = make_float3(0, 1, 0);

hCamera->w = make_float3(0, 0, -1);

hCamera->rayRotationMatrix = IdentityMatrix3x3();

hCamera->rotate(rotate[1], rotate[0], rotate[2]);

set_uw(hCamera->u.x, hCamera->u.y, hCamera->u.z, hCamera->w.x, hCamera->w.y, hCamera->w.z);

hCamera->set_exposure_time(1.0);

//---------------------------Set up tracer---------------------------//

hWhitted = (Whitted*)malloc(sizeof(Whitted));

//--------------------------Set up pLights---------------------------//

hpLights = (PointLight*)malloc(sizeof(PointLight)*hNumLights[0]);

hpLights[0].setLocation(10, 10, -5);

hpLights[0].scaleRadiance(5.0);

hpLights[0].setColour(1, 1, 1);

hpLights[0].shadows = false;

hpLights[0].setLocation(-10, 10, -7);

hpLights[0].scaleRadiance(5.0);

hpLights[0].setColour(1, 1, 1);

hpLights[0].shadows = false;



hAmbLight = (AmbientLight*)malloc(sizeof(AmbientLight));

hAmbLight->setColour(1, 1, 1);

hAmbLight->scaleRadiance(0.2f);

//-------------------------Set up materials-------------------------//

hMaterials = (Material*)malloc(sizeof(Material)*1);

hMaterials[0].setCD(0.5f, 0.2f, 0.5f);

hMaterials[0].setKD(0.65f);

hMaterials[0].setKA(0.25f);

//--------------------------Set up spheres--------------------------//

hSpheres = (Sphere*)malloc(sizeof(Sphere)*hNumSpheres[0]);

float4 sphereData2[2000];

float2 materials1[2000];

for(int i = 0; i < hNumSpheres[0]; i++)

{

	temp =  1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp2 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp3 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	hSpheres[i].setCenter(temp, temp2, temp3);

	hSpheres[i].setRadius(0.2f);

	hSpheres[i].setMaterial(0, 0);

	sphereData2[i].x = temp;sphereData2[i].y = temp2;sphereData2[i].z = temp3;sphereData2[i].w = 0.2f;

	materials1[i].x = 1; materials1[i].y = 1;

}



hPlanes = (Plane*)malloc(sizeof(Plane)*hNumPlanes[0]);

hPlanes[0].a = make_float3(0, 0, 0);

hPlanes[0].n = make_float3(0, 1, 0);

hPlanes[0].setMaterial(2, 0);

CUDA_SAFE_CALL(cudaMalloc( &device_spheres,  hNumSpheres[0] * sizeof(float4)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, sphereData2,hNumSpheres[0] * sizeof(float4), cudaMemcpyHostToDevice) );

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc();

CUDA_SAFE_CALL(cudaBindTexture(0, spheres, device_spheres, channelDesc) );

/*CUDA_SAFE_CALL(cudaMalloc(&device_spheresMaterials, hNumSpheres[0] * sizeof(float2)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheresMaterials, &materials1, hNumSpheres[0] * sizeof(float2), cudaMemcpyHostToDevice));

cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc<float2>();

CUDA_SAFE_CALL(cudaBindTexture(0, spheresMaterials, device_spheresMaterials, channelDesc2) );*/

//------------Copy spheres to constant memory on device-------------//

CUDA_SAFE_CALL(cudaMemcpyToSymbol(numLights, hNumLights, sizeof(int)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(numPlanes, hNumPlanes, sizeof(int)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(numSpheres, hNumSpheres, sizeof(int)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(vp, hVP, sizeof(ViewPlane)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(camera, hCamera, sizeof(Pinhole)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(tracer, hWhitted, sizeof(Whitted)))

CUDA_SAFE_CALL(cudaMemcpyToSymbol(materials, hMaterials, sizeof(Material)*1));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(planes, hPlanes, sizeof(Plane)*hNumPlanes[0]));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(ambientLight, hAmbLight, sizeof(AmbientLight)));

CUDA_SAFE_CALL(cudaMemcpyToSymbol(pLights, hpLights, sizeof(PointLight)*hNumLights[0]));

}


#ifndef SPHERE

define SPHERE

struct Sphere{

float4 	sphereData;

float2   material;



__host__ __device__ void setMaterial(int i, int t)

{

	material.x = i;

	material.y = t;

}

__host__ __device__ void	setCenter(const float x, const float y, const float z)

{

	sphereData.x = x; sphereData.y = y; sphereData.z = z;

}

__host__ __device__ void	setRadius(const float r)

{

	sphereData.w = r;

}

__host__ __device__ bool shadowHit(const Ray& ray, float& tmin) const

{

	float 		t;

	float3   	temp;

	temp.x =	ray.o.x - sphereData.x; temp.y =	ray.o.y - sphereData.y; temp.z =	ray.o.z - sphereData.z;

	float 		a 		= dot(ray.d, ray.d);

	float 		b 		= 2.0f * dot(temp, ray.d);

	float 		c 		= dot(temp, temp) - sphereData.w * sphereData.w;

	float 		disc	= b * b - 4.0f * a * c;

	if (disc < 0.0f)

		return(false);

	else {	

		float e = sqrt(disc);

		float denom = 2.0f * a;

		t = (-b - e) / denom;    // smaller root

		if (t > kEpsilon) {

			tmin = t;

			return (true);

		} 

		t = (-b + e) / denom;    // larger root

		if (t > kEpsilon) {

			return (true);

		} 

	}

	return (false);

}

__host__ __device__ bool hit(const Ray& ray, float& tmin, ShadeRec& sr) const

{

	float 		t;

	float3   	temp;

	temp.x =	ray.o.x - sphereData.x; temp.y =	ray.o.y - sphereData.y; temp.z =	ray.o.z - sphereData.z;

	float 		a 		= dot(ray.d, ray.d);

	float 		b 		= 2.0f * dot(temp, ray.d);

	float 		c 		= dot(temp, temp) - sphereData.w * sphereData.w;

	float 		disc	= b * b - 4.0f * a * c;

	if (disc < 0.0f)

		return(false);

	else {	

		float e = sqrt(disc);

		float denom = 2.0f * a;

		t = (-b - e) / denom;    // smaller root

		if (t > kEpsilon) {

			tmin = t;

			sr.normal 	 = (temp + t * ray.d) / sphereData.w;

			sr.localHitPoint = ray.o + t * ray.d;

			return (true);

		} 

		t = (-b + e) / denom;    // larger root

		if (t > kEpsilon) {

			tmin = t;

			sr.normal   = (temp + t * ray.d) / sphereData.w;

			sr.localHitPoint = ray.o + t * ray.d;

			return (true);

		} 

	}

	return (false);

}

};

endif[/codebox]

Your problem most likely stems from the incorrect use of cudaBindTexture.

You have

[codebox]

texture<float4, 1, cudaReadModeElementType> spheres;

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc();

CUDA_SAFE_CALL(cudaBindTexture(0, spheres, device_spheres, channelDesc) );

[/codebox]

You are doing a mix of the low-level and high-level texture APIs.

This what the programming guide says:

I’m surpsied if it compiles. Get rid of channelDesc and pass the size in bytes, (*hNumSpheres) * sizeof(float4), as the 4th parameter.

Let us know if this clears the problem.

Ok, I did the code like you said but am still getting pairs of zeros in my output from tex1Dfetch, so it is still messing up the placement of the sphere as sometimes the x and y, or z and radius co-ordinates equal zero. Was still getting interleaved output due to both sphere data members, so as you (or someoen else suggested) I separated the data in memory before copying to device. Next problem was that because my spheresMaterials was float2, this was being separated by pairs of zeros, so I had to do this:

[codebox]sphere.sphereData = tex1Dfetch(spheres, j);

	sphere.material = tex1Dfetch(spheresMaterials, j*2);

[/codebox]

Why do I have to do this? Can’t I just use j without multiplying by 2 to get the right results?

Here’s how I generate the sphere data now:

[codebox]float4 sphereData2[2000];

float4 materials1[2000];

for(int i = 0; i < hNumSpheres[0]; i++)

{

	temp =  1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp2 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	temp3 = 1.0f - 20.0f*(float)rand()/(float)RAND_MAX;

	hSpheres[i].setCenter(temp, temp2, temp3);

	hSpheres[i].setRadius(0.2f);

	hSpheres[i].setMaterial(0, 0);

	sphereData2[i].x = temp;sphereData2[i].y = temp2;sphereData2[i].z = temp3;sphereData2[i].w = 0.2f;

	materials1[i].x = 0; materials1[i].y = 0;

}

CUDA_SAFE_CALL(cudaMalloc( &device_spheres,  (*hNumSpheres) * sizeof(float4)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, sphereData2, (*hNumSpheres) * sizeof(float4), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaBindTexture(0, spheres, device_spheres, (*hNumSpheres) * sizeof(float4)) );

CUDA_SAFE_CALL(cudaMalloc( &device_spheresMaterials,  (*hNumSpheres) * sizeof(float2)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheresMaterials, materials1, (*hNumSpheres) * sizeof(float2), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaBindTexture(0, spheresMaterials, device_spheresMaterials, (*hNumSpheres) * sizeof(float2)) );[/codebox]

Your code snippets still really don’t give us anything to help you debug. Where is the kernel that reads the data? Maybe the problem is there. Where are the texture definitions? Maybe you are declaring a float texture and trying to read it as a float2?

What we really need a complete and SIMPLE example that we can compile and run ourselves. Only then can we do anything except guess at the problem.

What you really need to do is to keep removing stuff from your code until you have the bare minimum simple piece of code that demonstrates the problem. Then the problem will become obvious and you will solve it and won’t actually need our help :) This is standard problem-solving methodology: find the root cause, then fix it. Since you seem unwilling to do so, I will be very nice and give you a simple fully working example that demonstrates the proper use of textures.

First and foremost, there is no problem with textures in CUDA. You can and should be able to do what you want to do, there is just one or more bugs in your code.

Here is the example: it can be compiled with a simple nvcc -o test test.cu so you can test it yourself. It runs without any gaps of 0’s in the output.

#include "stdio.h"

texture<float4, 1, cudaReadModeElementType> tex;

__global__ void test_read(float4 *d_out)

	{

	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

	d_out[idx] = tex1Dfetch(tex, idx);

	}

int main()

	{

	int block_size = 32;

	int n_blocks = 4;

	int N = block_size * n_blocks;

	float4 *d_in, *d_out, *h_in, *h_out;

	h_in = (float4*)malloc(sizeof(float4)*N);

	h_out = (float4*)malloc(sizeof(float4)*N);

	cudaMalloc((void**)&d_in, sizeof(float4)*N);

	cudaMalloc((void**)&d_out, sizeof(float4)*N);

	// fill out data

	for (unsigned int i = 0; i < N; i++)

		{

		h_in[i].x = (float)(i*4);

		h_in[i].y = (float)(i*4+1);

		h_in[i].z = (float)(i*4+2);

		h_in[i].w = (float)(i*4+3);

		}

	cudaMemcpy(d_in, h_in, sizeof(float4)*N, cudaMemcpyHostToDevice);

	cudaBindTexture(0, tex, d_in);

	test_read<<<n_blocks, block_size>>>(d_out);

	cudaMemcpy(h_out, d_out, sizeof(float4)*N, cudaMemcpyDeviceToHost);

	// print out data

	for (unsigned int i = 0; i < N; i++)

		{

		printf("%f %f %f %f\n", h_out[i].x, h_out[i].y, h_out[i].z, h_out[i].w);

		}

	free(h_in);

	free(h_out);

	cudaFree(d_in);

	cudaFree(d_out);

	return 1;

	}

Thanks for your help guys. Sorry about not posting more code but my code is compex and spead out over about 16 files and would take all year for me to paste it here and for you to try and run.

Mister Anderson - I have fixed the problem with being able to read from one float4 texture. But reading from a second float2 texture seems to be my problem at the moment. Any chance you could adapt your example to read from a float4 texture and and float2 texture and see what you get? Like mine:

[codebox]

texture<float4, 1, cudaReadModeElementType> spheres;

texture<float2, 1, cudaReadModeElementType> spheresMaterials;

CUDA_SAFE_CALL(cudaMalloc( &device_spheres,  (*hNumSpheres) * sizeof(float4)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheres, sphereData2, (*hNumSpheres) * sizeof(float4), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaBindTexture(0, spheres, device_spheres, (*hNumSpheres) * sizeof(float4)) );

CUDA_SAFE_CALL(cudaMalloc( &device_spheresMaterials,  (*hNumSpheres) * sizeof(float2)));

CUDA_SAFE_CALL(cudaMemcpy(device_spheresMaterials, materials1, (*hNumSpheres) * sizeof(float2), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaBindTexture(0, spheresMaterials, device_spheresMaterials, (*hNumSpheres) * sizeof(float2)) );

sphere.sphereData = tex1Dfetch(spheres, j);

	sphere.material = tex1Dfetch(spheresMaterials, j*2);

[/codebox]

I’m trying to implement a grid spatial subdivision structure with variable amounts of cells and variable amounts of objects in those cells. The plan is for this structure to then be copied into texture memory boo hoo :(. More debugging I think. I would have used global memory if it was easier but I don’t want to pass my structure to the 100 odd functions that use them

Sheesh. Just do it yourself. It will take all of 20 seconds to change float4 to float2 and comment out the reads/writes of .z and .w.

Unless you can break your problem down into a smaller manageable reproduction case you will never find the bug.

I used to be just like him when I emabrked on tackling with CUDA. this “stuff” is very confusing at first.

OK, fine then. Only because Mr_Nuke asked nicely

And I guess I was wrong. It actually took 25 seconds to modify the code.

#include "stdio.h"

texture<float2, 1, cudaReadModeElementType> tex;

__global__ void test_read(float2 *d_out)

	{

	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

	d_out[idx] = tex1Dfetch(tex, idx);

	}

int main()

	{

	int block_size = 32;

	int n_blocks = 4;

	int N = block_size * n_blocks;

	float2 *d_in, *d_out, *h_in, *h_out;

	h_in = (float2*)malloc(sizeof(float2)*N);

	h_out = (float2*)malloc(sizeof(float2)*N);

	cudaMalloc((void**)&d_in, sizeof(float2)*N);

	cudaMalloc((void**)&d_out, sizeof(float2)*N);

	// fill out data

	for (unsigned int i = 0; i < N; i++)

		{

		h_in[i].x = (float)(i*4);

		h_in[i].y = (float)(i*4+1);

		}

	cudaMemcpy(d_in, h_in, sizeof(float2)*N, cudaMemcpyHostToDevice);

	cudaBindTexture(0, tex, d_in);

	test_read<<<n_blocks, block_size>>>(d_out);

	cudaMemcpy(h_out, d_out, sizeof(float2)*N, cudaMemcpyDeviceToHost);

	// print out data

	for (unsigned int i = 0; i < N; i++)

		{

		printf("%f %f\n", h_out[i].x, h_out[i].y);

		}

	free(h_in);

	free(h_out);

	cudaFree(d_in);

	cudaFree(d_out);

	return 1;

	}