Problem with texture memory


I’m currently working on using BVH for Frustum Culling with CUDA…

For now I’m using the hierarchy nodes from global memory.

I was guessing… maybe if I used texture memory instead, I would get better performance (since traversal results in a lot of random access to the nodes)…

I’m preparing the texture memory like this :


//Prepare texture memory for hierarchy

texture<hnode_t, 1, cudaReadModeElementType> texHIERARCHY;

cutilSafeCall(cudaBindTexture(0, texHIERARCHY, thrust::raw_pointer_cast(d_HIERARCHY), sizeof(hnode_t)*LBVH_compute_hierachy_mem_size()));



while hnode_t is like this :

[codebox]//Hierarchy node structure

typedef struct hnode


unsigned int splitLevel;

unsigned int primStart;

unsigned int primStop;

unsigned int ID;

unsigned int childrenStart;

unsigned int childrenStop;

bool visible;

aabb_t bbox;

} hnode_t;


Size of the structure is 52 bytes (got this from sizeof()).

When I execute this code… I get this error :

[codebox]First-chance exception at 0x7c812afb in TestEnvironment.exe: Microsoft C++ exception: cudaError at memory location 0x0012f9a8…[/codebox]

What am I doing wrong ?

I already used texture fetches for another project, and it was working just perfect…

Thanks for your help !

Quoting the Developer Manual for CUDA 2.3: Texture Reference Declaration
Some of the attributes of a texture reference are immutable and must be known at
compile time; they are specified when declaring the texture reference. A texture
reference is declared at file scope as a variable of type texture:
texture<Type, Dim, ReadMode> texRef;
􀂉 Type specifies the type of data that is returned when fetching the texture; Type
is restricted to the basic integer and single-precision floating-point types and any
of the 1-, 2-, and 4-component vector types defined in Section B.3.1;

Okay si it seems I can’t use texture memory for the whole struct…

Problem unsolved. But at least I know why it doesn’t work.

You can always bind an int/float texture to any structure you like and then once in shared memory/registers cast it to your structure. Damn ugly but should work ;)

guess so… As I only have floats and unsigned ints in my struct… hum. Are both encoded with 4 bytes on device memory ?

SoA is defenitely the way to go on GPU. That way you can also skip stuff you might not need until you actually need it (that happens in my raytracer that uses KD-tree)

But can I fetch a whole region of the texture memory in one function call ? like a memcpy… but for texture memory :/

Nope, but you can fetch 1 uint4 for the first 4 values, a uint2 for the second two, you can pack your bool’s (extracting them with bit-ops) and put your aabb in 3 float2’s.

I think it all depends on what you need.

If every thread has to process a different element of your array then indeed a SoA is a way to go as Riedijk suggests. (I also play with kd-trees ;) )

However if you want a single structure element for your whole block (or at least a warp), AoS is a better organisation. In that organisation, you can employ all threads to fetch the element in parallel for you, each taking a 32-bit piece of data. To ease my work I usually use the following template function:

template <typename T>

__device__ void memCopySingle(T *destination, T *source) {

	int *dest=(int *)destination;

	int *src=(int *)source;

	if (threadIdx.x<sizeof(T)/4) {