Kernel doesn't benefit from Texture Mem

Hey guys,

I ported my OpenCL kernel to use texture memory instead of global memory.
Unfortunately, the new kernel doesn't achieve any speedup over the global-memory version.
The NVIDIA profiler says the new kernel performs only about 20% of the global memory accesses of the old kernel.
But the instruction and branch counters increased to about 4-5 times their previous values.
How can I make sense of this? It doesn't add up for me at all.

Any advice appreciated. Thanks in advance.

Basically, texture memory only helps when your algorithm exposes data locality that the texture cache can exploit. On a texture-cache miss, I'd expect the fetch to take as long as a global memory access.

It's hard to say more without seeing the code. Try posting both the old and the new version.

I'm writing a raytracer on an OpenCL base. The nodes of my acceleration structure (a BVH) are located in texture memory. One block traces a bunch of neighbouring rays (threads) cast from the camera. So I think there should be enough data locality, since neighbouring rays travel in similar directions and therefore access almost the same nodes of the tree. Most other raytracers written in CUDA/OpenCL benefit from storing the tree nodes in texture memory. Any ideas?

I see, but what is your BVH layout in the texture? What does a BVH node look like, and how do you identify its children? You're right that texture caches generally help in raytracing.

I store one BVH node in 4 pixels of a floating-point RGBA texture. The first and second pixels contain the x and y coordinates of the children's bounding boxes, the third pixel contains the z coordinates for both boxes, and the last pixel contains the indices of the two child nodes, plus the start/end indices of the node's triangles if it's a leaf. The two bounding boxes are always fetched and tested together.

For very small scenes I get a speedup of about 4-5x on the GPU versus the CPU. For scenes with more than 100k triangles I get pretty bad performance (slower than the CPU). Is it because only 8 KB of texture cache is available per multiprocessor? But then how do other raytracers benefit from texture caching?

Thanks a lot.

Do the 4 pixels of your BVH node form a linear "array", or a rectangle? A rectangle would be better (texels are stored in Z order, I think).

Why don't you store just the node's own bbox (6 floats), a pointer to the first child (the second child should be its neighbour) or a pointer to the triangles, and the number of primitives, which tells inner nodes and leaves apart? Then two texels of an RGBA texture should be enough, or am I wrong?

Do you have a 2D-range kernel, with its size matching the number of rays?
