Checking my OpenCL understanding


Just looking for a general sanity check of my approach to an OpenCL algorithm. Any advice would be greatly appreciated.

I have a situation where I want to upload a height map (triangle mesh) to a GPU-based OpenCL device, along with rays to be intersected against the mesh. The mesh could be fairly large (up to 10,000,000 triangles). The number of rays being intersected will be relatively small (up to 5,000). I want to report back which triangles each ray intersects. The test will be run very often, and the contents of the height map and ray data will change each time.

Based on studying some examples and OpenCL introductions, I’m thinking I would upload the triangle mesh to a global memory buffer on the device, then queue up the intersection tests so that each work-item tests one ray against all mesh triangles. This would often result in a few hundred work-items, each responsible for a few million triangle hit tests. Ray-triangle intersections would be returned to the host app through global memory.

Before I dive headlong into the experiment, I have a few questions for any OpenCL gurus out there:


1. Does this approach seem sane? Would I be better off dividing the work so that one ray is processed at a time, with each work-item performing a single (or a few) triangle intersection tests for that ray? I assume that would result in a lot of overhead from queuing commands?

2. I’ve read that access to global memory is slow compared to everything else, but I don’t think it is reasonable to copy the mesh data into work-group local memory before processing. Am I missing something here, or is it best to just have the work-items read the triangle data from global memory?

3. Any comments or suggestions are really appreciated. I’m very comfortable with CPU-based multi-threading, but this is all very new to me.


Whether a few hundred work-items are enough to reach full occupancy depends on your OpenCL device.

It could be better to parallelize over the triangles instead, with each work-item striding through the triangle list and looping over all rays:

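size_t pid = get_global_id(0);             // unique work-item ID
size_t totalThreads = get_global_size(0);  // total number of work-items

for (size_t i = pid; i < totalTriangles; i += totalThreads) {
    for (size_t n = 0; n < totalRays; n++) {
        // intersect ray n with triangle i
    }
}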
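
For anyone who wants to try the loop above on the CPU first, here is a minimal sketch in plain C. Everything in it is an assumption made for illustration: the `Vec3`, `Triangle` and `Ray` layouts, the function names, and the choice of the Möller-Trumbore intersection test (the thread does not say which test is used). `gid` and `stride` stand in for `get_global_id(0)` and `get_global_size(0)`, and the per-ray hit counter is only a placeholder for whatever result format you actually need.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 a, b, c; } Triangle;   /* hypothetical layout */
typedef struct { Vec3 origin, dir; } Ray;    /* hypothetical layout */

static Vec3 sub(Vec3 p, Vec3 q) { return (Vec3){p.x - q.x, p.y - q.y, p.z - q.z}; }
static Vec3 cross(Vec3 p, Vec3 q) {
    return (Vec3){p.y * q.z - p.z * q.y, p.z * q.x - p.x * q.z, p.x * q.y - p.y * q.x};
}
static float dot(Vec3 p, Vec3 q) { return p.x * q.x + p.y * q.y + p.z * q.z; }

/* Moller-Trumbore test: returns 1 if the ray hits the triangle in front of
   its origin, 0 otherwise. */
static int ray_hits_triangle(Ray r, Triangle t) {
    const float eps = 1e-7f;
    Vec3 e1 = sub(t.b, t.a), e2 = sub(t.c, t.a);
    Vec3 p = cross(r.dir, e2);
    float det = dot(e1, p);
    if (det > -eps && det < eps) return 0;   /* ray parallel to the triangle plane */
    float inv = 1.0f / det;
    Vec3 s = sub(r.origin, t.a);
    float u = dot(s, p) * inv;               /* first barycentric coordinate */
    if (u < 0.0f || u > 1.0f) return 0;
    Vec3 q = cross(s, e1);
    float v = dot(r.dir, q) * inv;           /* second barycentric coordinate */
    if (v < 0.0f || u + v > 1.0f) return 0;
    return dot(e2, q) * inv > eps;           /* hit distance must be positive */
}

/* Work done by ONE work-item: gid plays the role of get_global_id(0),
   stride the role of get_global_size(0). */
static void work_item(size_t gid, size_t stride,
                      const Triangle *tris, size_t total_triangles,
                      const Ray *rays, size_t total_rays,
                      unsigned *hits_per_ray) {
    for (size_t i = gid; i < total_triangles; i += stride)
        for (size_t n = 0; n < total_rays; n++)
            if (ray_hits_triangle(rays[n], tris[i]))
                hits_per_ray[n]++;   /* on a GPU this would need an atomic increment */
}

/* Serial stand-in for the device: runs every work-item in turn. */
static void run_all_work_items(size_t total_work_items,
                               const Triangle *tris, size_t total_triangles,
                               const Ray *rays, size_t total_rays,
                               unsigned *hits_per_ray) {
    for (size_t gid = 0; gid < total_work_items; gid++)
        work_item(gid, total_work_items, tris, total_triangles,
                  rays, total_rays, hits_per_ray);
}
```

Because work-item `gid` only touches triangles `gid`, `gid + stride`, `gid + 2 * stride` and so on, every triangle is tested exactly once no matter how many work-items are launched. On a real device several work-items can hit the same ray at once, so the plain `hits_per_ray[n]++` would have to become an atomic operation or a per-work-item output slot.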

How much this matters depends on your computation. A global memory access costs roughly 400 to 800 cycles, but running multiple warps can hide this latency.


Ok, thanks a lot for your suggestions.

One follow-up question… you mention that ‘running multiple warps can hide global memory access latency’. I’m not sure I fully understand why, and I’m less familiar with NVIDIA-specific terminology than with OpenCL at this point. Is the nature of global memory access such that when all threads in a warp access it at the same time, it’s counted as one access (with one latency penalty instead of 32)? Even though the threads in the warp are each reading different regions of global memory?

All work-items in a warp are executed in “SIMD” or “SIMT” fashion, which means they run in lockstep: each work-item waits for the others in the warp to finish an instruction before moving on.

A global memory access issued by a whole warp should therefore count as one latency penalty rather than 32, although this also depends on memory bandwidth.

By running more warps, the device can do computations for the other warps while one warp waits for its memory access.
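
As a rough back-of-the-envelope illustration of that point, the sketch below estimates how many warps need to be resident to cover one memory wait. It assumes a deliberately simplified model that is not a vendor formula: each warp does a fixed number of cycles of arithmetic between global memory accesses, and the latency is the 400 to 800 cycle figure mentioned in this thread. The function name and parameters are hypothetical.

```c
/* Estimate of how many warps must be resident so that, while one warp waits
   on global memory, the others have enough arithmetic to fill the gap.
   Simplified model (an assumption):
   warps ~= ceil(latency_cycles / compute_cycles_per_access). */
static unsigned warps_to_hide_latency(unsigned latency_cycles,
                                      unsigned compute_cycles_per_access) {
    if (compute_cycles_per_access == 0)
        return 0;   /* no arithmetic to overlap with; the model does not apply */
    return (latency_cycles + compute_cycles_per_access - 1)
           / compute_cycles_per_access;   /* integer ceiling division */
}
```

With a made-up 20 cycles of arithmetic per access, `warps_to_hide_latency(400, 20)` gives 20 and `warps_to_hide_latency(800, 20)` gives 40. Real hardware is messier than this, so treat the result as a starting point for profiling rather than a prediction.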

In real life, code behaves differently than it does on paper, so you will have to test different scenarios and use the profiler to see which one suits your algorithm best.


Here is a good paper on latencies and threads (work-items)


That article was very helpful, thanks!