I have a simple kd-tree raycasting kernel running on a CC 3.5 device. It estimates the translational distance between two triangle meshes, each of which has its own transformation matrix describing its position; I keep these matrices in constant memory. The kernel starts one ray per vertex and tests it for intersection with all triangles of the other mesh. The meshes are very small (~1000 vertices/triangles each), so the kernel execution time is very short (a few µs), and I need the distance for a very large number of positions of both meshes. I therefore thought I could reduce the kernel launch overhead by checking several positions in a single launch. But I ran into two problems:
I need indexed access to constant memory for the transformation matrices. When I pass them through constant memory as kernel parameters (e.g. MyKernel<<<…>>>(int Arg1)), the SASS shows that every kernel parameter that is accessed by index is first copied to local memory. As a result, performance drops the more positions I check simultaneously. Why does the compiler generate this SASS, and how can I stop it from doing so?
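To make the first problem concrete, here is a minimal sketch of the pattern I mean, not my actual kernel. All names (Mat3x4, Transforms, NUM_POSITIONS, transformKernel, runParamDemo) are made up for illustration, and the raycasting is replaced by a plain vertex transform. The access params.pose[pos] with a runtime index is the kind of access that, as far as I can tell from the SASS, triggers the copy to local memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical size: number of mesh positions checked per launch.
constexpr int NUM_POSITIONS = 8;

struct Mat3x4 { float m[12]; };                     // row-major 3x4 rigid transform
struct Transforms { Mat3x4 pose[NUM_POSITIONS]; };  // 8 * 48 = 384 bytes, passed by value

__device__ float3 apply(const Mat3x4& t, float3 v)
{
    return make_float3(t.m[0] * v.x + t.m[1] * v.y + t.m[2]  * v.z + t.m[3],
                       t.m[4] * v.x + t.m[5] * v.y + t.m[6]  * v.z + t.m[7],
                       t.m[8] * v.x + t.m[9] * v.y + t.m[10] * v.z + t.m[11]);
}

// blockIdx.y selects the position, so params.pose[] is indexed at runtime.
__global__ void transformKernel(Transforms params, const float3* in, float3* out, int n)
{
    int vertex = blockIdx.x * blockDim.x + threadIdx.x;
    int pos = blockIdx.y;
    if (vertex < n)
        out[pos * n + vertex] = apply(params.pose[pos], in[vertex]);
}

// Host helper for the sketch: pose p translates by (p, 0, 0).
// Returns the x coordinate of vertex 0 under pose `query`.
float runParamDemo(int query)
{
    assert(query >= 0 && query < NUM_POSITIONS);
    const int n = 4;
    Transforms h = {};
    for (int p = 0; p < NUM_POSITIONS; ++p) {
        h.pose[p].m[0] = h.pose[p].m[5] = h.pose[p].m[10] = 1.0f;  // identity rotation
        h.pose[p].m[3] = float(p);                                  // translation in x
    }
    float3 hIn[n] = {};                                             // all vertices at the origin
    float3 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, NUM_POSITIONS * sizeof(hIn));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    transformKernel<<<dim3(1, NUM_POSITIONS), 32>>>(h, dIn, dOut, n);

    float3 result;
    cudaMemcpy(&result, dOut + query * n, sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result.x;
}
```

Compiling something like this and dumping the SASS (for example with cuobjdump -sass) should show whether the parameter array is staged in local memory for a given toolkit version.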
So I figured I could instead declare the constant memory as global variables (e.g. __constant__ int Arg1). This avoids the copy to local memory and also bypasses the limit of 2048 bytes for kernel parameters, which makes me happy. However, my program consists of several worker threads, each of which issues the same kernel calls whenever it sees fit. Every worker thread therefore needs space for its own constant transformation matrices. But a global variable exists only once, which prevents the worker threads from launching kernels simultaneously. Could you please give me some hints on how to solve this problem?
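One workaround I am considering, shown here only as a sketch under my own assumptions: a single __constant__ array split into one slot per worker thread, where each worker uploads into its own slot on its own stream and passes the slot index to the kernel. The names (Pose, g_poses, MAX_WORKERS, POSES_PER_WORKER, slotKernel, runSlotDemo) and the sizes are hypothetical, and the kernel only reads back one matrix element instead of doing the raycast.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical sizes; 4 * 8 * 48 bytes = 1536 bytes of the 64 KB constant space.
constexpr int MAX_WORKERS = 4;
constexpr int POSES_PER_WORKER = 8;

struct Pose { float m[12]; };  // row-major 3x4 transform

// One slot of poses per worker thread.
__constant__ Pose g_poses[MAX_WORKERS][POSES_PER_WORKER];

// Stand-in for the raycasting kernel: reads the x translation of each pose in `slot`.
__global__ void slotKernel(int slot, float* outTx)
{
    int p = threadIdx.x;
    if (p < POSES_PER_WORKER)
        outTx[p] = g_poses[slot][p].m[3];
}

// What one worker thread would do: upload its poses into its own slot and
// launch on its own stream. Returns the x translation read back for pose 1.
float runSlotDemo(int slot, float tx)
{
    assert(slot >= 0 && slot < MAX_WORKERS);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    Pose h[POSES_PER_WORKER] = {};
    for (int p = 0; p < POSES_PER_WORKER; ++p)
        h[p].m[3] = tx + float(p);

    // Byte offset of this worker's slot inside g_poses.
    size_t offset = size_t(slot) * sizeof(h);
    // Pageable host memory is used here for brevity; pinned memory would be
    // needed for the copy to actually overlap with other work.
    cudaMemcpyToSymbolAsync(g_poses, h, sizeof(h), offset,
                            cudaMemcpyHostToDevice, stream);

    float* dOut = nullptr;
    cudaMalloc(&dOut, POSES_PER_WORKER * sizeof(float));
    slotKernel<<<1, 32, 0, stream>>>(slot, dOut);

    float hOut[POSES_PER_WORKER];
    cudaMemcpyAsync(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(dOut);
    cudaStreamDestroy(stream);
    return hOut[1];
}
```

Because the copy and the kernel are issued on the same stream, a worker cannot overwrite its slot while its own previous kernel is still reading it, and different workers never touch the same slot. I do not know whether this is the recommended approach, or whether the slot index costs anything compared to a plain constant variable.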