How can I specify the ‘maxrregcount’ option to the PTX assembler through OpenCL? Passing “--maxrregcount=64” to clBuildProgram gives me the following error message:
I realize this will spill some variables out to global memory, but I’m OK with that. The inner loop of my algorithm has 20-40 texture accesses. I know that loop fits in 64 registers because I have executed it in isolation. That loop is in turn executed 20 times by an outer loop, so I can afford a few global memory accesses at the outer level since they will occur so infrequently. The entire purpose of this exercise is to increase the work-group size executing on a single multiprocessor. Currently, register pressure limits me to 192 work items. I’ve heard that at least 256 work items are needed to start hiding texture latency effectively. Furthermore, texture latency appears to be a constant constraint, based on this paragraph from the CUDA best practices guide:
I’m also curious what will change with textures in the upcoming Fermi architecture. The white papers are sparse on the issue.
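To put rough numbers on the register limit described above, here is a minimal sketch of the arithmetic. It assumes a multiprocessor with 16384 registers and a warp size of 32; the post does not name the hardware, so both figures are assumptions, and the 85-registers-per-work-item case is a hypothetical value chosen only because it reproduces the 192 work-item limit. The function name `max_work_items` is also illustrative. Real register allocation has its own granularity, and other limits (local memory, maximum work-group size) can apply, so treat this as an estimate only.

```python
# Rough estimate of how many work items fit on one multiprocessor when
# registers are the limiting resource. All hardware numbers are assumptions.

REGISTERS_PER_MULTIPROCESSOR = 16384  # assumed; the post does not name the GPU
WARP_SIZE = 32                        # assumed scheduling granularity


def max_work_items(registers_per_work_item: int) -> int:
    """Largest multiple of WARP_SIZE whose registers fit on one multiprocessor."""
    if registers_per_work_item <= 0:
        raise ValueError("registers_per_work_item must be positive")
    # How many work items fit if registers were handed out one item at a time.
    raw = REGISTERS_PER_MULTIPROCESSOR // registers_per_work_item
    # Round down to a whole number of warps.
    return (raw // WARP_SIZE) * WARP_SIZE


if __name__ == "__main__":
    # 85 is a hypothetical register count; 64 is the cap requested in the post.
    for regs in (85, 64):
        print(regs, "registers per work item ->", max_work_items(regs), "work items")
```

Under these assumptions, capping the kernel at 64 registers is what would raise the limit from 192 to 256 work items, which is why the register cap matters here.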