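This error generally means that the number of registers, the number of threads, or the amount of shared memory requested per block is too large, so occupancy is zero: not even one block fits on a multiprocessor.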
Given your description of the ptxas -v output, it is very likely that in this case the cause is too much shared memory. You don't mention how much dynamic shared memory you allocate at kernel launch... any?
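To make the three limits concrete, here is a minimal host-only sketch of the check in plain C++. It is not how the driver actually decides, and the struct and function names (`DeviceLimits`, `LaunchRequest`, `first_exceeded_limit`) are made up for illustration. It assumes you fill in the device limits yourself (on a real system they are the `maxThreadsPerBlock`, `regsPerBlock`, and `sharedMemPerBlock` fields returned by `cudaGetDeviceProperties`) and take the per-kernel numbers from ptxas -v plus whatever dynamic shared memory size you pass at launch. It also ignores register allocation granularity, so treat a "fits" answer as optimistic.

```cpp
#include <cstddef>
#include <string>

// Per-block hardware limits (hypothetical struct for illustration).
// On a real system these values come from cudaGetDeviceProperties.
struct DeviceLimits {
    int max_threads_per_block;
    int regs_per_block;
    std::size_t shared_mem_per_block;  // bytes
};

// What a single block of the kernel asks for (hypothetical struct).
// regs_per_thread and static_smem are the figures ptxas -v reports;
// dynamic_smem is the size passed as the third launch parameter.
struct LaunchRequest {
    int threads_per_block;
    int regs_per_thread;
    std::size_t static_smem;   // bytes
    std::size_t dynamic_smem;  // bytes
};

// Returns an empty string if one block fits within the limits,
// otherwise the name of the first limit that is exceeded.
// Simplification: register allocation granularity is ignored.
std::string first_exceeded_limit(const DeviceLimits& dev,
                                 const LaunchRequest& req) {
    if (req.threads_per_block > dev.max_threads_per_block) {
        return "threads";
    }
    const long long regs_needed =
        static_cast<long long>(req.regs_per_thread) * req.threads_per_block;
    if (regs_needed > dev.regs_per_block) {
        return "registers";
    }
    // Static and dynamic shared memory count against the same budget.
    if (req.static_smem + req.dynamic_smem > dev.shared_mem_per_block) {
        return "shared memory";
    }
    return "";
}
```

For example, with illustrative limits of 512 threads, 8192 registers, and 16384 bytes of shared memory per block, a kernel that uses 12288 bytes of static shared memory and is launched with another 8192 bytes of dynamic shared memory asks for 20480 bytes in total, so the function returns "shared memory" even though the static figure alone looks fine.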
Why this would differ from one machine to another is unclear. As far as I know, for this error code, as long as you're running the same driver, toolkit, and application code on each machine, there shouldn't be any reason why one GPU would work and another of the same compute capability would not.
–Cliff