Acceleration strategies for voxel traversal in uniform grids?

Hi everyone,

I am working on a physics problem: calculating the accumulated radiological path length through a set of CT images. To get the accumulated path length, the distance each ray travels inside each voxel has to be weighted by the intensity of that voxel in the image. This makes it different from ray tracing for rendering purposes, i.e., I am not using a k-d tree or anything like that, just pure iterative voxel traversal through a uniform grid.
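
For concreteness, the per-ray computation can be sketched on the CPU roughly as follows. This is a minimal illustration in C++, not the actual kernel: the `Volume` struct, the function name `radiologicalPath`, the x-fastest memory layout, the grid corner at (0, 0, 0), and the requirement that the ray starts inside the grid are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container for a CT volume on a uniform grid.
struct Volume {
    int nx, ny, nz;            // number of voxels along each axis
    double sx, sy, sz;         // voxel size along each axis
    std::vector<double> data;  // voxel intensities, x index varies fastest

    double at(int i, int j, int k) const {
        return data[(static_cast<std::size_t>(k) * ny + j) * nx + i];
    }
};

// Sum of (length travelled in voxel * voxel intensity) along the segment
// o + t*d for t in [0, tEnd], where d is a unit direction vector.
// Assumes the grid corner is at (0,0,0) and that o lies inside the grid;
// a ray starting outside simply returns 0 in this sketch.
double radiologicalPath(const Volume& v, const double o[3],
                        const double d[3], double tEnd)
{
    assert(tEnd >= 0.0);
    const int n[3] = {v.nx, v.ny, v.nz};
    const double s[3] = {v.sx, v.sy, v.sz};
    const double inf = std::numeric_limits<double>::infinity();

    int idx[3], step[3];
    double tNext[3], tDelta[3];
    for (int a = 0; a < 3; ++a) {
        idx[a] = static_cast<int>(std::floor(o[a] / s[a]));
        if (idx[a] < 0 || idx[a] >= n[a]) return 0.0;
        if (d[a] > 0.0) {
            step[a] = 1;
            tNext[a] = ((idx[a] + 1) * s[a] - o[a]) / d[a];  // next upper face
            tDelta[a] = s[a] / d[a];
        } else if (d[a] < 0.0) {
            step[a] = -1;
            tNext[a] = (idx[a] * s[a] - o[a]) / d[a];        // next lower face
            tDelta[a] = -s[a] / d[a];
        } else {
            step[a] = 0;          // ray is parallel to this axis' faces
            tNext[a] = inf;
            tDelta[a] = inf;
        }
    }

    double t = 0.0, sum = 0.0;
    while (t < tEnd) {
        // Pick the axis whose voxel face is crossed first.
        int a = 0;
        if (tNext[1] < tNext[a]) a = 1;
        if (tNext[2] < tNext[a]) a = 2;

        // Length inside the current voxel, weighted by its intensity.
        const double tExit = std::min(tNext[a], tEnd);
        sum += (tExit - t) * v.at(idx[0], idx[1], idx[2]);
        t = tExit;
        if (t >= tEnd) break;

        // Step into the neighbouring voxel, or stop if the ray left the grid.
        idx[a] += step[a];
        if (idx[a] < 0 || idx[a] >= n[a]) break;
        tNext[a] += tDelta[a];
    }
    return sum;
}
```

Each loop iteration performs exactly one intensity lookup (`v.at(...)`), and the address of that lookup depends on the ray direction, which is where the memory access pattern of the GPU version comes from.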

My setup uses Amanatides and Woo's voxel traversal algorithm (http://www.cse.yorku.ca/~amana/research/grid.pdf), with one ray per thread. Currently it is about 20 times faster than the CPU code when the image is read through texture memory, but I would like to hear your suggestions on how to make it better. This approach is inherently memory inefficient: in each iteration, every thread reads the image intensity of its current voxel from a different, essentially scattered location in global memory. I think people may have run into similar issues with Monte Carlo simulations on the GPU, but I am not sure how they solved them. Any suggestions will be appreciated. Thanks!