[device code, with data and moreData passed to kernel]
for (i=a; i<=b; i++) data[i] = (float*)&(moreData[i]);
and then access via data[i][j]?
I guess this is mostly a newbie question about whether I can typecast &data as float**, and whether chained square brackets such as data[i][j] always dereference the pointers reliably on the device.
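Here is a minimal CUDA sketch of the pattern in the question. It assumes moreData is an array of fixed-length rows. The post doesn't show its type, so Row, ROW_LEN and the helper names below are hypothetical. Chained brackets need nothing special on the device: data[i][j] is a load of the pointer data[i] followed by a load of element j, the same as on the host. What matters is that the float** table itself sits in device memory and holds device addresses. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define ROW_LEN 8  // hypothetical row length

// Hypothetical element type for moreData: one fixed-length row per element.
struct Row { float v[ROW_LEN]; };

// Same loop as in the question: data[i] points at the first float of moreData[i].
__global__ void buildRowTable(float **data, Row *moreData, int a, int b)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)   // one thread, like the serial loop
        for (int i = a; i <= b; i++)
            data[i] = (float *)&(moreData[i]);
}

// One thread per row; data[i][j] is two loads: the pointer data[i], then element j.
__global__ void sumRows(float **data, float *out, int rows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) {
        float s = 0.0f;
        for (int j = 0; j < ROW_LEN; j++)
            s += data[i][j];
        out[i] = s;
    }
}

// Host driver: row i holds ROW_LEN copies of i; writes the row sums to out[0..rows-1].
void runRowTableDemo(int rows, float *out)
{
    Row *hRows = new Row[rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < ROW_LEN; j++)
            hRows[i].v[j] = (float)i;

    Row *dRows; float **dData; float *dOut;
    cudaMalloc(&dRows, rows * sizeof(Row));
    cudaMalloc(&dData, rows * sizeof(float *));  // the pointer table itself lives in device memory
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dRows, hRows, rows * sizeof(Row), cudaMemcpyHostToDevice);

    // Kernels in the same (default) stream run in order, so the table is ready for sumRows.
    buildRowTable<<<1, 1>>>(dData, dRows, 0, rows - 1);
    sumRows<<<(rows + 127) / 128, 128>>>(dData, dOut, rows);
    cudaMemcpy(out, dOut, rows * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dRows); cudaFree(dData); cudaFree(dOut);
    delete[] hRows;
}
```

If you fill the table on the host instead, store device addresses such as dRows + i in a host array and cudaMemcpy that array to dData. Ordinary host pointers are not valid inside a kernel.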
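From a first glance I don’t see why not. But I can see the memory access pattern being misaligned and a bit slow, since each data[i][j] read has to fetch the row pointer data[i] before it can fetch the element itself. I assume you’ve looked through the documentation for cudaMallocPitch?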
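For a regular 2D array, here is a minimal sketch of the cudaMallocPitch route. cudaMallocPitch may pad each row so that every row starts on an aligned boundary. It returns the row spacing in bytes as pitch. Rows are then addressed as (char *)base + i * pitch rather than through a table of pointers. The helper names and the scaling kernel are illustrative only, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Scales every element of a width x height pitched array by factor.
// Rows start `pitch` bytes apart, so the row address is computed in bytes.
__global__ void scalePitched(float *base, size_t pitch, int width, int height, float factor)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < width) {
        float *row = (float *)((char *)base + i * pitch);
        row[j] *= factor;
    }
}

// Host driver (hypothetical): copies a packed host array in, scales it, copies it back.
void runPitchedDemo(float *h, int width, int height, float factor)
{
    float *d; size_t pitch;
    cudaMallocPitch((void **)&d, &pitch, width * sizeof(float), height);
    // Host rows are packed (width floats apart); device rows are pitch bytes apart.
    cudaMemcpy2D(d, pitch, h, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scalePitched<<<grid, block>>>(d, pitch, width, height, factor);

    cudaMemcpy2D(h, width * sizeof(float), d, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

cudaMemcpy2D takes separate source and destination pitches, so the host copy can stay tightly packed while the device copy keeps its padded rows.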
The way that code is written implies that it is a ragged array (rows of different lengths), but is it?
If the j lengths for the different i are all similar, then it could be worthwhile padding them all up to the same value (a multiple of 16) and making it a regular array.
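Here is a minimal sketch of padding to a multiple of 16. It assumes each row's real length is kept in a separate len array, which is a hypothetical name like the other helpers here. The row length is rounded up to a multiple of 16, and the rows are stored back to back in one cudaMalloc'd buffer. Element (i, j) is then data[i * stride + j], so no pointer table is needed. Consecutive threads take consecutive j within a row, so their addresses are contiguous.

```cuda
#include <cuda_runtime.h>

// Round n up to the next multiple of 16.
__host__ __device__ inline int paddedWidth(int n) { return (n + 15) / 16 * 16; }

// Regular (padded) layout: row i starts at data + i * stride; only the first
// len[i] entries of each row are real, the rest is padding that is left alone.
__global__ void scalePadded(float *data, const int *len, int stride, int rows, float factor)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive addresses
    int i = blockIdx.y;                             // one grid row per array row
    if (i < rows && j < len[i])
        data[i * stride + j] *= factor;
}

// Host driver (hypothetical): packs ragged rows into a zero-padded regular array,
// doubles the real entries on the device, and copies the padded array back.
// hOut must hold rows * paddedWidth(max length) floats; returns the stride used.
int runPaddedDemo(const int *lens, int rows, float *hOut)
{
    int maxLen = 0;
    for (int i = 0; i < rows; i++) if (lens[i] > maxLen) maxLen = lens[i];
    int stride = paddedWidth(maxLen);

    // Row i gets the value (i + 1) in its real slots; padding is 0.
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < stride; j++)
            hOut[i * stride + j] = (j < lens[i]) ? (float)(i + 1) : 0.0f;

    float *dData; int *dLen;
    size_t bytes = (size_t)rows * stride * sizeof(float);
    cudaMalloc(&dData, bytes);
    cudaMalloc(&dLen, rows * sizeof(int));
    cudaMemcpy(dData, hOut, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, lens, rows * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(128);
    dim3 grid((stride + block.x - 1) / block.x, rows);
    scalePadded<<<grid, block>>>(dData, dLen, stride, rows, 2.0f);

    cudaMemcpy(hOut, dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData); cudaFree(dLen);
    return stride;
}
```

For example, lengths {5, 20, 16} give a stride of 32 floats. The entries past len[i] in each row are padding that the kernel never touches.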