Being fairly new to CUDA (but experienced in C), I’d like to ask for some advice on designing a CUDA program.
So far I’ve written a framework for handling volumetric datasets (e.g. those downloaded from volvis.org). It transfers them to the device, applies some basic filtering, and returns the results to the host.
Let’s assume I just want to sum up the 3x3x3 neighbourhood of each voxel and divide the result by 27, i.e. implement a simple “box filter” on the dataset.
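To pin down the operation I mean, here is a minimal CPU reference sketch in plain C that GPU output could be compared against. It is not my actual framework: the function names, the `float` voxel type, the x-fastest memory layout, and the choice to clamp coordinates at the volume border are all assumptions made for illustration.

```c
#include <stddef.h>

/* Clamp i to [0, n-1]. Replicating border voxels is an assumption;
   the border policy is not fixed by the problem statement. */
static int clamp_index(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* CPU reference: each output voxel is the mean of its 3x3x3 neighbourhood.
   Assumed layout (x varies fastest): index = (z * ny + y) * nx + x.
   `in` and `out` must not alias and must each hold nx * ny * nz floats. */
void box_filter_3x3x3(const float *in, float *out, int nx, int ny, int nz)
{
    for (int z = 0; z < nz; ++z) {
        for (int y = 0; y < ny; ++y) {
            for (int x = 0; x < nx; ++x) {
                float sum = 0.0f;
                /* Visit all 27 neighbours, including the voxel itself. */
                for (int dz = -1; dz <= 1; ++dz) {
                    for (int dy = -1; dy <= 1; ++dy) {
                        for (int dx = -1; dx <= 1; ++dx) {
                            int cx = clamp_index(x + dx, nx);
                            int cy = clamp_index(y + dy, ny);
                            int cz = clamp_index(z + dz, nz);
                            /* size_t avoids int overflow on ~1 GB volumes. */
                            size_t idx = ((size_t)cz * ny + cy) * nx + cx;
                            sum += in[idx];
                        }
                    }
                }
                out[((size_t)z * ny + y) * nx + x] = sum / 27.0f;
            }
        }
    }
}
```

Each output voxel reads 27 inputs, and neighbouring outputs share most of those reads, which is why I am wondering about shared memory.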
How would you design such a program? The datasets are up to 1 GB in size, so how should I map this task onto CUDA’s kernel, block and grid model? Where should I put the data: constant or global memory? Should I transfer sub-volumes to shared memory? I actually do have a working solution, I just don’t know if it is efficient…
Should each thread compute one value of the smoothed dataset?
I’m curious what tricks and hints you can give me, and whether you could point out possible problems or best practices…
Maybe just a few lines of pseudocode, or an outline of how you’d arrange the steps, would tell me whether I’m on track in terms of “thinking CUDA”.
Thanks a lot! :)