I’m currently trying to implement several image processing filters using CUDA. I’m relatively new to CUDA programming.
One problem I keep running into is allocating shared memory. For example, a median filter requires an array whose size depends on the filter radius, which is set by the user at runtime. Obviously I can't use this runtime value to set the size of a statically declared array. At present I allocate for a maximum radius (#defined), although this seems wasteful in the case of small radii.
What’s the best way to allocate variable amounts of shared memory?
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device. It is specified by inserting an expression of the form <<< Dg, Db, Ns >>> between the function name and the parenthesized argument list, where:
Dg is of type dim3 (see Section 4.3.1.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched;
Db is of type dim3 (see Section 4.3.1.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in Section 4.2.2.3; Ns is an optional argument which defaults to 0.
The arguments to the execution configuration are evaluated before the actual function arguments.
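In practice, this means you declare the shared array inside (or above) the kernel with an unsized extern declaration and pass the required byte count, computed from the runtime radius, as the third launch-configuration argument (Ns). Here is a minimal sketch assuming a float image; the kernel name, block size, image dimensions, and the placeholder body (which stages only the centre pixel, not the halo, and does no actual median computation) are illustrative, not part of the question:

```cuda
#include <cuda_runtime.h>

// Unsized extern declaration: its actual size is whatever Ns bytes
// were passed in the execution configuration at launch time.
extern __shared__ float tile[];

__global__ void medianFilterSketch(const float *in, float *out,
                                   int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int tileW = blockDim.x + 2 * radius;   // tile width including halo

    // Each thread stages its own pixel into the shared tile; a real median
    // filter would also load the halo and then rank the neighbourhood.
    if (x < width && y < height)
        tile[(threadIdx.y + radius) * tileW + (threadIdx.x + radius)] =
            in[y * width + x];
    __syncthreads();

    // Placeholder: copy the staged pixel back out.
    if (x < width && y < height)
        out[y * width + x] =
            tile[(threadIdx.y + radius) * tileW + (threadIdx.x + radius)];
}

int main()
{
    int width = 512, height = 512;
    int radius = 3;                        // chosen by the user at runtime
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // Shared-memory size derived from the runtime radius, passed as Ns.
    size_t smemBytes = (block.x + 2 * radius) * (block.y + 2 * radius)
                       * sizeof(float);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  width * height * sizeof(float));
    cudaMalloc(&d_out, width * height * sizeof(float));

    medianFilterSketch<<<grid, block, smemBytes>>>(d_in, d_out,
                                                   width, height, radius);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This way nothing is over-allocated for small radii: the block only receives exactly the bytes requested at launch. Note that if a kernel needs several dynamically sized arrays, they all alias the single extern allocation, so you have to pass the combined size and carve the buffer up with offsets yourself.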