I am working with CUDA on a Windows platform. I have a question about initializing an array in shared memory. If the array is declared as follows: __shared__ float Array[16 * 16], it works fine. But when I try to pass the size of the array as an argument to the kernel function (for example __shared__ float Array[m], where m is passed in as an argument), I get a build error: "expression must have a constant value". Is it not possible to declare the size of the array on the fly? I tried using the extern __shared__ float declaration, but I am using quite a few variables and am running into memory-overlap problems. I would greatly appreciate any help.
This is explained in the programming guide. Because it's not possible to allocate memory on the fly inside __global__ or __device__ functions, you have to either allocate shared arrays statically, or pass the size through the execution configuration (the <<< >>> parameters). For the same reason, you can have at most one dynamically sized shared array per kernel invocation. Here's some example code:
__global__ void myKernel(float *deviceData)
{
extern __shared__ float myData[];
// kernel code...
}
int main()
{
// initialization, data download, etc.
int sharedMemSize = calcSharedMemSize(); // depends on the problem
myKernel<<<blocks, threads, sharedMemSize>>>(data); // grid dim, block dim, shared bytes
// shutdown
}
The key here is that you declare the shared array extern to tell the compiler it will be allocated elsewhere. Then at runtime the CUDA runtime allocates the shared memory based on the third parameter of the execution configuration. Because of this, only one dynamically sized shared memory array per kernel is supported.
Thanks Mark. I did try that too. In fact, I tried computing the offsets myself, as mentioned in the guide for dynamically allocated arrays, but with the number of variables I am using it wasn't feasible. So I would have to statically allocate the sizes of the arrays.
n is the height of the array, which is computed in my main .cpp file using the sizeof() function and passed into the kernel:
__global__ void
kernel( float* C, float* A, float* B, float* D, unsigned int n)
{
// access thread id
const unsigned int tid = threadIdx.x; // 0-3
// warp id
const unsigned int wid = threadIdx.y; // 0-n
const unsigned int ltid = (wid * blockDim.x) + tid;
// access number of threads in this block
const unsigned int num_threads = blockDim.x * blockDim.y;
(Note, please wrap code in code blocks using the buttons on the edit pane so that the forum will not reformat your code.)
I see. As I said, there is no dynamic memory allocation inside CUDA kernel code. What you could do is declare one array and define offsets and macros to access “virtual” arrays.
// define some macros to access your "virtual" arrays
#define As(i) data[AsIndex + (i)]
#define Bs(i) data[BsIndex + (i)]
#define Cs(i) data[CsIndex + (i)]
#define Ds(i) data[DsIndex + (i)]
// etc...
__global__ void myFunc(float *deviceDataA, float *deviceDataB, unsigned n)
{
extern __shared__ float data[];
unsigned AsIndex = 0;
unsigned BsIndex = n;        // As holds n elements
unsigned CsIndex = 2*n + 3;  // Bs holds n+3 elements
unsigned DsIndex = 4*n + 3;  // Cs holds 2*n elements
// etc...
// now, when you need to access the "virtual" arrays, just use the macros:
As(threadIdx.x) = deviceDataA[threadIdx.x];
Bs(threadIdx.x) = deviceDataB[threadIdx.x];
// etc...
}
Now you just allocate “data” at runtime by using the execution configuration.
Thanks a ton. It worked perfectly. But there is a limit on the amount of shared memory I can allocate from the execution configuration, right? Is there a way I could get around this, as I am trying to develop a generic program that can do the computation on any number of floating-point values? In my other post I mentioned block sizes, which I think would be the best solution for this. But I am confused about how I could use blocks with the number of variables I am using. Also, there is a sequence of operations I am trying to perform on the data.
There is a physical limitation of 16KB per thread block on G80. You have to work within that physical limit. You can always re-use memory space later by re-using temporary storage for intermediates and loading more data later in the kernel to replace data you are finished with.
To extend the earlier example:
// define some macros to access your "virtual" arrays
#define As(i) data[AsIndex + (i)]
#define Bs(i) data[BsIndex + (i)]
#define Cs(i) data[CsIndex + (i)]
#define Ds(i) data[DsIndex + (i)]
#define Temp(i) data[(i)] // Temp region starts at offset 0
// etc...
__global__ void myFunc(float *deviceDataA, float *deviceDataB,
float *deviceDataC, float *deviceDataD,
float *deviceDataOut, unsigned n)
{
extern __shared__ float data[];
// Temp occupies offsets [0, n)
unsigned AsIndex = n;
unsigned BsIndex = 2*n;
// note that we restart Cs at n. This means we can't use Cs and Ds at the same
// time as As and Bs
unsigned CsIndex = n;
unsigned DsIndex = 2*n;
// etc...
int tid = threadIdx.x;
// now, when you need to access the "virtual" arrays, just use the macros:
As(tid) = deviceDataA[tid];
Bs(tid) = deviceDataB[tid];
__syncthreads();
// compute using As and Bs and store in temp
Temp(tid) = As(tid) + Bs(tid);
// now load in C and D:
Cs(tid) = deviceDataC[tid];
Ds(tid) = deviceDataD[tid];
__syncthreads();
// compute using Temp, Cs, and Ds and write to output
deviceDataOut[tid] = Cs(tid) * (Ds(tid) + Temp(tid));
}
Also, for constant data, consider using constant memory or texture memory.
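For illustration only, here is a hedged CUDA sketch of the constant-memory option (the kernel and names are hypothetical, not part of the code above). Constant memory is declared at file scope and filled from the host with cudaMemcpyToSymbol:

```cuda
// Hypothetical example: a small read-only coefficient table. Constant-memory
// reads are cached, and a read is fast when threads read the same address.
__constant__ float coeffs[16];

__global__ void applyCoeffs(float *out, const float *in, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];
}

// Host side, before launching:
//   float h_coeffs[16] = { /* ... */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```

Unlike the dynamic shared array, this does not count against the 16KB per-block shared memory limit.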
You can also create pointers into the dynamically allocated shared array instead of using macros. Use this:
extern __shared__ float data[];
__device__ void func() // some __device__ or __global__ function
{
float *As = data;
float *Bs = data+183; // 183 is size of As
}
instead of this:
extern __shared__ float data[];
__shared__ float *As = data; // not allowed: __shared__ variables cannot have initializers
__shared__ float *Bs = data+183; // 183 is size of As
The programming guide is outdated in this regard.
Section 4.2.2.4 should say:
[i]Pointers in code that is executed on the device are supported as long as the compiler is able to resolve whether they point to either the shared memory space or the global memory space, otherwise they are restricted to only point to memory allocated or declared in the global memory space.
Dereferencing a pointer either to global or shared memory in code that is executed on the host or to host memory in code that is executed on the device results in an undefined behavior, most often in a segmentation fault and application termination.[/i]
One additional note, Shyam. The inability to allocate arrays on the fly as you tried (float Array[m]) is not unique to CUDA. You cannot do this in C++ (or in C before the C99 standard) since, as the compiler complained, the array dimension in a declaration must be a constant.
The g++ compiler will let you declare arrays with sizes that are variables, even though it is not part of the C++ standard. This has confused a number of people I work with when they moved to a Solaris compiler which did not support that extension.
When assigning a constant value to shared memory, is it better to let one thread do the work, or is there an optimization that occurs? For example:
__shared__ float x;
x = -1.0f; // is this better?
if (tid == 0) x = -1.0f; // or this?
The PTX code for the straight assignment is just two instructions, while the if (tid == 0) approach is four instructions.
Will the straight assignment get serialized, with each thread repeating the assignment in turn? Or is there some kind of special treatment for a constant assignment to shared memory?
The manual says that the result of concurrent writes to the same memory location is undefined. As long as all threads write the same value you can probably get away with it, but you had better stick with the second version: the threads might not be in sync, so a second thread might write the -1 after the first has already moved on and modified x further. If x isn't modified further down, you should make it a constant.
My hunch is that the second is better, but you need to follow it with a __syncthreads() to avoid the issue Peter mentions. Also, you may be able to do this:
if (tid < 32) x = -1.0f;
__syncthreads();
This would mean the entire first warp writes the value, so that the first warp is non-divergent, unlike the (tid == 0) case. However, divergence in only one warp is probably not much slower overall than divergence in zero warps, unless you have very few warps per block (which is usually bad for efficiency anyway).
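Putting the two suggestions together, the safe initialization pattern would look like this (a CUDA sketch; the surrounding kernel is hypothetical):

```cuda
__global__ void kernel(float *out)
{
    __shared__ float x;
    unsigned tid = threadIdx.x;

    if (tid == 0)        // one thread (or one full warp, tid < 32) writes...
        x = -1.0f;
    __syncthreads();     // ...and every thread waits before reading x

    out[tid] = x;        // now all threads see the initialized value
}
```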
My question was specifically targeting whether there was some subtle optimization for initializing shared memory to a constant. Apparently there’s not.
I suspect, then, that the method where ALL threads write the (same) constant to one shared memory location would cause bank conflicts? I didn't detect any, but it's a pretty small part of my code. If it did, that would settle the issue. :P
I left out the __syncthreads() explicitly because I was initializing other shared variables. But point well taken!