dynamic array allocation

lwan61c1t3 · September 18, 2014, 7:27pm

Hi All,

I am trying to dynamically allocate a small array within a thread. However, the “register” keyword does not seem to work.

I ran a test using the following code. The running time (5263 ms) is the same as without “register” (5264 ms), which probably means the array is still in global memory.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

#define N_RUN 10000000

__global__ void Kernel_TestSpeed(int *dev_icount)
{
    int icount = 0;
    int value = 1;

    int n = 1;
    //register int *element = new int[n];   // dynamically allocate a small array in register ??
    int *element = new int[n];

    do 
    {   
        element[0] = value;

        icount++;
    }while(icount < N_RUN); // run 10 M steps

    delete [] element;

    dev_icount[0] = value;
}

int main()
{
    int icount = 1;
    int *dev_icount;

    cudaMalloc((void**)&dev_icount, sizeof(int));

    clock_t t1, t2;
    t1 = clock();

    Kernel_TestSpeed <<< 1, 32 >>> (dev_icount);
    cudaMemcpy(&icount, dev_icount, sizeof(int), cudaMemcpyDeviceToHost);

    t2 = clock();

    printf("Running Time: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
    printf("Result: %d\n", icount);

    getchar();
    cudaFree(dev_icount);
}

Is there a way to dynamically allocate a small array in registers? I am using VS 2010 + CUDA 6.0 + K20c.

Many thanks in advance!

njuffa · September 18, 2014, 10:22pm

Side remark: The ‘register’ keyword is pretty much ignored by all modern C/C++ compilers independent of platform.

In general, register files in modern CPUs and GPUs cannot be indexed, so in order to allocate an array variable in registers, the array size must be known at compile time, all indexing of the array must resolve to compile-time constants, and the array must be small (with a size limit that depends on the hardware platform and/or compiler heuristics).
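
A minimal sketch of a per-thread array that meets those conditions, for illustration only. The names N_ELEM, registerArrayKernel, and runRegisterArrayDemo are hypothetical and not from this thread. The array size is a compile-time constant and the loops are fully unrolled, so every index resolves to a constant. Whether the array actually ends up in registers is still up to the compiler; compiling with nvcc -Xptxas -v and looking for a non-zero stack frame is one way to check.

```cuda
#include <cuda_runtime.h>

#define N_ELEM 4   // hypothetical size; must be a compile-time constant

__global__ void registerArrayKernel(int *out)
{
    int element[N_ELEM];   // fixed-size, thread-private array

    // Full unrolling turns element[i] into element[0], element[1], ...
    #pragma unroll
    for (int i = 0; i < N_ELEM; ++i)
        element[i] = threadIdx.x + i;

    int sum = 0;
    #pragma unroll
    for (int i = 0; i < N_ELEM; ++i)
        sum += element[i];

    out[threadIdx.x] = sum;   // each thread writes its own slot
}

// Host helper (hypothetical): launches 32 threads and returns the value
// computed by thread tid, or -1 if a CUDA call fails.
int runRegisterArrayDemo(int tid)
{
    const int threads = 32;
    int hOut[threads] = {0};
    int *dOut = NULL;

    if (cudaMalloc((void**)&dOut, sizeof(hOut)) != cudaSuccess)
        return -1;

    registerArrayKernel<<<1, threads>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    return (err == cudaSuccess) ? hOut[tid] : -1;
}
```

Indexing the array with a value only known at run time (for example element[threadIdx.x % N_ELEM]) would typically force it into local memory instead.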

A suitable choice for dynamically allocated relatively fast storage in the context of CUDA could be shared memory.

CudaaduC · September 19, 2014, 1:37am

As njuffa mentioned, shared memory is an option. One way to dynamically size an array for a thread block is to declare it extern __shared__ in the kernel and pass its size in bytes as the third kernel launch configuration parameter.

This thread goes into detail:

c++ - allocating shared memory - Stack Overflow

Scroll down to talonmies' response.
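
For reference, a rough sketch of the extern shared memory approach. This is not the code from the linked answer; the names sharedScratchKernel, runSharedScratchDemo, and nPerThread are made up for illustration. It assumes each thread wants a small private scratch array of nPerThread ints, carved out of one dynamically sized shared buffer.

```cuda
#include <cuda_runtime.h>

__global__ void sharedScratchKernel(int *out, int nPerThread)
{
    // Size is not known at compile time; it comes from the launch configuration.
    extern __shared__ int scratch[];

    // Interleaved layout: element k of thread t lives at scratch[k * blockDim.x + t],
    // so the threads of a warp touch consecutive words.
    for (int k = 0; k < nPerThread; ++k)
        scratch[k * blockDim.x + threadIdx.x] = threadIdx.x * k;

    // Each thread reads back only its own elements, so no __syncthreads() is needed here.
    int sum = 0;
    for (int k = 0; k < nPerThread; ++k)
        sum += scratch[k * blockDim.x + threadIdx.x];

    out[threadIdx.x] = sum;
}

// Host helper (hypothetical): returns the value computed by thread tid,
// or -1 if a CUDA call fails.
int runSharedScratchDemo(int nPerThread, int tid)
{
    const int threads = 32;
    int hOut[threads] = {0};
    int *dOut = NULL;

    if (cudaMalloc((void**)&dOut, sizeof(hOut)) != cudaSuccess)
        return -1;

    // Third launch parameter: dynamic shared memory size in bytes for the block.
    size_t shmemBytes = (size_t)threads * nPerThread * sizeof(int);
    sharedScratchKernel<<<1, threads, shmemBytes>>>(dOut, nPerThread);

    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    return (err == cudaSuccess) ? hOut[tid] : -1;
}
```

The total request (threads per block times nPerThread times sizeof(int)) has to fit in the shared memory available to a block, 48 KB by default on most GPUs, so this only works for small per-thread arrays.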

lwan61c1t3 · September 19, 2014, 2:29pm

That is

Thanks for your suggestion.