Circular buffer class on device (new[] operator)

I am writing a circular buffer class (actually a struct) for running on the device (i.e. in a kernel). How can I allocate memory in the constructor? It seems that I am not allowed to allocate memory once I am running inside a kernel, so do you have any suggestions on how to implement this?

Thanks.
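One common workaround, if in-kernel allocation turns out to be unavailable on your hardware, is to give the buffer a fixed capacity so that no allocation is needed at all. A minimal sketch (the struct name, element type, and capacity are all made up for illustration):

```cuda
// Hypothetical sketch: a fixed-capacity ring buffer that avoids in-kernel
// allocation by embedding its storage directly in the struct.
#define RING_CAPACITY 64

struct RingBuffer {
    float data[RING_CAPACITY];  // storage lives in the struct itself
    int head;                   // index of the oldest element
    int count;                  // number of valid elements

    __device__ void init() { head = 0; count = 0; }

    __device__ void push(float v) {
        data[(head + count) % RING_CAPACITY] = v;
        if (count < RING_CAPACITY)
            ++count;
        else
            head = (head + 1) % RING_CAPACITY;  // full: overwrite the oldest
    }

    __device__ float pop() {
        float v = data[head];
        head = (head + 1) % RING_CAPACITY;
        --count;
        return v;
    }
};
```

Each thread can keep such a struct in local memory (or one per block in shared memory), at the cost of a compile-time size limit.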

CUDA 3.2 makes this possible on Fermi-class hardware. Check the in-kernel malloc() support that is new in 3.2.
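A minimal sketch of what in-kernel allocation looks like, assuming an sm_20 build; the kernel name and heap size are illustrative, and cudaThreadSetLimit is the CUDA 3.2-era call for resizing the device heap:

```cuda
#include <cstdio>

// Each thread allocates from the device heap. Requires compute
// capability 2.0+ and compilation with -arch=sm_20 (or higher).
__global__ void mallocTest()
{
    char* ptr = (char*)malloc(123);
    if (ptr != NULL) {   // malloc can fail if the device heap is exhausted
        ptr[0] = 'x';
        free(ptr);
    }
}

int main()
{
    // Optionally enlarge the device heap before launching (default is 8 MB).
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);
    mallocTest<<<1, 32>>>();
    cudaThreadSynchronize();
    return 0;
}
```

Memory obtained this way persists across kernel launches until freed, so a constructor/destructor pair on a device struct could call malloc/free in the same way.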

I’m using a GTS 450 (Fermi) with CUDA Toolkit 3.2, and the host is MATLAB (so I’m compiling to a PTX file first). For the two lines in my kernel,

char* ptr = (char*)malloc(123);

free(ptr);

I got:

D:/Work/Research/MAF/simulation/gpu/sysID_branch.cu(146): error: calling a host function from a device/global function is not allowed

D:/Work/Research/MAF/simulation/gpu/sysID_branch.cu(147): error: calling a host function from a device/global function is not allowed

Did I miss something? The same thing happens when I try printf().

The -arch=sm_20 command-line option to nvcc is probably needed. For printf(), check the printf example in the SDK, which shows how it should work.
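For reference, a PTX build with that option might look like the following (the file name is taken from the error messages earlier in the thread; the exact invocation is a guess):

```shell
# Compile the kernel to PTX for compute capability 2.0 hardware
nvcc -arch=sm_20 -ptx sysID_branch.cu -o sysID_branch.ptx
```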

After I changed from -arch=sm_13 to -arch=sm_20, I got the following error when MATLAB tries to generate the kernel from the PTX file:

"??? Error using ==> parallel.gpu.CUDAKernel

An error occurred during PTX compilation of .

The information log was:

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_136'

: Retrieving binary for 'cuModuleLoadDataEx_136', for gpu='sm_21', usage mode=' '

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_136'

: Control flags for 'cuModuleLoadDataEx_136' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_136', architecture='compute_20'

: Ptx compilation for 'cuModuleLoadDataEx_136', for gpu='sm_21', ocg options='

The error log was:

The CUDA error code was: CUDA_ERROR_INVALID_IMAGE."

Any ideas?

I see messages mentioning sm_21. Do you have an sm_21 device?

I have no idea. It’s a GTS 450. I changed to -arch=sm_21 and got this:

??? Error using ==> parallel.gpu.CUDAKernel

An error occurred during PTX compilation of .

The information log was:

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_139'

: Retrieving binary for 'cuModuleLoadDataEx_139', for gpu='sm_21', usage mode=' '

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_139'

: Control flags for 'cuModuleLoadDataEx_139' disable search path

: Ptx binary found for 'cuModuleLoadDataEx_139', architecture='compute_20'

: Ptx compilation for 'cuModuleLoadDataEx_139', for gpu='sm_21', ocg options='

The error log was:

The CUDA error code was: CUDA_ERROR_INVALID_IMAGE.

The deviceQuery example from the SDK can tell you the compute capability of your card. My guess is that your GPU is not 2.x-capable, so it cannot do in-kernel malloc on the device.
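If the SDK sample isn't handy, a minimal host-side check in the same spirit as deviceQuery could look like this (assuming device 0 is the card in question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the first CUDA device and print its compute capability.
int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // In-kernel malloc requires major >= 2 (Fermi or newer).
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```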

Here’s my GPU’s device query:

gpuDevice

ans =

parallel.gpu.CUDADevice handle

Package: parallel.gpu

Properties:

                  Name: 'GeForce GTS 450'

                 Index: 1

     ComputeCapability: '2.1'

        SupportsDouble: 1

         DriverVersion: 3.2000

    MaxThreadsPerBlock: 1024

      MaxShmemPerBlock: 49152

    MaxThreadBlockSize: [1024 1024 64]

           MaxGridSize: [65535 65535]

             SIMDWidth: 32

           TotalMemory: 1.0417e+009

            FreeMemory: 996872192

   MultiprocessorCount: 4

  GPUOverlapsTransfers: 1

KernelExecutionTimeout: 0

       DeviceSupported: 1

        DeviceSelected: 1

Seems like I have compute capability 2.1.

Thanks.

Then I would file a bug report with The MathWorks (support@mathworks.com; their support is great).
