Writing to device memory from a CUDA function

OK, this is the first post I’ve ever made on a forum, but I cannot figure this out for the life of me. I’ve got CUDA running just fine, and I can compile the kernel module and load it through the Driver API, but I cannot work out why I can’t write to device memory from within the single CUDA function that I have (see code below)…

#include <stdio.h>

#define RADIUS 25


__device__ unsigned int
make_color(float b, float g, float r, float a)
{
    return
        ((int)(a * 255.0f) << 24) |
        ((int)(r * 255.0f) << 16) |
        ((int)(g * 255.0f) <<  8) |
        ((int)(b * 255.0f) <<  0);
}


extern "C" __global__ void
cu_radialNoise(unsigned int *d_odata, unsigned int *loc, unsigned int *color, int size_x, int size_y, int nCircles)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;

    d_odata[offset] = make_color(1.0, 0.0, 0.0, 0.0);
}
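One detail worth checking in the kernel above is the argument order of make_color: it takes (b, g, r, a), so make_color(1.0, 0.0, 0.0, 0.0) sets the blue byte and leaves alpha at zero. The sketch below repeats the same packing as a plain function so it can be checked on the host. The name pack_color and the HOST_DEVICE macro are hypothetical, unsigned arithmetic is substituted for the (int) casts so that the shift into bit 31 stays well defined, and whether the bitmap really expects a 0xAARRGGBB layout is an assumption.

```cpp
#include <cstdint>

// Let the same function build under nvcc (host + device) or a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Same bit layout as make_color in the post:
// alpha in bits 24-31, red in 16-23, green in 8-15, blue in 0-7.
// Inputs are assumed to lie in [0, 1]; no clamping is done.
HOST_DEVICE inline std::uint32_t pack_color(float b, float g, float r, float a)
{
    return (static_cast<std::uint32_t>(a * 255.0f) << 24) |
           (static_cast<std::uint32_t>(r * 255.0f) << 16) |
           (static_cast<std::uint32_t>(g * 255.0f) <<  8) |
           (static_cast<std::uint32_t>(b * 255.0f) <<  0);
}
```

Under that layout assumption, an opaque red pixel would be pack_color(0.0f, 0.0f, 1.0f, 1.0f), which is 0xFFFF0000, whereas the arguments used in the kernel give 0x000000FF.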


In this code, d_odata should be the CUdeviceptr that I passed into the function. I was trying to do something similar to the code that my professor gave us, but it will not work, and he’s been no help. Here’s the host code (Driver API) that I’m using to call the function. I realize that it’s far from optimal right now; I’ve just been trying everything to get this working. For now I’m only trying to turn the pixel (unsigned integer) array completely red. We have to generate bitmap textures, and until I figure out why I can’t write to device memory, it’s pointless for me to try anything more interesting.


    // Create a context on device 0.
    CUdevice cuDevice = 0;
    CUcontext cuContext = 0;
    cuDeviceGet(&cuDevice, 0);
    cuCtxCreate(&cuContext, 0, cuDevice);

    // Load the compiled PTX module and look up the kernel.
    CUmodule cuModule = 0;
    CUfunction rNoise = 0;
    cuModuleLoad(&cuModule, "radial.ptx");
    cuModuleGetFunction(&rNoise, cuModule, "cu_radialNoise");

    // Stream and timing events.
    CUstream cuStream = 0;
    CUevent start = 0;
    CUevent end = 0;
    cuStreamCreate(&cuStream, 0);
    cuEventCreate(&start, 0);
    cuEventCreate(&end, 0);

    // Device buffers: output image, circle locations, colors.
    CUdeviceptr d_odata = 0;
    CUdeviceptr d_loc = 0;
    CUdeviceptr d_col = 0;
    cuMemAlloc(&d_odata, (unsigned int)(4*size_x*size_y));
    //cuMemcpyHtoD(d_odata, cuImg, (unsigned int)(4*size_x*size_y));
    cuMemAlloc(&d_loc, (unsigned int)(4 * 3 * nCircles));
    cuMemcpyHtoD(d_loc, cuLocation, (unsigned int)(4 * 3 * nCircles));
    cuMemAlloc(&d_col, (unsigned int)(4 * 3 * size_x * size_y));
    cuMemcpyHtoD(d_col, colorsRGB, (unsigned int)(4 * 3 * size_x * size_y));

    // Kernel parameters, in declaration order.
    int offset = 0;
    cuParamSeti(rNoise, offset, (unsigned int)d_odata);
    offset += sizeof(d_odata);
    cuParamSeti(rNoise, offset, (unsigned int)d_loc);
    offset += sizeof(d_loc);
    cuParamSeti(rNoise, offset, (unsigned int)d_col);
    offset += sizeof(d_col);
//    cuParamSetv(rNoise, offset, &d_odata, (unsigned int)(4*size_x*size_y));
//    offset += sizeof(d_odata);
//    cuParamSetv(rNoise, offset, &d_loc, (unsigned int)(4*3*nCircles));
//    offset += sizeof(d_loc);
//    cuParamSetv(rNoise, offset, &d_col, (unsigned int)(4*3*size_x*size_y));
//    offset += sizeof(d_col);
    cuParamSeti(rNoise, offset, (unsigned int)size_x);
    offset += sizeof(int);
    cuParamSeti(rNoise, offset, (unsigned int)size_y);
    offset += sizeof(int);
    cuParamSeti(rNoise, offset, (unsigned int)nCircles);
    offset += sizeof(int);
    cuParamSetSize(rNoise, (unsigned int)offset);

    // Launch, then copy the image back to the host.
    cuEventRecord(start, cuStream);
    cuFuncSetBlockShape(rNoise, 1, 1, 1);
    cuLaunchGrid(rNoise, (size_x + 7)/size_x, (size_y + 7) / size_y);

    cuMemcpyDtoH(pimageData, (CUdeviceptr)d_odata, (unsigned int)(4*size_x*size_y));

    cuEventRecord(end, cuStream);

    GpuTime = 0.0;
    cuEventElapsedTime(&GpuTime, start, end);
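For the parameter setup above, one thing to compare against is how the legacy cuParamSet* calls expect the parameter buffer to be laid out: each value sits at an offset aligned to its own alignment, and cuParamSetv takes the address of the value plus the size of that value in bytes (so sizeof(d_odata), not the size of the buffer it points to). cuParamSeti writes a 4-byte integer, so the (unsigned int) casts would drop the upper half of an 8-byte CUdeviceptr. The sketch below only does the offset arithmetic on the host and calls no CUDA function; ParamSlot, align_up, and param_offsets are hypothetical names, and 8-byte device pointers are an assumption about the build.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size and alignment of one kernel parameter, in bytes.
struct ParamSlot {
    std::size_t size;
    std::size_t align;
};

// Round offset up to the next multiple of alignment (alignment must be a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Returns the byte offset of every parameter, followed by the total size
// that would be handed to cuParamSetSize.
inline std::vector<std::size_t> param_offsets(const std::vector<ParamSlot>& slots)
{
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    for (const ParamSlot& slot : slots) {
        offset = align_up(offset, slot.align);  // align before placing the value
        offsets.push_back(offset);
        offset += slot.size;                    // advance by the value's own size
    }
    offsets.push_back(offset);
    return offsets;
}
```

With three 8-byte pointers followed by three ints, this gives offsets 0, 8, 16, 24, 28, 32 and a total of 36 bytes. On a build where CUdeviceptr is 4 bytes, the running sizeof-based offsets in the post come out the same as the aligned ones.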






As I mentioned, the call works and there are no issues up to that point, but I want to write to the device memory: turning the entire image red for now, then doing something a little more interesting once I can actually write to and read from the unsigned integer array.
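Turning the whole image red needs one thread per pixel, so the launch configuration is worth a look as well. With cuFuncSetBlockShape(rNoise, 1, 1, 1) and a grid of (size_x + 7)/size_x by (size_y + 7)/size_y, integer division gives a 1 by 1 grid for any dimension of 8 or more, which is a single thread writing d_odata[0]. The sketch below emulates a 1-D launch on the host to count how many pixels get written. blocks_needed and emulate_fill are hypothetical helpers, the 256-thread block size is an arbitrary assumption, and the bounds check stands in for an `if (offset < size_x * size_y)` guard that the posted kernel does not have.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Ceiling division: blocks needed so that blocks * threads_per_block >= n.
inline unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side emulation of a 1-D launch. Each (block, thread) pair computes the
// same offset as the kernel (threadIdx.x + blockIdx.x * blockDim.x) and writes
// one element if it is in range. Returns the number of elements written.
inline std::size_t emulate_fill(std::vector<std::uint32_t>& out,
                                unsigned int grid_dim, unsigned int block_dim,
                                std::uint32_t value)
{
    std::size_t written = 0;
    for (unsigned int block = 0; block < grid_dim; ++block) {
        for (unsigned int thread = 0; thread < block_dim; ++thread) {
            std::size_t offset = thread + static_cast<std::size_t>(block) * block_dim;
            if (offset < out.size()) {  // skip threads past the end of the image
                out[offset] = value;
                ++written;
            }
        }
    }
    return written;
}
```

For a 64 by 48 image, the launch from the post writes 1 of 3072 pixels in this emulation, while 12 blocks of 256 threads cover all of them.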

Thanks for any help… I have a feeling it’ll be something stupid, but the teacher hasn’t explained any of it, and I cannot figure out why what worked in the Matrix Multiplication example he gave us doesn’t work here. Looking at the other CUDA examples hasn’t helped either… I must be overlooking something.

Never mind… resolved.