Beginner Question: Writing uchar4 to memory

I wrote the following function:

void KleineTestfunktion( void )
{
     uchar4 *pToTest;
     cudaMalloc( (void**) &pToTest, 3 * 3 * sizeof(uchar4) );

     uchar4 tmp = { 0x00, 0x00, 0x00, 0x00 };
     for(int i = 0; i < 9; i++)
     {
          // If I uncomment the following line, the execution fails
          //*(pToTest + i) = tmp;
          tmp.x ++; tmp.y ++; tmp.z ++; tmp.w ++;
     }

     cudaFree(pToTest);
}

I tried to write the “tmp” uchar4 to the allocated memory. Why does it not work?

Thanks for your help!

EDIT:

I found out:

If I fill the memory from a kernel, it runs.

But why do I have to use a kernel?

cudaMalloc allocates memory on the device, not in your system's RAM. The pointer it returns is a device address, so if you dereference it on the host you end up reading or writing some arbitrary location in host address space, which typically ends in a segfault.

If you want to initialize the device memory to 0 from the host, use cudaMemset. For any other values, fill a buffer in host memory and transfer it with cudaMemcpy.
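As a rough sketch of the host-side route, the test function from the question could be restructured like this. The function name copyTestPatternToDevice, the local buffers, and the read-back check are made up for illustration; only cudaMalloc, cudaMemcpy, and cudaFree are the actual runtime calls.

```cuda
#include <cuda_runtime.h>

// Hypothetical rewrite of KleineTestfunktion: build the values in host memory,
// then transfer them to the device in one call.
// Returns true if every CUDA call succeeded and the round trip matched.
bool copyTestPatternToDevice(void)
{
    const int count = 3 * 3;
    uchar4 hostData[count];   // ordinary host memory, safe to write from the CPU
    uchar4 readBack[count];   // used only to verify the copy

    for (int i = 0; i < count; i++) {
        unsigned char v = (unsigned char)i;
        hostData[i] = make_uchar4(v, v, v, v);
    }

    uchar4 *pToTest = NULL;   // device pointer: never dereference it on the host
    if (cudaMalloc((void **)&pToTest, sizeof(hostData)) != cudaSuccess)
        return false;

    // The host-to-device copy takes the place of "*(pToTest + i) = tmp;"
    bool ok = cudaMemcpy(pToTest, hostData, sizeof(hostData),
                         cudaMemcpyHostToDevice) == cudaSuccess;

    // Device-to-host copy, done here only to check the result
    ok = ok && cudaMemcpy(readBack, pToTest, sizeof(readBack),
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(pToTest);

    for (int i = 0; ok && i < count; i++)
        ok = (readBack[i].x == i) && (readBack[i].w == i);
    return ok;
}
```

Checking the return value of each runtime call, as above, also makes this kind of mistake easier to spot than a bare crash.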

I think this brought me a lot further. I was hunting an error in another project, where I programmed an image convolution filter.

In my DLL I launched my kernel like this:

float operatorMask[5][5] =
{ { 0, 0,-1, 0, 0},
  { 0,-1,-2,-1, 0},
  {-1,-2,16,-2,-1},
  { 0,-1,-2,-1, 0},
  { 0, 0,-1, 0, 0} };

ImageConvolutionFromTexutre<<<gridDim_2D_2, blockDim_2D_2>>>(tempIamge_2_uchar4, &operatorMask[0][0], 2, 2);

I assume the kernel execution failed because my operatorMask was stored in CPU RAM and not in GPU RAM, so the kernel received a host pointer.

Would it be best to store my operatorMask in constant memory?
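A minimal sketch of the constant-memory variant, under some assumptions: dOperatorMask, sumMaskKernel, and uploadMaskAndSum are hypothetical names, and the kernel only sums the mask so that the upload can be checked. It is a stand-in for the real convolution kernel, which is not shown in this thread.

```cuda
#include <cuda_runtime.h>

// 5x5 mask in constant memory, visible to every kernel in this source file
__constant__ float dOperatorMask[5 * 5];

// Hypothetical stand-in for the convolution kernel: it reads the mask directly
// from constant memory instead of receiving a pointer argument.
__global__ void sumMaskKernel(float *out)
{
    float sum = 0.0f;
    for (int i = 0; i < 25; i++)
        sum += dOperatorMask[i];
    *out = sum;
}

// Uploads the mask from the host and runs the check kernel.
// Returns true if all CUDA calls succeeded; *sumOut receives the mask sum.
bool uploadMaskAndSum(float *sumOut)
{
    float operatorMask[5][5] =
    { { 0, 0,-1, 0, 0},
      { 0,-1,-2,-1, 0},
      {-1,-2,16,-2,-1},
      { 0,-1,-2,-1, 0},
      { 0, 0,-1, 0, 0} };

    // Host array -> constant memory (default direction is host to device)
    if (cudaMemcpyToSymbol(dOperatorMask, operatorMask,
                           sizeof(operatorMask)) != cudaSuccess)
        return false;

    float *dSum = NULL;
    if (cudaMalloc((void **)&dSum, sizeof(float)) != cudaSuccess)
        return false;

    sumMaskKernel<<<1, 1>>>(dSum);

    // This cudaMemcpy waits for the kernel to finish before copying
    bool ok = cudaMemcpy(sumOut, dSum, sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dSum);
    return ok;
}
```

With this layout the mask no longer has to be passed as a kernel argument, so there is no host pointer for the kernel to dereference by mistake.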

EDIT:

Thanks… it runs fine now!

Constant memory works best when all threads of a warp read the same element at the same time. A texture is usually somewhat faster if each thread accesses a different element.