Is read/write to the same array allowed?

Hi all,

Can I first read an element of an array, and then write to it? I want to normalize some 3D vectors, and along the way save their former magnitude into the unused 4th component:

__global__ void grNormalize(const unsigned size, float4 *gr)

{

    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx>=size)

        return; //over size

    float4 vec=gr[idx];

    vec.w = sqrtf(dot(vec, vec));

    vec.x/=vec.w;

    vec.y/=vec.w;

    vec.z/=vec.w;

    gr[idx]=vec;

}

Is there something wrong with this function, or is the error in the rest of my program?

Regards,

Dženan

Of course you can do that. Can you show the actual code you are trying to use? (the code you posted has errors in it that would fail to compile).

Your dot product is going to include the w component. What you really want is the dot product of only the first 3 components.
Are you sure vec.w is 0 before calculating the dot product?
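
Something like this would keep w out of the magnitude (just a rough sketch; dot3 is a helper name I made up, not a CUDA built-in):

__device__ float dot3(const float4 v)
{
    // only x, y and z contribute; w is deliberately ignored
    return v.x*v.x + v.y*v.y + v.z*v.z;
}

// in the kernel: vec.w = sqrtf(dot3(vec));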

avidday, the whole program compiles and runs without crashing.

dittoaway, I initially thought you found the problem, but then I tried it. It makes no difference.

I initialize the 4th component to 0 at the same time I initialize xyz components, so it has no influence on dot product.

In addition to the kernel in the first post, this is the code that calls it (I removed unrelated things):

void cppFunction(unsigned xsize, unsigned ysize, unsigned zsize, float *gv)

{

    unsigned size=xsize*ysize*zsize;

    unsigned byteSize=size*sizeof(float);

    cudaExtent size3;

    size3.width=xsize;

    size3.height=ysize;

    size3.depth=zsize;

unsigned iGr=0, iGv=0;

    float *gr=(float *)malloc(byteSize*(3+1));//4 floats, 3 for vector field and 1 for scalar field

    for (unsigned z=0; z<zsize; z++)

        for (unsigned y=0; y<ysize; y++)

            for (unsigned x=0; x<xsize; x++)

            {

                gr[iGr++]=gv[iGv++]; //x

                gr[iGr++]=gv[iGv++]; //y

                gr[iGr++]=gv[iGv++]; //z

                gr[iGr++]=0; //w

            }

    float4 *gpuGr;

    cudaMalloc(&gpuGr, size*sizeof(float4));

cudaMemcpy3DParms copyParamsGr = {0};

    copyParamsGr.srcPtr=make_cudaPitchedPtr((void*)gr, xsize*sizeof(float4), xsize, ysize);

    copyParamsGr.dstPtr=make_cudaPitchedPtr((void*)gpuGr, xsize*sizeof(float4), xsize, ysize);

    copyParamsGr.extent=size3;

    copyParamsGr.kind=cudaMemcpyHostToDevice;

    cudaMemcpy3D(&copyParamsGr);

    cutilCheckMsg("Failed to allocate vector 3D image");

grNormalize<<< ceil(size/256.0), 256 >>>(size, gpuGr);

copyParamsGr.srcPtr=make_cudaPitchedPtr((void*)gpuGr, xsize*sizeof(float4), xsize, ysize);

    copyParamsGr.dstPtr=make_cudaPitchedPtr((void*)gr, xsize*sizeof(float4), xsize, ysize);

    copyParamsGr.kind=cudaMemcpyDeviceToHost;

    cudaMemcpy3D(&copyParamsGr);

    cutilCheckMsg("Failed to copy back vector 3D image");

//cudaPrintfDisplay(std::stderr, true);

std::ofstream f("gr.bin", std::ios::binary);

    f.write((char *)gr, size*sizeof(float4));

}

There are no errors reported during execution, but the file contains junk.

Now you’ve confused me.
In your kernel, you’re dealing with a flattened 1D vector of float4 but for some reason you load and unload 3D arrays. Why not leave it 1D?

Originally, I have a 3D scalar field. In order to do some computation, I also need its gradient vector field.
scalarField: xyz floats
vectorField: xyz float3s
since float3 is not properly supported, I use float4

And I don’t leave it 1D because I later bind it to a texture, and I want good cache behaviour for it.

Fine, but when you use a cudaMemcpy3D, you would normally allocate the memory using cudaMalloc3D.
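
Roughly like this, for your sizes (an untested sketch; note that for linear allocations the extent width is specified in bytes):

cudaExtent extent = make_cudaExtent(xsize * sizeof(float4), ysize, zsize);

cudaPitchedPtr devPitched;
cudaMalloc3D(&devPitched, extent);

cudaMemcpy3DParms p = {0};
p.srcPtr = make_cudaPitchedPtr((void*)gr, xsize * sizeof(float4), xsize, ysize);
p.dstPtr = devPitched;             // pitched pointer returned by cudaMalloc3D
p.extent = extent;
p.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);

The kernel would then have to index through devPitched.pitch instead of assuming tightly packed rows.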

But cudaMalloc3D doesn’t allow me to set the elements’ byte size. This is its signature:

cudaError_t cudaMalloc3D (struct cudaPitchedPtr * pitchedDevPtr, struct cudaExtent extent)

Edit: cudaMalloc3DArray allows that, but I cannot write to an array.

" but I cannot write to an array. "

??

This part of the code can be 1D.

It’s the mixing that is the problem.

I meant cudaArray. I think I have read some posts about it not being directly writable by kernel functions.

Edit: That is my next step, now that you have pointed me in that direction. Right now I am trying to use the cuPrintf function for debugging.

I was referring to this line in your kernel:

unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

which has no type specifier. I am surprised it compiles.

As for the rest, you will have to use linear memory for that kernel, then copy the results into a 3D array for binding to the texture. You cannot mix linear memory and arrays the way your code is currently doing.
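
The device-to-array copy can look roughly like this (an untested sketch using your names; when a cudaArray participates, the extent is given in elements rather than bytes):

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
cudaArray *texArray;
cudaMalloc3DArray(&texArray, &desc, make_cudaExtent(xsize, ysize, zsize));

cudaMemcpy3DParms p = {0};
p.srcPtr   = make_cudaPitchedPtr((void*)gpuGr, xsize * sizeof(float4), xsize, ysize);
p.dstArray = texArray;
p.extent   = make_cudaExtent(xsize, ysize, zsize);
p.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&p);

// texArray can then be bound to a 3D texture with cudaBindTextureToArray.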

“unsigned” is considered equivalent to “unsigned int” in every compiler I’ve used. I’m not sure if that is part of an official C standard or not.

signed and unsigned may be used as standalone type specifiers, meaning the same as signed int and unsigned int respectively.
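
For example, these two declarations mean exactly the same thing:

unsigned n = 0;      // standalone type specifier
unsigned int m = 0;  // the same type spelled out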

I made some modifications, but cuPrintf produces no output, and the gr array contains junk. Code:

__global__ void grNormalize(const unsigned size, float4 *gr)

{

    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx>=size)

        return; //over size

    float4 vec=gr[idx];

    vec.w = sqrtf(dot(vec, vec));

    vec.x/=vec.w;

    vec.y/=vec.w;

    vec.z/=vec.w;

    if (idx%100000==0)

        cuPrintf("Vec4: (%f, %f, %f) %f", vec.x, vec.y, vec.z, vec.w);

    gr[idx]=vec;

}

void entryPoint(unsigned xsize, unsigned ysize, unsigned zsize, float *gv)

{

    unsigned size=xsize*ysize*zsize;

    unsigned byteSize=size*sizeof(float);

    cudaExtent size3;

    size3.width=xsize;

    size3.height=ysize;

    size3.depth=zsize;

unsigned iGr=0, iGv=0;

    float *gr=(float *)malloc(byteSize*(3+1));//4 floats, 3 for vector field and 1 for scalar field

    for (unsigned z=0; z<zsize; z++)

        for (unsigned y=0; y<ysize; y++)

            for (unsigned x=0; x<xsize; x++)

            {

                gr[iGr++]=gv[iGv++]; //x

                gr[iGr++]=gv[iGv++]; //y

                gr[iGr++]=gv[iGv++]; //z

                gr[iGr++]=0; //w

            }

    float4 *gpuGr;

    cudaMalloc(&gpuGr, size*sizeof(float4));

    cudaMemcpy(gpuGr, gr, size*sizeof(float4), cudaMemcpyHostToDevice);

cudaPrintfInit();

    grNormalize<<< ceil(size/256.0), 256 >>>(size, gpuGr);

cudaMemcpy(gr, gpuGr, size*sizeof(float4), cudaMemcpyDeviceToHost);

    cutilCheckMsg("Failed to copy back vector 3D image");

cudaPrintfDisplay(stdout, true);

    cudaPrintfEnd();

ofstream f("gr.txt");

    for (unsigned i=150*4; i<160*4; i+=4)

        f<<"("<<gr[i]<<","

	<<gr[i+1]<<","

	<<gr[i+2]<<") "

	<<gr[i+3]<<"\n";

    f.close(); //human-readable output this time

}

gr.txt:

(-0.948551,0.312553,0.050619) 0.00889228

(-0.924904,0.371274,0.0819073) 0.0263282

(-0.893727,0.434073,0.113281) 0.0634654

(-0.877455,0.457744,0.14333) 0.13156

(-0.891421,0.421505,0.166439) 0.23949

(-0.931114,0.320323,0.174413) 0.394078

(-0.970894,0.177909,0.160354) 0.609986

(-0.990702,0.0457293,0.128133) 0.907309

(-0.995221,-0.0368888,0.0904122) 1.28538

(-0.996135,-0.0666854,0.0571724) 1.70843

Do you have any idea what I am doing wrong this time?

Add some error checking. Every runtime API call returns a status; you should be checking all of them to see what they return. You should also get into the habit of calling cudaGetLastError() after a runtime API kernel launch and checking the status it returns.

I like to use something like these:

#ifndef gpuAssert

#include <stdio.h>

#define gpuAssert( condition ) { if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %s in %s, line %d\n", cudaGetErrorString(condition), __FILE__, __LINE__ ); exit( 1 ); } }

#endif

#ifndef gpuCheck

#define gpuCheck  { gpuAssert( cudaGetLastError() ) }

#endif

then you can do stuff like

int main(void)

{

const dim3 blocksz = dim3(4,16,1);

        const dim3 gridsz = dim3(16,4,1);

float *_x;

        int *_L;

gpuAssert( cudaMalloc((void **)&_x, sizeof(float) * memsize(blocksz, gridsz)) );

        gpuAssert( cudaMalloc((void **)&_L, sizeof(int) * memsize(blocksz, gridsz)) );

myMemset <float> <<< gridsz, blocksz >>> (_x, float(0.9)); gpuCheck;

        myMemset <int> <<< gridsz, blocksz >>> (_L, int(1)); gpuCheck;

        voting <<< gridsz, blocksz >>> (_x, _L, int(64), int(4)); gpuCheck;

int * L = (int *)malloc(sizeof(int) * memsize(blocksz, gridsz));

        gpuAssert( cudaMemcpy( L, _L, sizeof(int) * memsize(blocksz, gridsz), cudaMemcpyDeviceToHost) );

float * x = (float *)malloc(sizeof(float) * memsize(blocksz, gridsz));

        gpuAssert( cudaMemcpy( x, _x, sizeof(float) * memsize(blocksz, gridsz), cudaMemcpyDeviceToHost) );

free(L); free(x);

gpuAssert( cudaThreadExit() );

}

which adds error checking to every call with minimal clutter.

I was using cutilSafeCall, but it was creating some additional problems.
For example, cutilSafeCall(cudaMalloc3DArray(…)); was reporting unknown error, whereas
cudaMalloc3DArray(…); cudaGetLastError(); was reporting cudaSuccess.

After that I removed most of the error checking code, only leaving a small amount of error checking in the form of cutilCheckMsg("Point 1");

The mistake must be somewhere in the code you are not showing. When I hack together this:

#include <iostream>

#include <fstream>

using namespace std;

#ifndef gpuAssert

#include <stdio.h>

#define gpuAssert( condition ) { if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %s in %s, line %d\n", cudaGetErrorString(condition), __FILE__, __LINE__ ); exit( 1 ); } }

#endif

#ifndef gpuCheck

#define gpuCheck  { gpuAssert( cudaGetLastError() ) }

#endif

__device__ float dot( const float4 v, const float4 w )

{

    return (v.x*w.x) + (v.y*w.y) + (v.z*w.z) + (v.w*w.w);

}

__global__ void grNormalize(const unsigned size, float4 *gr)

{

    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx>=size) return; //over size

    float4 vec=gr[idx];

    vec.w = sqrtf(dot(vec, vec));

    vec.x/=vec.w;

    vec.y/=vec.w;

    vec.z/=vec.w;

    gr[idx]=vec;

}

void entryPoint(unsigned xsize, unsigned ysize, unsigned zsize, float *gv)

{

    unsigned size=xsize*ysize*zsize;

    unsigned byteSize=size*sizeof(float);

unsigned iGr=0, iGv=0;

    float *gr=(float *)malloc(byteSize*(3+1));//4 floats, 3 for vector field and 1 for scalar field

    for (unsigned z=0; z<zsize; z++)

        for (unsigned y=0; y<ysize; y++)

            for (unsigned x=0; x<xsize; x++)

            {

                gr[iGr++]=gv[iGv++]; //x

                gr[iGr++]=gv[iGv++]; //y

                gr[iGr++]=gv[iGv++]; //z

                gr[iGr++]=0; //w

            }

    float4 *gpuGr;

    gpuAssert( cudaMalloc(&gpuGr, size*sizeof(float4)) );

    gpuAssert( cudaMemcpy(gpuGr, gr, size*sizeof(float4), cudaMemcpyHostToDevice) );

grNormalize<<< ceil(size/256.0), 256 >>>(size, gpuGr); gpuCheck;

gpuAssert( cudaMemcpy(gr, gpuGr, size*sizeof(float4), cudaMemcpyDeviceToHost ) );

ofstream f("gr.txt");

    for (unsigned i=0; i<32; i+=4)

        f<<"("<<gr[i]<<","

<<gr[i+1]<<","

        <<gr[i+2]<<") "

        <<gr[i+3]<<"\n";

    f.close(); //human output this time

}

int main(void)

{

const unsigned int xs = 128, ys = 128, zs = 128;

    const unsigned int sz = 3 * xs * ys * zs;

float * v = (float *)malloc( size_t(sz) * sizeof(float) );

    for(unsigned int i=0; i< sz; i++) { 

        v[i] = (float)i / 1.e6f;

    }

entryPoint(xs, ys, zs, v);

gpuAssert(cudaThreadExit());

}

It compiles and runs like this:

avidday@cuda:~$ nvcc -arch=sm_20 kuusipää.cu -o kuusipää

avidday@cuda:~$ ./kuusipää 

avidday@cuda:~$ cat gr.txt 

(0,0.447214,0.894427) 2.23607e-06

(0.424264,0.565685,0.707107) 7.07107e-06

(0.491539,0.573462,0.655386) 1.22066e-05

(0.517892,0.575435,0.632979) 1.73781e-05

(0.531891,0.576215,0.620539) 2.2561e-05

(0.540563,0.5766,0.612638) 2.77489e-05

(0.546459,0.576818,0.607177) 3.29393e-05

(0.550728,0.576953,0.603178) 3.81314e-05

and those results look about right to my untrained eye.

WTF was I thinking? Maybe I got used to not getting anything done with CUDA code, so I did not realize this was the result I was supposed to be getting!
I just realized that the very small gradient magnitude values are not junk, but an indication of a very small gradient in the upper left corner of the image :D. I am soooo embarrassed.

Thank you avidday and dittoaway, you helped me a lot.

Now off to do the stuff which uses this image :D. Hopefully I will not require much help with it, since I am armed with the experience acquired so far.