Having 2d array issues

I’m developing a particle simulation using CUDA and C. I’m having quite a bit more trouble than I should with 2D arrays (and assorted other memory issues).

The following is a simplified test program to illustrate some of the issues I’m having in my main code right now:

#include <stdio.h>
#include <cuda_runtime.h>
#include "visual/myhelpers.h"

float *fDev;
size_t fPitch;

const int npart = 50;
const int ndim = 2;

__global__ void kernel(float *myfl, int fpitch) {

        const int n = threadIdx.x;
        const int dim = threadIdx.y;

        float *f = (float*)((char*)myfl + dim * fpitch) + n;

        *f = 0;

        __syncthreads();

        printf("n=%i *f= %f / %f, n: %i, dim:%i \n", n, f[n], f[n],  n, dim);

        if (*f !=0)
                printf("BUGBUGBUG *f= %f, n: %i, dim: %i\n", *f, n, dim);

}

void init() {
        HANDLE_ERROR(cudaMallocPitch(&fDev, &fPitch, npart * sizeof(float), ndim));
}

int main() {
        dim3 threads = dim3(npart, ndim);
        init();

        HANDLE_ERROR(cudaMemset2D(fDev, fPitch, 0, npart*sizeof(float), ndim));
        kernel<<<1, threads>>>(fDev, fPitch);
        cudaDeviceSynchronize();

        return 0;
}

myhelpers.h is only there for HANDLE_ERROR, whose function you can probably guess.
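For completeness, it’s just the usual error-checking wrapper, something along these lines (the exact macro in myhelpers.h may differ slightly; exit() needs <stdlib.h>):

#define HANDLE_ERROR(call)                                            \
        do {                                                          \
                cudaError_t err = (call);                             \
                if (err != cudaSuccess) {                             \
                        /* print the CUDA error string and bail */    \
                        fprintf(stderr, "%s in %s at line %d\n",      \
                                cudaGetErrorString(err),              \
                                __FILE__, __LINE__);                  \
                        exit(EXIT_FAILURE);                           \
                }                                                     \
        } while (0)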

This code is not tripping the BUGBUGBUG line, but it still comes back with *f = -nan sometimes.

To explain some oddities in the code: commenting out the *f = 0; line changes nothing. I eventually want to be rid of that line, but the cudaMemset2D call isn’t always doing what it should (so it’s there as a backup). The printf line prints the value twice because I was originally printing %x for the second one - that gives even weirder results: the second n prints as a strange large value, and dim prints out as n.

When my actual program runs, the BUGBUGBUG check (i.e. *f != 0) triggers fairly often - the value is either nan or one of two or three small numbers (which seem to correspond to the bit patterns 0x80000000, 0xa0000000, or 0xe0000000). It also seems to be limited to a few specific values of n (in particular 25 and 46).

I’m compiling with: nvcc -arch=sm_30 -g -G -o bugtest bugtest.cu, on a system with two Quadro K5000 cards (obviously only using one at a time).

Any ideas? I’ve been just scratching my head on this one.

“float *f = (float*)((char*)myfl + dim * fpitch) + n;”

what exactly are you attempting to do here…?

“const int dim = threadIdx.y;”

are you sure about this - should it not be blockIdx.x…?
based on your dimensions, threadIdx.x == 0 for all threads, if i am not mistaken

The documentation for cudaMallocPitch says to access an array like this:

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

…so did I do it wrong?

Also, the kernel is called with <<<1, (50,2)>>>, so blockIdx.x would always be 0 (my real code is more complicated than this: it gets called with threads = (16, 16, 2) and however many blocks are needed to cover the number of particles). In either case, n and dim are correct according to the printout, so that’s not my issue.
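For reference, the real launch is set up roughly like this (a sketch from memory, not the exact source; the particle count here is a placeholder):

const int npart = 100000;                       // placeholder particle count
dim3 threads(16, 16, 2);                        // 16*16 particles per block, 2 dimensions in z
dim3 blocks((npart + 16*16 - 1) / (16*16));     // enough blocks to cover every particle
kernel<<<blocks, threads>>>(fDev, fPitch);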

a number of concerns:

  1. fPitch is size_t; yet, you pass it to the kernel as int

  2. you calculate the address of a pitch element with:
    float *f = (float*)((char*)myfl + dim * fpitch) + n;

a) in principle this is correct; however, the column number - n - is left outside of the bracket, and can upset the address, since its contribution depends on the size of the data type - float in this case, which is more than 1 byte

b) you then subsequently use f[n], which is really not advisable since f is now an element address, not a row address

instead consider using
float *f = (float*)((char*)myfl + dim * fpitch);
in conjunction with f[n];
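in kernel form, that would look something like this (just a sketch):

__global__ void kernel(float *myfl, size_t fpitch) {
        const int n = threadIdx.x;        // column (particle index)
        const int dim = threadIdx.y;      // row (dimension)

        // row pointer: only the pitch, which is in bytes, goes inside the char* cast
        float *f = (float*)((char*)myfl + dim * fpitch);

        f[n] = 0;                         // index the column through the row pointer
}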

Okay, I left some f[n]'s in there from various tests with different ways to dereference the pointer. This test code works fine either as I have it (element pointer, *f) or as you suggested (row pointer plus f[n]). A note: it is correct to leave the + n outside the bracket, as this adds n to the float pointer, which advances it by n * sizeof(float) bytes. Only the pitch belongs inside the cast, since it is in bytes. Again, that line is straight out of the CUDA documentation.
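To spell it out, both forms land on the same address:

float *row = (float*)((char*)myfl + dim * fpitch);     // pitch is in bytes, so it goes inside the char* cast
float *a   = row + n;                                  // float* arithmetic: advances by n * sizeof(float) bytes
float *b   = (float*)((char*)row + n * sizeof(float)); // same address, written out in bytes
// a == b, so leaving "+ n" outside the cast is fine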

I have the types correct as size_t now in both the test program and the real program.

The bug seems to come in when I have other, seemingly unrelated code nearby. The more I strip out of the code, the less likely it is to occur, but none of the code I take out has anything to do with f, *f, or anything like that. I apologize that it’s a bit ugly (there’s a lot of functionality that I’ve taken out, and I’m still working on optimization, of course), but http://polypux.org/hidden/bug.zip is a fairly stripped-down test case that still bugs out to a significant degree. I’m compiling with “nvcc -arch=sm_30 -o striptest striptest.cu”. If someone could take a look at that and tell me why f is not filled with 0’s, that would be fantastic.

“…and tell me why f is not filled with 0’s, that would be fantastic”

I think you are really facing 2 possible cases here:
a) the array pointed to by f is not actually filled with 0s; hence, any read of it will return values that are not 0s
b) the array pointed to by f is indeed filled correctly with 0s, but subsequent reads of the array are done improperly, producing values that are not 0s

(The easiest way to test for b) would be simply to copy the array to a host array and read/test the values on the host)
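something along these lines, reusing the names from the test program (fHost is just a placeholder):

float fHost[ndim][npart];

// copy the pitched device array back to a tightly packed host array
HANDLE_ERROR(cudaMemcpy2D(fHost, npart * sizeof(float),
                          fDev, fPitch,
                          npart * sizeof(float), ndim,
                          cudaMemcpyDeviceToHost));

for (int dim = 0; dim < ndim; dim++)
        for (int n = 0; n < npart; n++)
                if (fHost[dim][n] != 0.0f)
                        printf("nonzero on host: dim=%d n=%d val=%f\n", dim, n, fHost[dim][n]);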

I had to ditch 2D arrays in my kernel yesterday because threads were somehow accessing and modifying data in regions that they were never told to. I am using two 1D arrays now. I submitted a bug report. Good luck.

So, I spent yesterday learning more about threads and blocks - my problem here was the assumption that __syncthreads() was equivalent to a device-side cudaDeviceSynchronize(). __syncthreads() only synchronizes the threads within a block, so some blocks got ahead of others, completing their atomicAdd before another block had run its zero check.
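In case it helps anyone else, the shape of the fix is to get the grid-wide ordering from separate kernel launches instead of __syncthreads(); very roughly (the kernel names here are made up, not my actual code):

// zeroing and accumulating in one kernel needed ordering across blocks,
// which __syncthreads() cannot provide, so split it into two launches
zero_forces<<<blocks, threads>>>(fDev, fPitch);        // every block zeroes its slice
accumulate_forces<<<blocks, threads>>>(fDev, fPitch);  // atomicAdds only start after the first kernel has finished
HANDLE_ERROR(cudaDeviceSynchronize());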

I’ve also managed to speed up my code by a few orders of magnitude by using a shared-memory matrix reduction algorithm, so I’ve got that going for me :P
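For anyone curious, the reduction is just the standard shared-memory tree pattern, roughly like this (simplified to a 1D sum over 256-thread blocks, not my actual matrix code):

__global__ void reduce_sum(const float *in, float *out, int len) {
        __shared__ float s[256];                  // assumes blockDim.x == 256 (a power of two)
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        s[tid] = (i < len) ? in[i] : 0.0f;        // one element per thread, zero-padded at the end
        __syncthreads();

        // halve the active threads each step, summing pairs in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (tid < stride)
                        s[tid] += s[tid + stride];
                __syncthreads();
        }

        if (tid == 0)
                atomicAdd(out, s[0]);             // fold each block's partial sum into the result
}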