Simple Kernel, but having problems...ideas

Hi all,

I’m just playing around here, trying to learn some basics. I made a simple kernel that takes in a set of points and computes the Euclidean distance between each pair of points, saving each distance into an array. I keep all the x coordinates in one array, x_d. The y coordinates are kept in another array, y_d.

Here’s the Kernel, and yeah, it is very basic.

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numPts) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numPts+ j;

	

	extern __shared__ int coords[];

	int *newX, *newY;

	newX = (int*)coords;

	newY = (int*)&newX[128];

	__syncthreads();

	int thread_id = blockIdx.x*blockDim.x + threadIdx.x;

	__syncthreads();

	if ( thread_id < numPts) {

		newX[thread_id] = x_d[thread_id];

		newY[thread_id] = y_d[thread_id];

	}

	__syncthreads();

	

	int xDiff = newX[i] - newX[j];

	int yDiff = newY[i] - newY[j];

	

	__syncthreads();

	

	if (i < numPts && j < numPts)

		distance_d[idx] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

}

This seems to work if I make numPts just 10 or so. Meaning, my distance array, which saves all the distances, ends up being 10x10x(sizeof(int)). Once the result is computed and brought back to main, I print out the distance array and it looks good…again, as long as I keep numPts to just 10 or so points.

Here’s the Problem:

But then, when I increase numPts to 1000, for example, the simulation runs. However, the distance array returns ALL ZEROES! It no longer computes the distances and just returns a 1000x1000 array of ZEROES.

Is the distance array huge? Yeah. For 1000 points, the distance array is 1000x1000x(sizeof(int)), which is about 4MB. I don’t know if this has anything to do with it.

I know there is a space limitation on these GPUs, but I don’t know where to find that. I have a GTX 460. But I don’t think this is a space issue, as the code worked previously on an old Geforce 8800.

Just stumped as to what’s going on.

Thanks for any help you can give.

The kernel isn’t running at the large size because of a shared memory buffer overflow caused by this:

if ( thread_id < numPts) {

    newX[thread_id] = x_d[thread_id];

    newY[thread_id] = y_d[thread_id];

}

The actual distance calculations are probably broken along similar lines as well…

AWESOME! Thanks for the reply.

Um, and I’m trying here, but unfortunately I’m not getting you 100% (or even 70%, for that matter).

Why is there a shared memory buffer overflow?

I did use newX and newY as shared memory. But where is the overflow and why? I thought the purpose of this if statement was to prevent precisely that. (clearly I need a LOT more studying and practice).

How do you call the kernel?

Here’s my EC and the Kernel call:

// Execution Configuration:

dim3 dimBlock(blockSize/16, blockSize/16);

dim3 dimGrid(numPts/dimBlock.x + (numPts%dimBlock.x == 0 ? 0:1), numPts/dimBlock.y + (numPts%dimBlock.y == 0 ? 0:1));

			

// Kernel Call on Device

computeDistanceArray<<<dimGrid, dimBlock, sharedMemSize>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numPts);

blockSize is defined as 256. Mind you, this code I’m messing with is a couple of years old, and I’m not even sure what’s best regarding the dimBlock and dimGrid sizes. Should they be 1D, 2D, etc., and what is the correct block size? Is this hardware dependent, so that knowing my card is a GTX 460 might help? I dunno. But those questions are for after I first get the basics working anyway. I remember years ago how 256 threads per block was somehow “ideal”, but I forget the argument. Regardless, perhaps this has changed and I need to go consult the programming guide and newer, updated examples.

Thanks.

Edit

sharedMemSize is defined in main as: int sharedMemSize = blockSize*sizeof(int);

thread_id is the grid thread index in the x dimension. That will, for anything other than a trivially small grid size, be much larger than the dimensions of the shared memory allocation for the block. Overflow will result. Fermi cards can detect this and will abort the kernel. Older cards won’t.
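To put rough numbers on that, using the launch configuration you posted (blockSize = 256, so 16x16 blocks, numPts = 1000, and sharedMemSize = blockSize*sizeof(int) = 256 ints):

  • dimGrid.x = 1000/16 + 1 = 63, so thread_id = blockIdx.x*16 + threadIdx.x can reach 63*16 - 1 = 1007;

  • the dynamic shared allocation is only 256 ints, split as newX = coords[0..127] and newY = coords[128..255];

  • so newX[thread_id] for thread_id >= 128 tramples the newY region, and newY[thread_id] for thread_id >= 128 writes past the end of the allocation entirely.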

If you don’t mind, what do I then change that line of code to?

Here’s my original, and even more simple, kernel:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numPts) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numPts + j;

	

	int xDiff = x_d[i] - x_d[j];

	int yDiff = y_d[i] - y_d[j];

	

	__syncthreads();

	

	if (i < numPts && j < numPts)

		distance_d[idx] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

}

So after playing with that and reading the Dr. Dobbs tutorials, I wanted to make an attempt at using shared memory. Mind you, all of this is slow for a variety of reasons…I’m learning.

So I call the kernel with the sharedMemSize value mentioned in a previous post, and then I edited my kernel into the one shown at the top of this thread. So what logical part was wrong in that edit?

Lastly, my GTX 460 is a Fermi card, and it does not return an error…just gives all zeroes as mentioned for the distance array.

Thanks.

I can’t tell you what to change it to, because I can’t fathom how you are actually envisaging using the shared memory in the “new” version of the kernel. The naïve modification would be something like this (note your dynamically declared shared memory size is also incorrect):

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numPts) {

    int i = blockIdx.x*blockDim.x + threadIdx.x;

    int j = blockIdx.y*blockDim.y + threadIdx.y;

    int idx = i*numPts + j;

    extern __shared__ int coords[]; // this should be 2 * number of threads

    int *newX, *newY;

    newX = (int*)coords;

    newY = (int*)&newX[256]; // Number of threads here

    int thread_id = blockIdx.x*blockDim.x + threadIdx.x;

    int pos = threadIdx.x + threadIdx.y * blockDim.x;

    if ( thread_id < numPts) {

        newX[pos] = x_d[thread_id];

        newY[pos] = y_d[thread_id];

    }

    __syncthreads();

    if (i < numPts && j < numPts) {

        int xDiff = newX[pos] - newX[pos];

        int yDiff = newY[pos] - newY[pos];

        distance_d[idx] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

    }

}

but that makes little sense. It will be slower than your original version. Shared memory makes sense when there needs to be data re-use or sharing between threads within the same block, or (perhaps) when thread memory transactions can be structured across a warp in such a way that they will coalesce when they otherwise would not. The latter usage scenario was important in the compute 1.0/1.1 era when coalescing rules were rather inflexible. Today on Fermi with L1 and L2 cache, there is almost no need. So the real question is, what are you hoping to achieve through the use of shared memory?
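As an aside, if you do launch that naïve version, the dynamic shared memory size passed from the host has to match the comment in the code above. A sketch, assuming the 16x16 block launch you posted (i.e. 256 threads per block):

int threadsPerBlock = dimBlock.x * dimBlock.y;           // 16 * 16 = 256

int sharedMemSize = 2 * threadsPerBlock * sizeof(int);   // 2 ints (one x, one y) per thread in the block

computeDistanceArray<<<dimGrid, dimBlock, sharedMemSize>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numPts);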

On the subject of error messages, I am sure there are errors produced by trying to run your kernel, but you just aren’t checking for them correctly.
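A minimal sketch of the sort of checking I mean (wrap it however you like; the printf lines are just illustrative):

// immediately after the kernel launch:

cudaError_t err = cudaGetLastError();        // catches launch / configuration errors

if (err != cudaSuccess)

    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();               // catches errors raised while the kernel ran

if (err != cudaSuccess)                      // (cudaDeviceSynchronize in newer toolkits)

    printf("kernel error: %s\n", cudaGetErrorString(err));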

Thank you for the reply.

From your knowledge, other than the current programming guide, is there any resource (free OR otherwise) that not only goes over CUDA but also includes explanations and tutorials for the NEW FERMI architecture? That’s where I clearly need to spend some time.

Regarding your post and questions, I’m new at this and was playing around with a small kernel to calculate distances between all points. Figured that would be a good starter program. Perhaps not! I’ve gone through some of the previously famous Dr. Dobbs tutorials…but those are back from 2008. So based on those, my intention with shared memory was exactly as you suspected…it was based on my previous, albeit limited, knowledge of 1.0 and 1.1. So really, in summary, I need to get a good resource and go study…just don’t know what or where that resource is.

Finally, I looked at your code and don’t get a few things, but that is okay. However, one thing that jumped out at me was the following:

if (i < numPts && j < numPts) {

    int xDiff = newX[pos] - newX[pos];

    int yDiff = newY[pos] - newY[pos];

    distance_d[idx] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

}

Wouldn’t xDiff and yDiff ALWAYS end up being 0 since you are performing the following subtraction: newX[pos] - newX[pos] ?

Thanks.

Let me add that I just purchased the Cuda by Example book.

But is that book even valid, with little talk of the Fermi architecture? I understand that so much has changed with the advent of Fermi. Even my 3.2 programming guide talks about the need for shared memory. But you suggested that on 2.0 or better GPUs, shared memory isn’t nearly as necessary as it used to be.

-Jonathan

Yes, and that was deliberate. It isn’t my job to correct your algorithm for you - as I said in the previous post, the shared memory usage in that kernel is completely nonsensical as it was written, so it is a bit hard to know how to fix it without a clear indication of what the real intention of the shared memory use was supposed to be in the first place. I can see ways to do what you want using a shared memory buffer (and perhaps more efficiently than what you started with), but everything from the way the kernel is launched downwards would have to be considerably restructured to do so.

My intention: well, with my clearly limited knowledge and the humble reading I’ve done, the intention was to make things “faster”. Whenever possible, I try to work on my own instead of running to forums and asking questions without first making an attempt myself. As such, I tried to follow along with the now-outdated Dr. Dobbs example, where he did a “basic” kernel call and then a “shared memory” version of the same kernel:

From what you are saying, I CLEARLY don’t get it, thus the reason for starting this thread.

No problem, and I WANT TO LEARN! My challenge is finding a lower-level resource to assist with this, as the programming guide isn’t always the easiest to follow. As mentioned, I just recently purchased CUDA by Example, but even that book (if I’m correct) doesn’t talk about Fermi at all, since it was published in mid-2010 (and therefore written well before that).

Mind you, forum members, if able and willing, help as you deem appropriate. I’m merely here looking for guidance, even if that guidance is a great resource that “teaches me how to fish”. I’m not looking for free fish; rather, I’m looking to understand this stuff. Regardless, this is just for fun, as I’m trying to learn the basics of CUDA.

Back to my kernel: I’m just trying to get a simple, albeit fast, way of calculating all the Euclidean distances. From the Dr. Dobbs example, I understood that using shared memory helps with this. But I didn’t grasp how much shared memory should be used. As such, I just set it to 256*sizeof(int), based on other examples I reviewed. Again, I’m posting here for guidance. If someone is willing to spend time to help clarify and explain things to small people like me, GREAT. If someone has a great (and easy to follow) resource, EVEN BETTER!!

Thanks.

Your original kernel computes Euclidean distances between each element of a pair of vectors. That sort of algorithm is very much like a vector outer product - 2N inputs produce N^2 outputs. The original version launches N^2 threads, which do 2(N^2) global memory reads. That is a long way from the lower bound of 2N global memory reads, and that is where performance improvements might be made in CUDA, given the expense of global memory transactions. So the question you should probably be asking yourself is how you can change the algorithm to reduce the number of global memory transactions required to perform the computation, rather than what sort of magic fairy dust can be sprinkled over the code you have in order to miraculously make it go faster.
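To put numbers on that: for N = 10000 points, that is 2N^2 = 2 x 10^8 global reads against the 2N = 2 x 10^4 lower bound - a factor of N more memory traffic than the data itself requires.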

avidday,

THANK YOU! Seriously, you called me out. I claimed to be doing this for fun, which I am, but as you could see from my code, there clearly was little thought involved. To be honest, your wording of “rather than what sort of magic fairy dust can be sprinkled over the code you have in order to miraculously make it go faster” WAS SPOT ON!

So, on a serious note, I appreciate the wake up call.

I AM trying to take this seriously. I purchased “CUDA by Example” and am currently up to page 69. I have read through the examples, IN DETAIL, trying to make sure I get what’s going on. Looking back, I had little knowledge of blockDim, gridDim, and other such mechanisms, and I had no business trying to throw together an algorithm without proper time investment. In addition to the book, I’ve read up on several examples shown here, with the following post being AWESOME:

I am NOT up to chapter 5.3 yet, which goes over shared memory, so I will not try to use it in my algorithm just yet. For now, I’m trying to take what knowledge I’ve gained and see if I can come up with a better algorithm, albeit perhaps still a naive one.

Looking at the algorithm presented, it was downright stupid. Given N points (via an x array of N values and a y array of N values), I wanted to fill out a 2D array of size NxN with the Euclidean distances between these points. Firstly, it was hodge-podged together by looking at various examples, trying to glue things together, and AS YOU SAID, hoping to sprinkle fairy dust on the code and have it work! (Idiotic is an understatement.) Additionally, until now, I was launching N^2 threads, where each thread solved one cell of this matrix. Unless you correct me, that seems quite asinine! If N were 10,000, I was launching 100 MILLION threads. From what I read, GPUs reach peak efficiency around 20,000 or so threads.

In a previous post, someone said that I need to launch N threads and then have each thread solve one ROW of this 2D distance matrix.

So, if you, or anyone else, can humor me, I have tried to actually use my brain here. I wrote a new, still simple kernel. However, I did NOT reference ANY examples for this. I simply asked myself, “how many threads do I need?” Well, I want N threads, where each calculates one row of the distance matrix. So instead of trying to be fancy-shmancy, using 2D blocks and a 2D grid (the advantage of which I don’t yet see anyway), I just launch enough 1D blocks, of 128 threads each (also in 1D), to make sure we have N threads executing.

Here’s my Execution Configuration:

// blockSize is defined as 128

dim3 dimBlock(blockSize);

dim3 dimGrid(numPts/dimBlock.x + (numPts%dimBlock.x == 0 ? 0:1));

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice);

And here’s my Kernel:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d) {

	int Tidx = blockIdx.x*blockDim.x + threadIdx.x;

	int i;

	int xDiff;

	int yDiff;

	while (i < numPts) {

		xDiff = x_d[Tidx] - x_d[i];

		yDiff = y_d[Tidx] - y_d[i];

		if (Tidx < numPts)

			distance_d[Tidx*numPts + i] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

		i++;

	}

}

Again, I am not referencing any example and am actually trying to think about this. So I launch N threads, and EACH thread will run this kernel code. I know my calculation of Tidx is correct; so I feel good about that. Yeah yeah, maybe cheesy. But nice to know I’m slowly gaining a small understanding.

So my logic was, “okay, I need Thread #7, for example, to evaluate the 8th row of the distance array.” So I use a while loop to iterate over the width of that array (which is numPts long, or N long). I like how the “CUDA by Example” book breaks down the idea that each thread executes the kernel. So with that said, here’s how I see thread #7, for example, executing this kernel:

while (i < numPts) {

    xDiff = x_d[7] - x_d[i];

    yDiff = y_d[7] - y_d[i];

    if (7 < numPts)  // doesn't perform calculations for the extra threads, when launched threads are not a multiple of blockSize

        distance_d[7*numPts + i] = (int)sqrt((double)((xDiff*xDiff) + (yDiff*yDiff)));

    i++;

}

So in my humble mind, this should be okay.

And guess what, it’s finally working! NO FAIRY DUST NEEDED!

Unfortunately, it is SLOWER than the CPU, which brings me to why I’m posting.

Is there a logical reason why this is slower? If the “answer” lies in the next chapter or so of the “CUDA by Example” book, then just reply telling me to keep on reading, and I will (and still am, regardless).

Just wanted to post an update and see if one of you experts can tell me if I’m on the right track (if only slightly so!).

Thanks.

Consider the following three kernels:

__device__ __host__ int distcalc(const int x0, const int y0, const int x1, const int y1)

{

    int xDiff = x0 - x1;

    int yDiff = y0 - y1;

    return (int)sqrtf((float)((xDiff*xDiff) + (yDiff*yDiff)));

}

__global__ void computeDistanceArray0(int *x_d, int *y_d, int *distance_d, int numPts, size_t lda)

{

    int i = blockIdx.x*blockDim.x + threadIdx.x;

    int j = blockIdx.y*blockDim.y + threadIdx.y;

    int idx = i*lda + j;

    if ( (i < numPts ) && (j < numPts) )

        distance_d[idx] = distcalc(x_d[i],y_d[i],x_d[j],y_d[j]);

}

__global__ void computeDistanceArray1(int *x_d, int *y_d, int *distance_d, int numPts, size_t lda)

{

    int Tidx = blockIdx.x*blockDim.x + threadIdx.x;

    int i = 0;

    while (i < numPts) {

        if (Tidx < numPts)

            distance_d[Tidx*lda + i] = distcalc(x_d[Tidx],y_d[Tidx],x_d[i],y_d[i]);

        i++;

    }

}

template<int bsize>

__global__ void computeDistanceArray2(int *x_d, int *y_d, int *distance_d, const int n1, const int n2, const size_t lda)

{

    __shared__ volatile int xl_d[bsize], yl_d[bsize];

    int Tidx = blockIdx.x*blockDim.x + threadIdx.x;

    int x_d0, y_d0;

    if (Tidx < n2) {

        x_d0 = x_d[Tidx];

        y_d0 = y_d[Tidx];

    }

    int * opos = distance_d + Tidx;

    for(int i=0; i<n1; i+=bsize) {

        xl_d[threadIdx.x] = x_d[Tidx+i];

        yl_d[threadIdx.x] = y_d[Tidx+i];

        __syncthreads();

#pragma unroll

        for(int j=0; (j<bsize) && (Tidx<n2); j++, opos+=lda) {

            * opos = distcalc(x_d0, y_d0, xl_d[j], yl_d[j]);

        }

    }

    for(int i=n1; (i<n2) && (Tidx<n2); i++, opos+=lda) {

        * opos = distcalc(x_d0, y_d0, x_d[i], y_d[i]);

    }

}

The first is what you started with, although with the distance calculations done in single precision. The second is a sane version of what you are trying to do in the last post (there are a number of problems in that code as you posted it - for instance, i is never initialized and numPts isn’t actually a parameter of the kernel). The third is a version which uses shared memory. When I compile and run them on a GTX 470 with CUDA 3.2, I get this:

~$ nvcc -arch=sm_20 -Xcompiler="-Wall -O2" -Xptxas="-v" -use_fast_math perkele.cu -o perkele

ptxas info    : Compiling entry function '_Z21computeDistanceArray2ILi128EEvPiS0_S0_iim' for 'sm_20'

ptxas info    : Used 22 registers, 1024+0 bytes smem, 72 bytes cmem[0], 4 bytes cmem[16]

ptxas info    : Compiling entry function '_Z21computeDistanceArray1PiS_S_im' for 'sm_20'

ptxas info    : Used 13 registers, 72 bytes cmem[0]

ptxas info    : Compiling entry function '_Z21computeDistanceArray0PiS_S_im' for 'sm_20'

ptxas info    : Used 12 registers, 72 bytes cmem[0]

david@cuda:~$ ./perkele 

serial cpu time = 0.803936

kernel 0 time = 0.016454

kernel 1 time = 0.082026

kernel 2 time = 0.004173

and on a GTX 275 this:

~$ nvcc -arch=sm_13 -Xcompiler="-Wall -O2" -Xptxas="-v" perkele.cu -o perkele

ptxas info    : Compiling entry function '_Z21computeDistanceArray2ILi128EEvPiS0_S0_iim' for 'sm_13'

ptxas info    : Used 12 registers, 1064+16 bytes smem, 4 bytes cmem[1]

ptxas info    : Compiling entry function '_Z21computeDistanceArray1PiS_S_im' for 'sm_13'

ptxas info    : Used 8 registers, 40+16 bytes smem

ptxas info    : Compiling entry function '_Z21computeDistanceArray0PiS_S_im' for 'sm_13'

ptxas info    : Used 8 registers, 40+16 bytes smem, 4 bytes cmem[1]

david@cuda:~$ ./perkele 

serial cpu time = 0.809125

kernel 0 time = 0.095100

kernel 1 time = 0.136079

kernel 2 time = 0.006903

Each run uses a 10000-element random vector containing values between 0 and 32679.

The conclusions you should arrive at are: the shared memory version is faster, but the improvement is a lot more noticeable on the GT200 than on Fermi, because the L1 and L2 caches make your original version perform much better than it would otherwise. I will leave you to analyse what the codes actually do.
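The timing harness isn’t shown here; a typical cudaEvent-based sketch (launch parameters illustrative) would be:

cudaEvent_t start, stop;

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start, 0);

computeDistanceArray1<<<griddim, blockdim>>>(x_d, y_d, d_d, numPts, lda);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

float ms = 0.0f;

cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds

printf("kernel 1 time = %f\n", ms / 1000.0f); // reported above in seconds

cudaEventDestroy(start);

cudaEventDestroy(stop);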

Detailed, comprehensive reply. Thank you!

From my original post, the fact that I’m a beginner is clear. As such, before I can begin to look at the third kernel you presented, I wanted to understand the changes made to the first two.

Here’s the first Kernel you presented:

I immediately noticed the argument “size_t lda”. Of course, as a beginner, this was meaningless to me. However, as mentioned, I don’t like to waste others’ time asking questions if I can search and find answers. This was not an easy one, but I gathered the following:

  • lda is referring to “pitch”, which until now, I was completely unaware of.

  • Therefore, with pitch being an argument of this kernel, the assumption is that cudaMallocPitch would be used to allocate space instead of cudaMalloc

Is that correct?

As mentioned, I’ve never dealt with array pitch previously, and spent about an hour reading. Along with many other sites and explanations, I found the following comment on another thread here:

Please confirm if I understand this correctly, as I’ve never used cudaMallocPitch and have always simply used cudaMalloc. When using cudaMalloc, I can specify an exact row size, and that EXACT row size will indeed be malloced. The problem being, of course, that this row size may not provide the best memory alignment? So the solution is to use cudaMallocPitch? This allows CUDA to over-allocate, effectively padding the rows as needed, with the intention of making the code run faster? Just want to confirm the use of cudaMallocPitch.

Assuming this was your intention in changing the parameter list and including “size_t lda”, and assuming that I understand the purpose of lda (pitch) and cudaMallocPitch, then, absolutely, from what I read, this makes sense. Additionally, the Programming Guide describes cudaMallocPitch as being used with 2D and 3D allocations. So I’m also assuming the intention was to use cudaMallocPitch only for the 2D distance array and not for the 1D x_array and y_array.
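If I have this right, the pitched allocation for the 2D distance array would look something like the sketch below (just my reading of the docs; the variable names are made up):

size_t pitchBytes;   // set by the runtime, >= numPts * sizeof(int)

int *distance_d;

cudaMallocPitch((void **)&distance_d, &pitchBytes, numPts * sizeof(int), numPts);

// row i of the matrix then starts at:

int *row_i = (int *)((char *)distance_d + i * pitchBytes);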

Finally, once I FULLY get the first two kernels (mainly your confirmation that lda was the pitch and then my understanding of pitch and cudaMallocPitch), THEN I will try to make sense of the 3rd kernel you wrote.

As a head start, can anyone provide insight into the following line:

template<int bsize>

I’ve googled that to no avail.

Thanks.

Yes, lda is a pitch, but it can easily be set to the number of points if you want, and the whole question of padding ignored. Having an appropriate pitch just ensures that memory operations fall cleanly on 32-word boundaries, which helps memory performance, although to be honest the fact that I always include one is more of a hangover from working with Fortran multidimensional arrays than anything else. You don’t need to use cudaMallocPitch; I just used cudaMalloc:

    const int numPts = 10000;

    const size_t insize = size_t(numPts) * sizeof(int);

    const size_t psize = 32;

    const size_t lda = psize * ((numPts/psize) + (((numPts%psize) > 0) ? 1 : 0));

    const size_t outsize = size_t(lda) * size_t(numPts) * sizeof(int);

    // Device memory setup

    int * x_d, * y_d, * d_d;

    errchk( cudaMalloc((void **)&x_d, insize) );

    errchk( cudaMalloc((void **)&y_d, insize) );

    errchk( cudaMalloc((void **)&d_d, outsize) );

for your code.
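The errchk / cudaerrchk helpers aren’t reproduced above; a minimal sketch of that style of wrapper (needs <cstdio> and <cstdlib>) would be something like:

static int cudaerrchk(cudaError_t err, const char *file, const int line)

{

    if (err != cudaSuccess) {

        fprintf(stderr, "CUDA error at %s:%d : %s\n", file, line, cudaGetErrorString(err));

        exit(EXIT_FAILURE);

    }

    return (int)err;

}

#define errchk(call) cudaerrchk((call), __FILE__, __LINE__)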

The template specifier is a standard C++ language feature - templating allows types and constants to be specified to the compiler during compilation. The calling code of the templated kernel I used for the benchmark looks like this:

int cDA2(int * x_d, int *y_d, int * distance_d, int numPts, size_t lda)

{

    const int bsize = 128;

    int bcount = (numPts / bsize);

    int n1 = bcount * bsize;

    int n2 = numPts;

    bcount += (((numPts % bsize) > 0) ? 1 : 0);

    dim3 griddim(bcount,1);

    dim3 blockdim(bsize,1,1);

    computeDistanceArray2<bsize><<<griddim,blockdim>>>(x_d, y_d, distance_d, n1, n2, lda);

    return cudaerrchk(cudaPeekAtLastError(), __FILE__, __LINE__);

}

In this case the template parameter specifies the value of bsize to the compiler. A preprocessor symbol could have been used instead; however, templating allows multiple versions of the same code to be compiled with different types or constants and selected at run time as required. It is also something you can safely ignore if you want.
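For example (a sketch, not taken from the benchmark code), selecting between a few compiled block sizes at run time might look like:

// bsize is an ordinary runtime int here; each case is a separately compiled instantiation

switch (bsize) {

    case  64: computeDistanceArray2< 64><<<griddim, blockdim>>>(x_d, y_d, distance_d, n1, n2, lda); break;

    case 128: computeDistanceArray2<128><<<griddim, blockdim>>>(x_d, y_d, distance_d, n1, n2, lda); break;

    case 256: computeDistanceArray2<256><<<griddim, blockdim>>>(x_d, y_d, distance_d, n1, n2, lda); break;

}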