Modifying simulator to work with CUDA?

You are correct. I intentionally did that, although I obviously wasn’t aware of how it would actually work. In a previous post in this thread, I explained my logic:

So at least you see what the logic was. I was assuming that, although variables i and j are created from the same line of code, the various i’s and j’s would not end up being the same, and that the threads would compute all possible combinations of those i’s and j’s. Guess that was wrong. But at least you see the main point: the X’s and Y’s need to be paired up.

But what is the solution? I need to make an i variable as well as a j variable inside the kernel. Is the only way to do that to have i reference one dimension and j reference another dimension, like the following:

int i = blockIdx.x*blockDim.x + threadIdx.x;

int j = blockIdx.y*blockDim.y + threadIdx.y;

But this would mean I’d have to make my values in the Execution Configuration different…I’d have to make the blockSize and gridSize settings two-dimensional.

Would this work?

//*** Execution Configuration ***//

dim3 dimBlock(blockSize/16, blockSize/16);  // thus, still having 256 threads per block

dim3 dimGrid(numObjects/dimBlock.x + (numObjects%dimBlock.x == 0 ? 0:1), numObjects/dimBlock.y + (numObjects%dimBlock.y == 0 ? 0:1));

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

// ******* KERNEL *******//

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i + j*numObjects;

	if (i < numObjects && j < numObjects)

		d[idx] = (int) sqrt(((x[i] - x[j])^2) + ((y[i] - y[j])^2));

}

Also, why are your x, y, and d arrays of type int? I don’t know what your inputs look like, but they may be getting rounded down to zero during the int conversion (or your output ‘d’ is zero for a similar reason). Why not try float?

Well, that didn’t work either. Still all zeroes. So my little head is stuck here. I’m using different variables now. These different variables (i and j) are set differently (i from the 1st dimension and j from the 2nd). But it still apparently ends up treating i and j as having the same values.

Ideas?

The main reason was that int was all that was needed. The coordinates are integer values…and many of them are large values (e.g. xArray[462] could be 9751). The values in the x and y arrays are not zero…I’m test-printing them before sending them over.

So the d array isn’t zero for that reason either.

I’m really thinking the problem is what the other poster mentioned. He said the problem was that, in the kernel, I was setting i and j up with the same line of code:

int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.x*blockDim.x + threadIdx.x;

He said this was basically calculating distance = (int)sqrt((a-a)^2 + (b-b)^2)

So I changed that to this:

int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

It is still all zeroes…I’m not really seeing the solution.

:huh:

Good morning,

the new block and grid config looks fine to me.

But another thing that just came to my mind is the usage of ^2. When talking, we use it as a synonym for "square", but CUDA (and C) does a bitwise XOR if I’m not mistaken.
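Just to illustrate the operator itself (a trivial host-side example, nothing to do with your kernel):

int a = 3;
int b = a ^ 2;   // bitwise XOR: 3 ^ 2 == 1, not 9
int c = a * a;   // the actual square: 9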

So my guess would be to build your kernel somewhat like:

// ******* KERNEL *******//

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i + j*numObjects;

	if (i < numObjects && j < numObjects)

	{

		int xDiff = x[i] - x[j];

		int yDiff = y[i] - y[j];

		d[idx] = (int) sqrt((xDiff * xDiff) + (yDiff * yDiff));

	}

}

Maybe the sqrt of possibly negative values caused some problems.

Otherwise I have no idea what could lead to the zero results.

WOW, that looks like a biggee! I changed it and did one small test, and things are behaving significantly differently. I can’t believe I was actually using that in my code!

Does it work properly? I don’t want to jump the gun and say yes. But this made a HUGE move in the right direction. I have to run out for now and will do more thorough testing upon returning later.

Maybe in the meantime, someone can confirm this. Also, I wonder whether the i’s and j’s really do need to be set up on separate lines of code…

Hi MisterAnderson42,

I am also waiting for your answer.

Thanks.

An answer to what question? The text you quoted mentions resources I linked to previously in this thread. They have all the answers.

Hi,

First of all, thanks for the quick reply.

Let me correct myself. I want to know:

  • How to handle two for loops in CUDA efficiently for large arrays?

    I have read the examples that are in books, e.g.:

__global__ void add_matrix( float* a, float* b, float* c, int N ) {

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	int j = blockIdx.y * blockDim.y + threadIdx.y;

	int index = i + j*N;

	if ( i < N && j < N )

		c[index] = a[index] + b[index];

}

but I cannot write this for some more complex kinds of problems (various if/else conditions within two nested for loops, or calling device functions within nested loops).

  • My second question is: how do I write the execution configuration for a kernel launch?

Thanks.

Your question is off-topic a little bit and hijacking this thread. You should create a new one if you have a specific question.

In the context of this thread, the fastest O(N^2) way is to run one thread per column and have a loop in the kernel going down the “rows” (cached in shared memory) to identify neighbors. There is of course an O(N) algorithm, but that is way too complicated to discuss in the forums. Read my paper on the topic.

In the more general case, the solution entirely depends on what your algorithm is and what you are calculating. I.e. break up a large array into a 2D block because 1D isn’t big enough, run one block per part and do parallel reductions/scans to collect data, etc., etc.… there are dozens of possibilities.

Umm, you choose a grid and thread block configuration that are appropriate for the assignment of threads and blocks in your parallel algorithm and the number of elements in the data. There is no “one right answer”.
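Purely as a generic illustration (the numbers are arbitrary, and myKernel/d_data are just placeholders, not anything from this thread), a very common 1D pattern is:

int blockSize = 256;                              // threads per block, tune for your kernel
int gridSize  = (N + blockSize - 1) / blockSize;  // enough blocks to cover all N elements
myKernel<<<gridSize, blockSize>>>(d_data, N);     // each thread handles one element and checks its index against N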

First of all, I want to thank you MisterAnderson42 for helping me on this problem. I’m new to CUDA and these forums have been very helpful.

I’m getting back to this problem after a break of a couple of weeks. My GPU-extended simulator initially seemed to have a 3x-5x speedup over the CPU simulation. And from a GPU perspective, that really isn’t something to get excited about. However, I figured it could be sped up more with a more efficient algorithm in the kernel, learning how to use shared memory, etc.

But, now that I come back to this two weeks later and look at my code, I am shocked that I made the same silly mistake, in the CPU version of the simulator, of trying to “square” values in my distance formula by typing ^2 in my code. HAH! That’s what happens when you first type pseudocode and then quickly throw it all together.

So anyway, upon correcting that and actually performing proper “squares”, I come to find that my GPU simulator actually runs SLOWER than its CPU counterpart…SLOWER. Not a ton slower…but nevertheless, slower.

Could you please take a look at my Kernel and tell me if you see a reason why this multithreaded GPU version would be slower than the CPU version:

//Here's the Execution Configuration:

dim3 dimBlock(blockSize/16, blockSize/16);

dim3 dimGrid(numObjects/dimBlock.x + (numObjects%dimBlock.x == 0 ? 0:1), numObjects/dimBlock.y + (numObjects%dimBlock.y == 0 ? 0:1));

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

//And here's the Kernel:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numObjects + j;

	int xDiff = x_d[i] - x_d[j];

	int yDiff = y_d[i] - y_d[j];

	if (i < numObjects && j < numObjects)

		distance_d[idx] = (int)sqrt((double)(xDiff*xDiff) + (yDiff*yDiff));

}

I may have just answered my own question…to some extent. On the CPU simulator, I have a function that queries all objects. Basically it’s an O(N^2) function with two for loops that go through all the objects and calculate the distance between each and every pair of objects. IMMEDIATELY upon calculating the distance, if the distance is less than the query range, object B is added to the neighbor list of object A.

Now, for the GPU-based simulator, I calculate the distances on the GPU…so we avoid the two for loops there and instead, ideally, use a bit of the parallel processing capability of the GPU. But then, AFTER all the distances are calculated, I return to the CPU-based program and have two for loops that run through this newly created distance array. Just like on the CPU simulator, if the distance is less than the query range, object B is added to the neighbor list of object A.
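That post-kernel pass on the host looks roughly like this (paraphrased sketch; addNeighbor and queryRange are just stand-ins for what I actually have):

// after copying distanceArray back from the device
for (int a = 0; a < numObjects; a++)
	for (int b = 0; b < numObjects; b++)
		if (a != b && distanceArray[a*numObjects + b] < queryRange)
			addNeighbor(a, b);   // add object b to object a's neighbor list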

So, in answering my own question, perhaps the GPU simulator is not faster because EITHER way I’m running O(N^2) for loops. On the CPU simulator, this loop (1) calculates the distance, (2) compares it to the query radius, and (3) then adds a neighbor if necessary. On the GPU simulator, the for loops do the last two steps, since the kernel does the distance calculation.

So if that makes sense as to why it is slower, then I’m really hurting myself here. Basically, why use the GPU if I’m going to come back and run O(N^2) for loops over everything anyway? Is that right?

So, what I then did, just for testing purposes, was to just IGNORE the for loops on the GPU simulator. I just commented them out. Upon getting the distance array back from the GPU, I do nothing with it at all. So of course, the simulator is useless…but the point is to see if it is much faster.

Without having to run through those for loops, the GPU was faster…but not by much…maybe a whopping 25% or so.

So my first question is still valid. And that is: is there something I’m doing wrong with my kernel that is keeping it from taking proper advantage of the inherent parallelism of the GPU?

If you have the time, I would really appreciate if you could help.

Thanks again.

EDITing the post now. It’s about 20 minutes later, and perhaps I’m still answering my own question.

Right now, on my little example, the CPU simulator runs at, let’s say, around 40 seconds. And the GPU version maybe does 30 seconds. However, the number of objects is only 10,000. As a result, the CPU is actually able to run through this pretty quickly (it takes about 40 seconds for 20 iterations of moving the objects and then calculating the distances, etc.).

Since the time here (30-40 seconds) is so small, could the reason I don’t see a speedup be that I’m not giving the GPU a chance to really show what it can do? That’s not really clear, so let me restate it: there is a COST associated with using the GPU, and that cost is the transferring of this distance array back and forth between the host and the device on each iteration. IDEALLY, this cost is offset by a MASSIVE speedup of the computation on the GPU. BUT in my case, since the number of mobile objects is small (only 10,000), the GPU really isn’t having to work that hard, and as a result, the COST of communication isn’t really getting offset. I think that’s a little bit clearer now.
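One way I could check where the time actually goes is to time the kernel and the copy separately with CUDA events, something like this (untested sketch, reusing my existing variable names; distanceArray is my host-side copy):

cudaEvent_t start, stop;
float kernelMs, copyMs;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time the kernel on its own
cudaEventRecord(start, 0);
computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&kernelMs, start, stop);

// time the copy back to the host on its own
cudaEventRecord(start, 0);
cudaMemcpy(distanceArray, distanceArray_onDevice, numObjects*numObjects*sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&copyMs, start, stop);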

So if this makes sense, I would LOVE to simply increase the number of mobile objects to 100,000, which takes the CPU WAAAAAAAAAY too long for even one iteration…I won’t wait, it takes that long. That would be a PERFECT number of objects to test on the GPU. However, I can’t create a distance array that large on the device. The device has a space limitation:

int *distanceArray;

int size_d = numObjects*numObjects*sizeof(int);

// This allocation is possible on host even with numObjects = 100,000

distanceArray = (int*)malloc(size_d);

int *distanceArray_onDevice;

// But this allocation fails on the device

cudaMalloc((void**)&distanceArray_onDevice, size_d);
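Incidentally, checking the return value makes the failure explicit; and I realize that for 100,000 objects the byte count doesn’t even fit in an int, so the size would have to be a size_t (small sketch):

size_t bigSize = (size_t)numObjects * numObjects * sizeof(int);   // ~37 GiB for 100,000 objects
cudaError_t err = cudaMalloc((void**)&distanceArray_onDevice, bigSize);
if (err != cudaSuccess)
	printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));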

So if you agree that I could indeed be on to something here…meaning that my GPU code may not be that slow after all, I’m just not giving it a chance to be tested properly…then could you please advise how one goes about having that many objects and getting their distances on the GPU.

Thanks.

I was looking back at this response you gave a few weeks ago.

First of all, the following makes the assumption that you agree the GPU only seems slower because I’m not performing enough of a computation on it to see the benefit.

So if that is the case, and if I just need to up my number of objects to 100,000 or even more, then the question is how I do that.

Previously (quoted above) you stressed not storing the distance values. So if I have 100,000 objects, for example, you say I should launch 100,000 threads, and then in the kernel, I guess there would be a for loop. This for loop would go through all the objects, allowing each thread to calculate the distance between its own object and all the other objects. And then, INSIDE the for loop, if the newly calculated distance is less than some value, I would add that object to the neighbor list. So this means I need to allocate space on the GPU for a neighbor list?

Is that correct?
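Something like this is what I’m picturing, as a rough, untested sketch (buildNeighborList, neighborList, neighborCount, maxNeighbors, and queryRangeSq are made-up names for things I’d still have to allocate and define):

__global__ void buildNeighborList (int *x_d, int *y_d, int *neighborList, int *neighborCount, int numObjects, int maxNeighbors, int queryRangeSq) {
	int i = blockIdx.x*blockDim.x + threadIdx.x;   // one thread per object
	if (i >= numObjects) return;
	int count = 0;
	for (int j = 0; j < numObjects; j++) {
		int xDiff = x_d[i] - x_d[j];
		int yDiff = y_d[i] - y_d[j];
		// compare squared distances so no sqrt is needed
		if (j != i && (xDiff*xDiff + yDiff*yDiff) < queryRangeSq && count < maxNeighbors)
			neighborList[i*maxNeighbors + count++] = j;
	}
	neighborCount[i] = count;
}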

Could anyone help me out with the Kernel on this one. I’m a bit confused on how to go about it.

Thanks.

100,000 objects shouldn’t be needed. In fact, I don’t know how you would begin to run that many objects, since storing 100,000^2 distances would require 37 GiB of memory… Anyways, GPUs start to hit peak efficiency with only 20,000 running threads.

Here is the kernel from HOOMD that I mentioned you should look at for the O(N^2) neighbor list. (I’ve stripped it down some to make it easier to read.)

/*

	each thread is to compute the neighborlist for a single particle i

	each block will load a bunch of particles into shared mem and then each thread will compare its particle

	to each particle in shmem to see if they are a neighbor. Since all threads in the block access the same 

	shmem element at the same time, the value is broadcast and there are no bank conflicts

*/

extern "C" __global__ void gpu_compute_nlist_nsq_kernel(gpu_nlist_array nlist, gpu_pdata_arrays pdata, gpu_boxsize box, float r_maxsq)

	{

	// shared data to store all of the particles we compare against

	__shared__ float sdata[NLIST_BLOCK_SIZE*4];

	// load in the particle

	int pidx = blockIdx.x * NLIST_BLOCK_SIZE + threadIdx.x;

	float4 pos = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

	if (pidx < pdata.N)

		pos = pdata.pos[pidx];

	float px = pos.x;	float py = pos.y;

	float pz = pos.z;

	// track the number of neighbors added so far

	int n_neigh = 0;

	// each block is going to loop over all N particles (this assumes memory is padded to a multiple of blockDim.x)

	// in blocks of blockDim.x

	for (int start = 0; start < pdata.N; start += NLIST_BLOCK_SIZE)

		{

		// load data

		float4 neigh_pos = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

		if (start + threadIdx.x < pdata.N)

			neigh_pos = pdata.pos[start + threadIdx.x];

		// make sure everybody is caught up before we stomp on the memory

		__syncthreads();

		sdata[threadIdx.x] = neigh_pos.x;

		sdata[threadIdx.x + NLIST_BLOCK_SIZE] = neigh_pos.y;

		sdata[threadIdx.x + 2*NLIST_BLOCK_SIZE] = neigh_pos.z;

		sdata[threadIdx.x + 3*NLIST_BLOCK_SIZE] = neigh_pos.w; //< unused, but try to get compiler to fully coalesce reads

	   // ensure all data is loaded

		__syncthreads();

		// now each thread loops over every particle in shmem, but doesn't loop past the end of the particle list (since

		// the block might extend that far)

		int end_offset= NLIST_BLOCK_SIZE;

		end_offset = min(end_offset, pdata.N - start);

		if (pidx < pdata.N)

			{

			for (int cur_offset = 0; cur_offset < end_offset; cur_offset++)

				{

				// calculate dr (applying periodic boundary conditions)

				float dx = px - sdata[cur_offset];

				dx = dx - box.Lx * rintf(dx * box.Lxinv);

				float dy = py - sdata[cur_offset + NLIST_BLOCK_SIZE];

				dy = dy - box.Ly * rintf(dy * box.Lyinv);

				float dz = pz - sdata[cur_offset + 2*NLIST_BLOCK_SIZE];

				dz = dz - box.Lz * rintf(dz * box.Lzinv);

				float drsq = dx*dx + dy*dy + dz*dz;

			   // we don't add if we are comparing to ourselves, and we don't add if we are above the cut

			   if ((drsq < r_maxsq) && ((start + cur_offset) != pidx))

				   {

				   if (n_neigh < nlist.height)

					   {

					   nlist.list[pidx + n_neigh*nlist.pitch] = start+cur_offset;

					   n_neigh++;

					   }

				   else

						*nlist.overflow = 1;

				   }

				}

			}

		}

	// now that we are done: update the first row that lists the number of neighbors

	if (pidx < pdata.N)

		{

		nlist.n_neigh[pidx] = n_neigh;

		nlist.last_updated_pos[pidx] = pdata.pos[pidx];

		}

	}

Even with only 500 particles, a GTX 280 is ~5x faster than an Intel Q9300.

With 20,000 particles, the speedup is 83.

With 50,000 particles, the speedup is 102.

I’m not patient enough to run anything bigger. O(N^2) calculations are so slow :P

Okay, well two quick questions for you.

Even though my algorithm is NOT optimized, if you agree it should still speed up the GPU-based simulator, do you further agree that the reason it currently isn’t faster is that the CPU-based simulator is only working for a second or so per iteration? As such, it’s perhaps difficult to notice the GPU benefit on such a small computation.

Secondly, my initial goal is just to increase the amount of computation, per iteration, on BOTH simulators. And with my current setup of sending the whole distance array back, adding more objects really isn’t possible.

As such, and again since my goal is merely to significantly add computation on a per-iteration basis, I’d like to just add a loop to the kernel…basically have it do the SAME computation 1000 times, for example. Is that possible? On the corresponding part of the CPU simulator, where I calculate the distances and neighbor lists, I added a loop having it do the same repeated computation 1000 times.

So basically, I’d like to do something like this (but I’m unsure about how to execute this for loop on the GPU):

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numObjects + j;

	for (int k = 0; k < 1000; k++) {

		int xDiff = x_d[i] - x_d[j];

		int yDiff = y_d[i] - y_d[j];

		if (i < numObjects && j < numObjects)

			distance_d[idx] = (int)sqrt((double)(xDiff*xDiff) + (yDiff*yDiff));

	}

}

When I ran that program, it seemed like the GPU had a stroke. My monitor turned off, claiming it was receiving no signal. I waited and waited. Then I just had to manually power off the machine. So my guess is this wasn’t the correct way of executing the for loop.

I know you probably think this is crazy. But at this point, I just want to see if there is a speedup on the GPU. Ideally, I could just increase the # of objects, but this isn’t possible. That’s why I’m doing it this backwards way…just for the purpose of testing for speed increases.

Thanks.

Umm, I don’t really agree. A “too-small” computation for the GPU is one where you are running fewer than a thousand threads (as I said above) or the computation takes less than ~20 microseconds. A 1-second computation time is extremely long!

Sure, it is possible. But I don’t really see a point in it. You are no longer benchmarking anything practical. And you are also at the whim of the compiler. Smart compilers can see that you are doing the same numerical calculations in each iteration of the loop, factor them out, and then notice that the loop does nothing and remove it. In other words, unless you read the assembly code, you don’t know whether you are actually performing the calculations in a loop or not… Of course, you can always introduce a dependency between loop iterations to prevent the compiler from doing that.
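For instance, a minimal sketch of that idea (not your exact kernel, just the pattern): feed each iteration’s result into the next one so the compiler can’t fold the loop away.

if (i < numObjects && j < numObjects) {
	int acc = 0;
	for (int k = 0; k < 1000; k++) {
		// xDiff now depends on the previous iteration, so the math can't be hoisted out of the loop
		int xDiff = x_d[i] - x_d[j] + (acc & 1);
		int yDiff = y_d[i] - y_d[j];
		acc += (int)sqrt((double)(xDiff*xDiff + yDiff*yDiff));
	}
	distance_d[idx] = acc / 1000;   // roughly the same distance as the single-pass version
}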

If your calculation is so short (microseconds) that you need to average over many loops to get an estimate of the run time, just call the kernel many times.
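For example, on the host (sketch; nReps is arbitrary):

int nReps = 1000;
for (int rep = 0; rep < nReps; rep++)
	computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
cudaThreadSynchronize();   // wait for all launches to finish before stopping the timer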

When I ran that program, it seemed like the GPU had a stroke. My monitor turned off, claiming it was receiving no signal. I waited and waited. Then I just had to manually power off the machine. So my guess is this wasn't the correct way of executing the for loop.

Well, if you were already at a computation time of 1 second, increasing it would put you over the 5s watchdog timer limit. Your app should have timed out and been killed after 5 seconds of runtime. If you are on Windows, this is sometimes known to take down the entire system instead of failing gracefully.

Ok, that’s fine. There is nothing wrong with doing crazy things. But I can tell you right now that you won’t see much of a speedup with this code:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numObjects + j;

	for (int k = 0; k < 1000; k++) {

		int xDiff = x_d[i] - x_d[j];

		int yDiff = y_d[i] - y_d[j];

		if (i < numObjects && j < numObjects)

			distance_d[idx] = (int)sqrt((double)(xDiff*xDiff) + (yDiff*yDiff));

	}

}
  1. You’ve got 4 memory reads, 1 memory write and 6 FLOPS per iteration per thread. Thus your performance is definitely bound by memory bandwidth.

  2. 2 of the 4 memory reads are uncoalesced (x_d[j] and y_d[j]), which is a factor of ~20 performance decrease in memory bandwidth.

Maybe you don’t realize how bad uncoalesced reads are:

On a GTX 285 a coalesced read gives 130 GiB/s of bandwidth while an uncoalesced one is only around 6 GiB/s. Just this one thing puts you back into a realm where the CPU is capable of delivering the same performance. So memory bandwidth bound kernels with a high ratio of uncoalesced accesses are rarely much faster than a corresponding CPU implementation.

If you want to run that same kernel with minimal modifications, then bind x_d and y_d to textures with cudaBindTexture and read them with tex1Dfetch. It is not the ideal thing to do performance-wise, but it is the simplest modification to your existing code to get something approaching full performance from the memory reads.
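Roughly like this (sketch only; adapt it to your own declarations):

// file-scope texture references
texture<int, 1, cudaReadModeElementType> texX;
texture<int, 1, cudaReadModeElementType> texY;

// host side, before the kernel launch
cudaBindTexture(0, texX, x_onDevice, numObjects*sizeof(int));
cudaBindTexture(0, texY, y_onDevice, numObjects*sizeof(int));

// inside the kernel, replace the uncoalesced reads
int xDiff = x_d[i] - tex1Dfetch(texX, j);
int yDiff = y_d[i] - tex1Dfetch(texY, j);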

You know, you’ve definitely proven to me one thing: I can’t just try to understand this part of CUDA or that part. Rather, I need to just sit down and spend the time to go through the programming guide as well as perhaps even follow along with one of the classes given on the Education links. Apparently, they even have video files to watch the presentation while listening to the lecture. I’ve basically made up my mind to do this over the next couple of months.

Honestly, I don’t really get your kernel. It’s a bit too technical at this point. I need to learn more about shared memory and the memory access patterns of the GPU.

Here’s a paper I found that was simply calculating the Euclidean distances between points:
http://bioinformatics.louisville.edu/ouyang/Chang_etal_CBB2008_634-017.pdf

Would you say their implementation is simple and something I should perhaps look at?

Thanks.

Okay, quick question. Currently, the speedup on this very, VERY humble kernel is a whopping 3x. I’m not familiar enough with memory on the GPU to understand coalescing, aligning data structures to boundaries, etc. But I will learn soon!

In the meantime, I saw someone mention you can get at least a bit more speedup by simply first moving your data to shared memory on the device. Simple as that: first move it to shared memory and then execute the rest of the kernel.

I did that (code below), and this resulted in a 10x speedup overall. So from 3x to 10x. Does that make sense? Would you have expected a roughly 300% improvement over the previous version just from loading the arrays into shared memory first?

Here’s the old kernel:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numObjects + j;

	

	int xDiff = x_d[i] - x_d[j];

	int yDiff = y_d[i] - y_d[j];

	__syncthreads();

	if (i < numObjects && j < numObjects)

		distance_d[idx] = (int)sqrt((double)(xDiff*xDiff) + (yDiff*yDiff));

}

And here is the new kernel:

__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {

	int i = blockIdx.x*blockDim.x + threadIdx.x;

	int j = blockIdx.y*blockDim.y + threadIdx.y;

	int idx = i*numObjects + j;

	__shared__ float newX[numMobileObjects];

	__shared__ float newY[numMobileObjects];

	int thread_id = blockIdx.x*blockDim.x + threadIdx.x;

	if ( thread_id < numObjects) {

		newX[thread_id] = x_d[thread_id];

		newY[thread_id] = y_d[thread_id];

	}

	__syncthreads();

	int xDiff = newX[i] - newX[j];

	int yDiff = newY[i] - newY[j];

	__syncthreads();

	if (i < numObjects && j < numObjects)

		distance_d[idx] = (int)sqrt((double)(xDiff*xDiff) + (yDiff*yDiff));

}

So does the improvement make sense? Or maybe I’m just doing something silly.