Vehicle Routing Problem with CUDA

I’m designing an application to solve a vehicle routing problem using ant colony optimization, and I’m looking for some advice to make sure that I can take advantage of GPU processing and CUDA at some point in the future. The application will be written in C++ and will run on Linux. I’d like to keep as many options open as possible, and GPU processing seems like an ideal way to solve a vehicle routing problem. I’d also like to have a lot of scalability available if needed, i.e. multiple servers and multiple Tesla cards all working on a subset of the problem and feeding the results back to a central server.

I’m very new to GPU processing and CUDA, so sorry if any of this information is already available elsewhere. I did a quick search in the forum and on Google and couldn’t find anything that answered my questions.

The questions I’ve got are:

  1. Are there any specific threading APIs that work well (or should be avoided) when working with CUDA? I’m weighing up the pros and cons of POSIX threads, OpenMP, Intel Threading Building Blocks and Boost.Thread. I read somewhere that a CPU thread can only talk to one GPU at a time.

  2. As the code will initially just use standard CPU processing, albeit split over many threads, are there any steps I should take to allow CUDA and GPU processing to be added at a later date with minimal effort/re-coding?

  3. I’m planning to use MPI to allow the program to be split over several servers. Is there anything I should consider when using MPI with CUDA?

To quickly introduce what the program will do…

There are a number of customers that must be serviced by a fleet of vehicles, and the program will find the optimal route for each of the vehicles to ensure that each customer is visited whilst keeping within certain parameters.

To solve the problem I’m planning to use ant colony optimization, which is based on a set of ants. Each ant will act independently of all the other ants and will try various routes to solve the problem. Even though the ants are operating independently, they will share a common view of the road network and customers. As the ants travel across the roads they will leave behind a pheromone trail. Other ants will use a combination of the pheromone trail and other factors when deciding which road/customer they should choose next. Over time the pheromone trail will evaporate, so roads that are frequently travelled by ants will keep a strong pheromone trail, whilst little-used roads will end up with a weak one.

Over several thousand iterations the ants will identify the optimal routes of each of the vehicles.
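
To make that concrete, the heart of each iteration would look roughly like the sketch below. This is only illustrative C++, not final design: names such as chooseNextCustomer, pheromone, distance, alpha and beta are placeholders.

#include <cmath>
#include <cstdlib>
#include <vector>

// Illustrative only: pick the next customer for one ant by weighting each
// unvisited candidate with pheromone^alpha * (1/distance)^beta, then doing
// roulette-wheel selection. Matrices are flat, row-major arrays.
int chooseNextCustomer(int current,
                       const std::vector<bool>& visited,
                       const std::vector<float>& pheromone,
                       const std::vector<float>& distance,
                       int numCustomers, float alpha, float beta)
{
    std::vector<float> weight(numCustomers, 0.0f);
    float total = 0.0f;
    for (int j = 0; j < numCustomers; ++j) {
        if (!visited[j] && j != current) {
            int idx = current * numCustomers + j;
            weight[j] = std::pow(pheromone[idx], alpha) *
                        std::pow(1.0f / distance[idx], beta);
            total += weight[j];
        }
    }
    if (total <= 0.0f) return -1;              // nothing left to visit

    float r = total * (std::rand() / (float)RAND_MAX);
    float acc = 0.0f;
    int lastCandidate = -1;
    for (int j = 0; j < numCustomers; ++j) {
        if (weight[j] > 0.0f) {
            lastCandidate = j;
            acc += weight[j];
            if (acc >= r) return j;
        }
    }
    return lastCandidate;                      // guard against rounding
}

Evaporation would then just be scaling every pheromone entry by (1 - rho) once per iteration before the best ants deposit their pheromone.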

You read right. CUDA requires a separate host thread per GPU context. I haven’t tried Intel’s threading primitives with CUDA, but pthreads and Boost threads work well (I would guess Boost is the most widely used). OpenMP works as well, but (in my opinion) the OpenMP programming model isn’t all that well suited to generalized multithreading, even though it is useful enough for coding parallel compute applications.
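
As a purely illustrative sketch of the one-host-thread-per-GPU pattern with Boost threads (deviceWorker is a placeholder name; the point is that each host thread calls cudaSetDevice on its own GPU and keeps that device for the life of the thread):

#include <boost/thread.hpp>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-device worker: each host thread owns exactly one GPU.
void deviceWorker(int device)
{
    cudaSetDevice(device);   // bind this host thread to its GPU
    // ... allocate device memory, launch kernels, copy results back ...
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<boost::thread*> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.push_back(new boost::thread(deviceWorker, d));

    for (size_t i = 0; i < workers.size(); ++i) {
        workers[i]->join();
        delete workers[i];
    }
    return 0;
}

The same structure works with pthreads or TBB tasks; the only CUDA-specific part is that the device binding belongs to the thread that made it.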

Data structures are probably the key area where you can make life much more complex than it needs to be. Like most distributed memory paradigms, the CUDA model is a lot simpler to work with if you use flat arrays and indexing (or perhaps structures of arrays) rather than pointer-based trees and elaborate arrays of structures. Your code might wind up looking more like FORTRAN 77 than C++, but what you trade off in elegance, you get back many fold in portability and performance.
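
For instance, the road-network distances for the routing problem above are much easier to move to a GPU later if they live in one flat, row-major array rather than nested vectors or a pointer-based graph. A minimal sketch (uploadDistanceMatrix is just an illustrative name):

#include <cuda_runtime.h>
#include <vector>

// Sketch: a row-major, flat distance matrix. distance(i, j) lives at
// index i * numCustomers + j, and the identical indexing works inside
// a kernel once the array has been copied to the device.
float* uploadDistanceMatrix(const std::vector<float>& h_distance)
{
    float* d_distance = 0;
    size_t bytes = h_distance.size() * sizeof(float);
    cudaMalloc((void**)&d_distance, bytes);
    cudaMemcpy(d_distance, &h_distance[0], bytes, cudaMemcpyHostToDevice);
    return d_distance;
}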

MPI works fine with CUDA. In setups with multiple GPUs per MPI host and multiple MPI hosts, you might need to go to a multi-level communicator structure (colouring) to get optimal GPU affinity, but I haven’t had any major problems using the two together (in fact I usually skip host threading for multi-GPU and just use MPI, but then most of my code winds up running on distributed memory systems anyway, and MPI is certainly a lot heavier IPC model than pthreads).
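
For what it’s worth, here is a rough sketch of the colouring idea: split MPI_COMM_WORLD into per-host communicators and use the rank within the node to pick a GPU. Hashing the processor name is just one simple way to build the colour, not a recommendation, and hash collisions would need handling in real code:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    // Build a per-host colour from the processor name so that ranks on
    // the same node end up in the same sub-communicator.
    char name[MPI_MAX_PROCESSOR_NAME];
    int nameLen = 0;
    MPI_Get_processor_name(name, &nameLen);
    unsigned int hash = 0;
    for (int i = 0; i < nameLen; ++i)
        hash = hash * 31u + (unsigned char)name[i];
    int colour = (int)(hash % 32768u);      // colour must be non-negative

    MPI_Comm nodeComm;
    MPI_Comm_split(MPI_COMM_WORLD, colour, worldRank, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);

    int deviceCount = 1;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(localRank % deviceCount); // one GPU per node-local rank

    // ... each rank now works on its subset of ants/customers ...

    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
    return 0;
}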

To add to avidday’s great answer, I often find myself merging different but related arrays into a single flat array of float2/float4.

For example, if you have four arrays, instead of using:

vector<float> m_dataA;
vector<float> m_dataB;
vector<float> m_dataC;
vector<float> m_dataD;

use this:

float4 *m_pData;

....

for ( int i = 0; i < SIZE; i++ )
{
   m_pData[ i ].x = ....;   // Value of m_dataA[ i ]
   m_pData[ i ].y = ....;   // Value of m_dataB[ i ]
   m_pData[ i ].z = ....;   // Value of m_dataC[ i ]
   m_pData[ i ].w = ....;   // Value of m_dataD[ i ]
}
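
On the device side, each thread then pulls all four values with a single float4 load, something like this sketch (useMergedData is just an illustrative kernel name):

// Sketch of the matching device-side read: one float4 load per thread
// replaces what would otherwise be four separate float loads.
__global__ void useMergedData(const float4* pData, float* out, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        float4 v = pData[i];               // single 128-bit load
        out[i] = v.x + v.y + v.z + v.w;    // placeholder computation
    }
}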

Reading float4/float2 from global memory will be faster than reading 4 or 2 independent single floats.

eyal

Not true. You can get peak bandwidth from accessing float, float2 and float4 (or equivalents in terms of size). I have a bandwidth tester made by someone from the forum that can show this if anyone’s interested (I wish I remembered who the author was).

IIRC, a long time ago on G80s it was true that float4 was faster when reading textures. But that was only with textures and only on G80s.

Actually I’ve tested probably the same code you’re referring to, from MisterAnderson on the forums.

It definitely shows that float4 gmem reads (he showed textures and read/write as well) get better peak performance than float and float2. Search the forums for MisterAnderson’s benchmark.

I ran it yesterday and float2 was ~80% faster than a single float, and float4 was a bit higher.

Here’s the link - mind you that his statistics were done on an old card:

http://forums.nvidia.com/index.php?showtop…mp;#entry290441

These are the numbers for my GTX280:

read_only_gmem<float> - Bandwidth:	31.745642 GiB/s
read_only_gmem<float2> - Bandwidth:	54.892191 GiB/s
read_only_gmem<float3> - Bandwidth:	26.831120 GiB/s
read_only_gmem<float4> - Bandwidth:	46.178393 GiB/s
read_only_tex<float> - Bandwidth:	33.811652 GiB/s
read_only_tex<float2> - Bandwidth:	58.366266 GiB/s
read_only_tex<float4> - Bandwidth:	71.162112 GiB/s

eyal

Ahh, good old bw_test. Your GTX 280 results are very interesting. I find much flatter results among the different sizes on a Tesla S1070. I wonder what the differences are? And shouldn’t a GTX 280 be pegging closer to 100+ GiB/s?

copy_gmem<float> - Bandwidth:	67.960869 GiB/s
copy_gmem<float2> - Bandwidth:	74.566497 GiB/s
copy_gmem<float4> - Bandwidth:	72.828753 GiB/s
copy_tex<float> - Bandwidth:	66.388596 GiB/s
copy_tex<float2> - Bandwidth:	72.134653 GiB/s
copy_tex<float4> - Bandwidth:	75.224281 GiB/s
write_only<float> - Bandwidth:	61.352675 GiB/s
write_only<float2> - Bandwidth:	66.332978 GiB/s
write_only<float4> - Bandwidth:	67.814618 GiB/s
read_only_gmem<float> - Bandwidth:	61.314812 GiB/s
read_only_gmem<float2> - Bandwidth:	79.614283 GiB/s
read_only_gmem<float4> - Bandwidth:	45.674393 GiB/s
read_only_tex<float> - Bandwidth:	59.928306 GiB/s
read_only_tex<float2> - Bandwidth:	69.090485 GiB/s
read_only_tex<float4> - Bandwidth:	66.968430 GiB/s

Can I add to the confusion with a set of numbers from a stock GTX275 without an attached display:

copy_gmem<float> - Bandwidth:	105.217803 GiB/s
copy_gmem<float2> - Bandwidth:	106.697101 GiB/s
copy_gmem<float4> - Bandwidth:	98.037641 GiB/s
copy_tex<float> - Bandwidth:	107.958303 GiB/s
copy_tex<float2> - Bandwidth:	111.421910 GiB/s
copy_tex<float4> - Bandwidth:	111.079990 GiB/s
write_only<float> - Bandwidth:	67.699114 GiB/s
write_only<float2> - Bandwidth:	68.142635 GiB/s
write_only<float4> - Bandwidth:	67.403827 GiB/s
read_only_gmem<float> - Bandwidth:	36.004807 GiB/s
read_only_gmem<float2> - Bandwidth:	62.574882 GiB/s
read_only_gmem<float4> - Bandwidth:	51.968654 GiB/s
read_only_tex<float> - Bandwidth:	38.499446 GiB/s
read_only_tex<float2> - Bandwidth:	66.209763 GiB/s
read_only_tex<float4> - Bandwidth:	94.948009 GiB/s

My read numbers show the same trends as eyal’s, but the numbers themselves are considerably higher than the GTX280 results, which is odd given the GTX275 only has a 448-bit memory bus. And it looks like both Big_Mac and Eyal are right - you can hit almost peak bandwidth with each type, but the read performance, both from global memory and textures, is better with the vector types than the scalar.

And here are mine for a lowly 8800 GTS 512

copy_gmem<float> - Bandwidth:	49.952890 GiB/s
copy_gmem<float2> - Bandwidth:	49.029506 GiB/s
copy_gmem<float4> - Bandwidth:	46.918432 GiB/s
copy_tex<float> - Bandwidth:	45.822101 GiB/s
copy_tex<float2> - Bandwidth:	48.088663 GiB/s
copy_tex<float4> - Bandwidth:	48.237572 GiB/s
write_only<float> - Bandwidth:	38.122706 GiB/s
write_only<float2> - Bandwidth:	38.275760 GiB/s
write_only<float4> - Bandwidth:	38.362752 GiB/s
read_only_gmem<float> - Bandwidth:	50.101352 GiB/s
read_only_gmem<float2> - Bandwidth:	42.559209 GiB/s
read_only_gmem<float4> - Bandwidth:	37.553127 GiB/s
read_only_tex<float> - Bandwidth:	46.070905 GiB/s
read_only_tex<float2> - Bandwidth:	52.863209 GiB/s
read_only_tex<float4> - Bandwidth:	52.308339 GiB/s

Fairly consistent with an interesting dip with float2 and float4 gmem reads. Using float4s would actually make some code slower on my GPU.

You’re correct regarding my GTX280 (it’s the primary card on Windows - maybe that can explain it)… I’ll check tomorrow with a C1060 and an S1070 on Linux. Also, from past conversations (in the forums) I was indeed under the impression that float4 would be faster than float/float2 - it only seems reasonable to get 128 bits in one read instead of 4 x 32-bit reads.

In my main algorithm, moving from 4 float reads from textures to 2 float2 reads gained me a ~30% performance boost.

As for float4 vs float2 (gmem float2 reads being faster than float4 on all cards), I’ll check on mine tomorrow and report.

I might get another few % boost from using float2 instead of float4 :)

eyal

Hi,

Good news first :) - I think there is a small bug in MisterAnderson’s code.

The lines in the original code:

...
*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;
...

Should be:

*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0f;  // Use 1.0f instead of 1.0

Without the f, the compiler should use double precision by default, if I’m not mistaken.

Anyway - changing to use the 1.0f code - I got more reasonable results:

Windows, GTX280 - main graphics card:

copy_gmem<float> - Bandwidth:	110.190408 GiB/s
copy_gmem<float2> - Bandwidth:	107.160643 GiB/s
copy_gmem<float4> - Bandwidth:	85.671774 GiB/s
copy_tex<float> - Bandwidth:	107.959503 GiB/s
copy_tex<float2> - Bandwidth:	114.442128 GiB/s
copy_tex<float4> - Bandwidth:	116.107641 GiB/s
write_only<float> - Bandwidth:	66.120493 GiB/s
write_only<float2> - Bandwidth:	68.612180 GiB/s
write_only<float4> - Bandwidth:	68.558899 GiB/s
read_only_gmem<float> - Bandwidth:	62.264062 GiB/s
read_only_gmem<float2> - Bandwidth:	83.538114 GiB/s
read_only_gmem<float3> - Bandwidth:	38.970951 GiB/s
read_only_gmem<float4> - Bandwidth:	45.673215 GiB/s
read_only_tex<float> - Bandwidth:	63.386941 GiB/s
read_only_tex<float2> - Bandwidth:	100.702475 GiB/s
read_only_tex<float4> - Bandwidth:	97.344738 GiB/s

Windows - C1060:

copy_gmem<float> - Bandwidth:	71.168222 GiB/s
copy_gmem<float2> - Bandwidth:	76.422516 GiB/s
copy_gmem<float4> - Bandwidth:	74.814419 GiB/s
copy_tex<float> - Bandwidth:	70.430719 GiB/s
copy_tex<float2> - Bandwidth:	76.336920 GiB/s
copy_tex<float4> - Bandwidth:	77.340876 GiB/s
write_only<float> - Bandwidth:	68.881375 GiB/s
write_only<float2> - Bandwidth:	71.039391 GiB/s
write_only<float4> - Bandwidth:	70.674548 GiB/s
read_only_gmem<float> - Bandwidth:	62.858476 GiB/s
read_only_gmem<float2> - Bandwidth:	82.381297 GiB/s
read_only_gmem<float3> - Bandwidth:	26.859308 GiB/s
read_only_gmem<float4> - Bandwidth:	46.392746 GiB/s
read_only_tex<float> - Bandwidth:	64.596099 GiB/s
read_only_tex<float2> - Bandwidth:	72.354688 GiB/s
read_only_tex<float4> - Bandwidth:	71.344154 GiB/s

So maybe you guys could re-run your tests with the fix.

It seems though that float4 indeed doesn’t contribute too much (and in the read_only_gmem case it even gets worse performance); float2 operations are consistently the best.

Two more things disturb me, however:

  1. The tests are done using a mere 4MB of data - this might be too small a test case. I’ve managed to change the code to use 40MB and the results were still the same. I’ll try to run with ~800MB and see.

  2. The code in the read_only_xxx methods might not give the correct answers:

__shared__ T shared[BLOCK_SIZE];
shared[threadIdx.x] = g_idata[idx];
*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0f;

All those calculations take time, and the bandwidth is computed as the amount of data read divided by the total elapsed time. This might cause the numbers to be lower than they actually are. I know this is an issue nVidia also stumbled on in the Visual Profiler.
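
For reference, the GiB/s figures come from something like the generic pattern below - time the kernel with CUDA events and divide the bytes touched by the elapsed time - so any extra work inside the kernel inflates the time and deflates the reported bandwidth. This is only a generic sketch, not MisterAnderson’s actual harness; launchKernelUnderTest is a placeholder.

#include <cuda_runtime.h>

// Generic sketch of how an effective-bandwidth number is obtained.
double measureBandwidthGiB(void (*launchKernelUnderTest)(), size_t bytesMoved)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launchKernelUnderTest();            // launch the kernel being timed
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (double)bytesMoved / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0);
}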

What do you think?

I’ll run the code on a GTX295 under Linux and maybe on an S1070 as well and report again…

thanks

eyal

Nice catch. Here are my updated results on a stock GTX275:

avid@cuda:~$ module load cuda/2.3
avid@cuda:~$ nvcc -arch sm_13 -o bw_test bw_test.cu
avid@cuda:~$ ./bw_test
copy_gmem<float> - Bandwidth:	105.218827 GiB/s
copy_gmem<float2> - Bandwidth:	106.702942 GiB/s
copy_gmem<float4> - Bandwidth:	98.027962 GiB/s
copy_tex<float> - Bandwidth:	107.954979 GiB/s
copy_tex<float2> - Bandwidth:	111.419590 GiB/s
copy_tex<float4> - Bandwidth:	111.080472 GiB/s
write_only<float> - Bandwidth:	67.696924 GiB/s
write_only<float2> - Bandwidth:	68.141381 GiB/s
write_only<float4> - Bandwidth:	67.403741 GiB/s
read_only_gmem<float> - Bandwidth:	71.441638 GiB/s
read_only_gmem<float2> - Bandwidth:	95.762403 GiB/s
read_only_gmem<float4> - Bandwidth:	52.389702 GiB/s
read_only_tex<float> - Bandwidth:	71.498755 GiB/s
read_only_tex<float2> - Bandwidth:	97.963488 GiB/s
read_only_tex<float4> - Bandwidth:	90.378550 GiB/s

which look a lot more sensible, except perhaps for the float4 global memory read performance, which looks a bit anomalous.

Here are the results for Linux + GTX295 (run on one half; the machine has 3 GTX295s):

copy_gmem<float> - Bandwidth:	89.533044 GiB/s
copy_gmem<float2> - Bandwidth:	91.090342 GiB/s
copy_gmem<float4> - Bandwidth:	83.906285 GiB/s
copy_tex<float> - Bandwidth:	91.179554 GiB/s
copy_tex<float2> - Bandwidth:	94.452564 GiB/s
copy_tex<float4> - Bandwidth:	94.215312 GiB/s
write_only<float> - Bandwidth:	58.590617 GiB/s
write_only<float2> - Bandwidth:	59.084863 GiB/s
write_only<float4> - Bandwidth:	58.427421 GiB/s
read_only_gmem<float> - Bandwidth:	61.427700 GiB/s
read_only_gmem<float2> - Bandwidth:	80.811450 GiB/s
read_only_gmem<float4> - Bandwidth:	44.508571 GiB/s
read_only_tex<float> - Bandwidth:	61.865818 GiB/s
read_only_tex<float2> - Bandwidth:	83.759034 GiB/s
read_only_tex<float4> - Bandwidth:	77.289353 GiB/s

Yes, it seems like our results are synced.

float4 is indeed weird. What do you think? Could it be that there is no “real” 128-bit transaction and it’s actually 4 x 32-bit reads?

Maybe it somehow relates to another thread I’ve opened:

http://forums.nvidia.com/index.php?showtopic=156065

eyal

It seems using float2 only gives a very marginal bandwidth boost compared with using float. That is not what I expected - I would have expected around a 2-fold improvement, since each memory access instruction fetches twice the data. The float4 result is not that surprising to me, since according to the nVidia documentation, if each thread accesses 128-bit words it will result in two 128-byte memory transactions, i.e. the equivalent of two float2 reads.

Well, I guess it is now bound by the memory bus speed rather than by latency. Indeed, the results for the 295 and 275 show that the achieved bandwidth is around 80% of the theoretical limit, so this kernel is not latency bound but bandwidth bound.

I am guessing the latency is hidden efficiently by thread scheduling. Maybe we should add shared memory usage per thread to limit the number of threads per block, or reduce the problem size, so that we can see the raw performance of reading float2 and float4?

Yes, you’re correct, this is what we thought as well. It’s weird that the only tests showing a significant improvement of float2 over float are the read_only tests (gmem and texture).

You mean: “will result in 2 64-byte?” and not “2 128-byte”???

Can you please point to where you saw this statement in the documents?

I actually thought of increasing the problem size, as this is probably a more realistic test and might remove issues such as partition camping et al…

eyal

You mean: “will result in 2 64-byte?” and not “2 128-byte”???
Can you please point to where you saw this statement in the documents?

1 float = 4 bytes (32-bit).
So for a half-warp, a coalesced read by 16 threads requires 64 bytes for float, 128 bytes for float2 (64-bit) and 256 bytes for float4 (128-bit).

In the CUDA programming guide, pages 75-76 (mine is version 2.2):

Coalescing on Devices with Compute Capability 1.0 and 1.1
The global memory access by all threads of a half-warp is coalesced into one or two memory transactions if it satisfies the following three conditions:
1. Threads must access
- either 32-bit words, resulting in one 64-byte memory transaction,
- or 64-bit words, resulting in one 128-byte memory transaction,
- or 128-bit words, resulting in two 128-byte memory transactions;
(2 and 3 are the alignment requirement and the access-order requirement.)

On devices with compute capability 1.2 and above, things changed slightly:

Coalescing on Devices with Compute Capability 1.2 and Higher
The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:
- 32 bytes if all threads access 8-bit words,
- 64 bytes if all threads access 16-bit words,
- 128 bytes if all threads access 32-bit or 64-bit words.

and

If a half-warp addresses words in n different segments, n memory transactions are issued (one for each segment), whereas devices with lower compute capabilities would issue 16 transactions as soon as n is greater than 1. In particular, if threads access 128-bit words, at least two memory transactions are issued.

I see. It seems you’re correct :) That means float4 is not such a good idea after all… weird…

thanks

eyal

Not really that weird. I mean all those types of transactions (64B, 128B, 2x128B) can be theoretically executed at full bandwidth. Which happens to be the fastest may very well depend on the model or even revision of the particular GPU. Seeing as gmem accesses are quite consistent and only textures get funky, it’s possible that cache-logic is involved as well. And that’s evil black voodoo that’s affected by sunspots and the phase of the moon.

Maybe Fermi’s ECC can assist here ;)