Transfer-Bound Application: Looking for ideas to speed it up

This is a bit of a long post/question, but I’ve been stuck on this for a few weeks…

I have an application which, I believe, is transfer-bound (D2D[global memory] and D2H). As such, I’m looking into methods of tackling these issues.

The overall program structure, in pseudo-code, is as such:

// A BIG array of memory is malloc'd on the GPU. (I'm testing with an array of 1024 doubles,
// but in the future I'll be storing 1024x1024x1024.)

for (int n = 0; n < BIG_NUMBER; ++n) {

    kernelA<<<...>>>(array_ptr, ...);
    kernelB<<<...>>>(array_ptr, ...);

    if (n % F == 0) {
        // Copy array back to Host
        // Print array
    }
}

Note:

  • BIG_NUMBER is usually 100,000 (But will most likely range from 50,000 to 10,000,000 in the future)

  • I'd like F to be 1, in a perfect world, but I find that I need to set it to 100 to get a large performance boost without losing too much information.

When BIG_NUMBER=100,000 and the array has 4096 elements here are the varying run-times with different values of F:

F = 1 —> 567s

F = 10 —> 213s

F = 100 —> 103s

F = inf. —> 91s

So you can see that if I never copy (D2H) + print (F = infinity) then execution time is 91s, compared to printing at each step (F = 1), which results in an execution time of 567s.

Also, my kernelB has ~20 global memory reads at the beginning and end of the function (as I need some variables to persist between kernel calls). I found that if I simply comment them out then my execution time (with F=inf) drops from 91s to 1s! [But, who knows, perhaps the compiler is being naughty? {although I have no optimization flags set}]

To pre-emptively answer some questions you may have at this stage:

  • Yes, I require 2 kernels (as I need grid-level synchronization)

  • Yes, I NEED to print results as I will be examining the state of this array at each point in time - not just the final result. Also, the data which the program outputs is then used to generate plots.

So, the way I see it, I can improve my application speed significantly if I can optimize these 2 ‘transfers’. After reading and thinking about this I can only think of 2 ideas:

1. Making sure my memory transactions are coalesced

I'm not sure if they are? I keep re-reading this section in the programming guide and it just won't sink in. I have an array of doubles (20 elements for each thread) and each thread writes multiple values to a separate section of memory… (i.e. thread0 writes doubles to array[0..19], and so on…)

2. Perhaps using mapped pinned memory (aka. ‘Zero Copy’)

I read the PinnedMemoryAPI white-paper and am not 100% clear on the benefits (or whether it's appropriate for my problem). If I understand correctly I wouldn't need to allocate any global memory on the GPU AND I'd get greater read/write speeds? But I wouldn't be able to control F, the frequency at which this happens?

If anyone else has any other ideas please feel free to mention them :)

Thanks for taking the time to read this.

ADDITIONAL INFO:

  • GPU = TESLA C1060 on a decent i7 system

  • OS = Ubuntu 9.04

  • SDK = 2.3

  • Kernel Profile:

(profiler screenshot; not reproduced here)

  • Kernel ptxas info:

kernelA - ptxas info : Used 124 registers, 104+0 bytes lmem, 28+16 bytes smem, 132 bytes cmem[1]

kernelB - ptxas info : Used 10 registers, 16+16 bytes smem, 4 bytes cmem[1]

It really depends on the actual application bottlenecks and memory transfer sizes, BUT with some async mem copies you likely can get a lot better throughput.
The key is that if you use streams and events, you can make that memcopy asynchronous so you’re transferring data to the host even while the GPU is already working on the next round of compute. This simultaneous transfer and execution does require a Compute 1.2+ device though.

So you'd set up a stream, and queue up your two kernels, then your ASYNCHRONOUS memory copy back to host, then your "copy done" event, then the two kernels again. Your CPU keeps polling for the mem transfer completion and prints the state as soon as it's ready… and then likely inserts the next memcopy, event signal, and 2 kernel launches again.
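
Here's a minimal sketch of that queueing pattern, assuming a pinned host buffer array_h (allocated with cudaMallocHost), a device buffer array_d of N_BYTES, and placeholder grid/block dimensions:

// Minimal sketch only; array_h, array_d, N_BYTES, grid and block are placeholders.
cudaStream_t stream;
cudaEvent_t  copyDone;
int pendingPrint = 0;

cudaStreamCreate(&stream);
cudaEventCreate(&copyDone);

for (int n = 0; n < BIG_NUMBER; ++n) {
    kernelA<<<grid, block, 0, stream>>>(array_d /*, ... */);
    kernelB<<<grid, block, 0, stream>>>(array_d /*, ... */);

    if (n % F == 0) {
        if (pendingPrint) {
            // Wait only for the previous snapshot's copy; the kernels queued
            // above keep the GPU busy while the CPU does the printing.
            cudaEventSynchronize(copyDone);
            // ... print array_h ...
        }
        cudaMemcpyAsync(array_h, array_d, N_BYTES, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(copyDone, stream);
        pendingPrint = 1;
    }
}

if (pendingPrint) {              // drain the final snapshot
    cudaEventSynchronize(copyDone);
    // ... print array_h ...
}
cudaStreamDestroy(stream);
cudaEventDestroy(copyDone);

With a single stream the copy itself still serializes with the kernels on the GPU, but the CPU no longer stalls the work queue between iterations; true copy/compute overlap would need a second stream and a double-buffered snapshot.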

This setup may or may not solve your problem, depending on the kernel speed as compared to your memory copy speed, but it won't hurt.
What this WON'T do is let you do analysis of the data on the CPU and then use the results to modify the next kernel launches. If you do need to do that, then you pretty much have to let the GPU grind to a halt like you're doing now.

Your idea of using zero-copy memory MIGHT also be successful… it depends on your memory access pattern. You’re right that you still need to play synchronization games to make sure one iteration of your kernel doesn’t overwrite your zero-copy memory even while the CPU is still plotting it. This could be done again with streams, and a little check in the KERNEL that knows whether it should copy the memory this iteration or not.
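
For reference, the mapped (zero-copy) allocation itself is only a few calls; this is a rough sketch with placeholder names (array_h, array_d, N_BYTES, grid, block), and keep in mind that every access through the mapped pointer crosses the PCIe bus:

double *array_h = 0;   // host pointer
double *array_d = 0;   // device alias of the same physical memory

// Must be set before the CUDA context is created (i.e. before any other CUDA call).
cudaSetDeviceFlags(cudaDeviceMapHost);

cudaHostAlloc((void**)&array_h, N_BYTES, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&array_d, array_h, 0);

kernelA<<<grid, block>>>(array_d /*, ... */);
kernelB<<<grid, block>>>(array_d /*, ... */);
cudaThreadSynchronize();   // make sure the kernels' writes have landed before the CPU reads array_h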

The compiler aggressively removes dead code, even if you have no optimization flags set. I find the best way to determine the contribution of a block of code to the runtime is to double it up, rather than remove it.

That pattern is not coalesced. The way to think of coalescing is that it is striping writes across the threads. This allows the multiprocessor to group writes from many threads into big transactions that make better use of the wide memory bus on the device. One way to see this is to think about what is happening when you come to a memory write instruction. The first write instruction (which runs in parallel for the entire half-warp) will have thread 0 write to array[0], thread 1 to array[20], thread 2 to array[40], and so on. The problem is that these locations are spaced out so far that this one instruction has to be handled as 16 separate memory transactions.

In a coalesced write, thread 0 writes to array[0], thread 1 writes to array[1], thread 2 writes to array[2], and so on. These simultaneous adjacent writes can be grouped into long 64-byte writes that maximize throughput in the memory controller. (Compute capability 1.2 and greater devices are more flexible about coalescing than I’ve described, so you can mix the above pattern up, but the key feature is being able to batch writes by threads in the same half-warp into contiguous chunks of memory.)
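
As a toy illustration of the two patterns (the kernel names and VARS_PER_THREAD are just for this example):

#define VARS_PER_THREAD 20

// Strided (uncoalesced): each thread owns a contiguous chunk, so one write
// instruction touches addresses VARS_PER_THREAD doubles apart across the half-warp.
__global__ void writeChunked(double *out)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int v = 0; v < VARS_PER_THREAD; ++v)
        out[tid * VARS_PER_THREAD + v] = 0.0;  // thread 0 -> out[0..19], thread 1 -> out[20..39], ...
}

// Interleaved (coalesced): one write instruction touches adjacent doubles
// across the half-warp.
__global__ void writeInterleaved(double *out, unsigned nThreads)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int v = 0; v < VARS_PER_THREAD; ++v)
        out[v * nThreads + tid] = 0.0;         // thread 0 -> out[v*N], thread 1 -> out[v*N+1], ...
}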

I see, thanks for your input. I do have a 1.3 compute capability device (TESLA C1060) and so I’ll try these 2 approaches and report back.

But, from what I read, I can't do both, as the whole purpose of the 'zero-copy' approach is to avoid the use of streams/events. So maybe I'll just try the former.

Thanks for confirming my suspicions. Also, increasing the amount of code is an interesting idea!

So is my ‘stride’ 20? And is it 20 because I have 20 read/writes in my code?

Hmm, I don’t see how this fixes the issue - I’ll re-read the coalescing stuff to refresh my memory and combine that information with what you’ve just presented me with. Your description is good, thanks. What I don’t understand is why the write instructions write 20 (or N, in the general case) memory locations apart? Must be because of the hardware?

What does this sentence mean?? What’s a word? How big is the segment?

How does one go about coalescing their global memory accesses?

Also, is there a debugger or profiler that can execute my code and simply tell me whether I have coalesced or non-coalesced memory access? cudaprof doesn’t seem to tell me this?

If you posted more of your kernel code, I could help point out whether or not your loads and stores are indeed coalesced and, if not, how to fix it ;)

Yes and yes

Since in your code thread 0 writes at array[0]…array[19] and thread 1 writes at array[20]…array[39] and so on, you try to write first at array[0] and array[20], then at array[1] and array[21], and so on.

Remember that threads run in parallel. So the problem is that one write (or read) instruction can only access one segment of memory (its size depends on the type of data you access) at a time. So if the writes of a half warp (16 threads) all fall within one segment of memory, they can be serviced in a single transaction. This means that if you start k threads per block, thread 0 should access array[0], array[k], array[2k] and so on, while thread 1 accesses array[1], array[k+1], array[2k+1] and so on.
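
In code, that per-block interleaving might look something like this (the kernel name is illustrative; each block works on its own contiguous tile of the array):

// k = blockDim.x threads per block; variable v of thread t lives at tile + v*k + t,
// so each load/store instruction hits k consecutive doubles.
__global__ void accessInterleaved(double *array, int numVars)
{
    unsigned k    = blockDim.x;
    unsigned tile = blockIdx.x * blockDim.x * numVars;  // this block's slice of the array
    unsigned t    = threadIdx.x;

    for (int v = 0; v < numVars; ++v)
        array[tile + v * k + t] += 1.0;   // thread 0: tile+0, tile+k, tile+2k, ...; thread 1: tile+1, tile+k+1, ...
}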

A word is one variable, for example a character, an integer, a float or a double.

It depends on how large your variables are: if you write integers (4 bytes), floats (4 bytes) or doubles (8 bytes), the segment is 128 bytes.

Btw.

If you want to produce some output from your data, how about calculating that output on the GPU? I mean, you don't really want to plot the equivalent of several gigabytes of data in the end, right?

So basically you want to download those large arrays to calculate some data points for a plot (each data point consists of at most n floats for an n-dimensional plot; that's only 8 bytes per point for a 2D plot, for example). So instead of downloading really large arrays, write a third kernel which calculates the points for your plot and download only those points.
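
As a rough sketch (the kernel name and the choice of a single output variable are just an example), assuming the roughly-20-values-per-thread layout you described:

// Illustrative only: copy one quantity per thread (here the first of the ~20
// per-thread values, downcast to float) into a small buffer for plotting.
__global__ void extractPlotData(const double *vars, float *plotOut, int varsPerThread)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    plotOut[tid] = (float)vars[tid * varsPerThread + 0];
}

Downloading plotOut instead of the full state array shrinks each D2H transfer from about 160 bytes per thread to 4 bytes per thread.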

Best regards

Ceearem

Remember that when you look at a loop like this in device code:

int offset = 20 * threadIdx.x;

for (int i = 0; i < 20; i++) {
    out[i + offset] = 0.0;
}

a single execution of the line that writes to the out array is actually happening simultaneously across the entire half-warp. It's not 16 different instructions, but rather 1 instruction that draws operands from the register files of each thread at the same time. The key to coalescing is that those implied simultaneous writes be to adjacent locations in some fashion. The "stride" is whatever you make it in your index calculation, since the value of offset is different for each thread.

The curse and blessing of CUDA is that the programming model lets you focus on what a single thread does, at the exclusion of the others. This is very convenient, except when threads can have performance impacts on each other. One situation is I/O to global memory, and another is branching.

It's probably too early to focus on optimizing coalesced memory access or even making the kernel run faster at all. The first post shows that the basic compute takes 91 seconds, but with the memcopy back to the CPU it takes 567 seconds. So about 85% of the runtime is in that CPU coordination… likely just idle GPU while the CPU is syncing and transferring and then relaunching the next kernel loop.

I’d focus on that first and find what the bottleneck is. One quick way at first is to double the mem transfer (just do it TWICE) and see how much that slows it down. If a double transfer doubles the runtime, it may be bandwidth. More likely, the double transfer won’t slow things down too much and that means it’s just launch scheduling, which can be improved by asynchronous streams, keeping the GPU work queue full.
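
Something as blunt as this is enough for that test (array_h, array_d and N_BYTES stand in for whatever the real copy uses):

// Do the same D2H copy twice; if the wall-clock time roughly doubles, the copies are
// bandwidth-bound, otherwise the cost is mostly the synchronization / launch overhead around them.
cudaMemcpy(array_h, array_d, N_BYTES, cudaMemcpyDeviceToHost);
cudaMemcpy(array_h, array_d, N_BYTES, cudaMemcpyDeviceToHost);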

I can’t post my code verbatim, but, the overall structure is very simple. Here is kernelA (which, according to profiling, takes up 90% of GPU execution time):

#define NUM_VARS 18  // (I was rounding to 20 in previous posts)

// The size of the 'vars' array is always NUM_VARS*NUM_THREADS and it resides in global memory.

__global__ void kernelA(double* vars, ...){

    unsigned index = (blockIdx.x * blockDim.x + threadIdx.x) * NUM_VARS;

    double a, b, c, d, e, f, g, ...;  // ~100 doubles declared

    // Load the current value of each variable
    a = vars[index];
    b = vars[index+1];
    c = vars[index+2];
    ...
    n = vars[index+17];

    /*
     * LOTS OF CALCULATIONS/MATH which modify the 18 variables a-n
     */

    // Save the current value of each variable
    vars[index]    = a;
    vars[index+1]  = b;
    vars[index+2]  = c;
    ...
    vars[index+17] = n;
}

Edit: I've updated my original post so as to include profiling output.

Agreed. Here are my results:

F=inf | Once = 91s,  Twice = 95s

F=1   | Once = 567s, Twice = 527s ← wtf!?

I am perplexed, to say the least…

A couple of thoughts:

    [*] Restructure your data so the code reads

#define NUM_VARS 18  // (I was rounding to 20 in previous posts)

// The size of the 'vars' array is always NUM_VARS*NUM_THREADS and it resides in global memory.

__global__ void kernelA(double* vars[NUM_VARS], ...){

    unsigned index = blockIdx.x * blockDim.x + threadIdx.x;

    double a, b, c, d, e, f, g, ...;  // ~100 doubles declared

    // Load the current value of each variable
    a = vars[0][index];
    b = vars[1][index];
    c = vars[2][index];
    ...
    n = vars[17][index];

    /*
     * LOTS OF CALCULATIONS/MATH which modify the 18 variables a-n
     */

    // Save the current value of each variable
    vars[0][index]  = a;
    vars[1][index]  = b;
    vars[2][index]  = c;
    ...
    vars[17][index] = n;
}

in order to get coalesced access to the variables. If this is not possible, move them through temporary variables in shared memory so that the global memory accesses are coalesced (see the sketch after this list).

[*] 100 doubles (200 registers) of automatic variables is quite a lot. Hopefully the compiler can optimize some of them away.

Check the occupancy that results from the number of registers used, and the amount of local memory your kernel might use due to register spills.

Try to reorder the calculation so that some variables have non-overlapping lifespans. Alternatively, put variables into shared memory rather than having them spill into slow global memory.

[*] If I read the numbers in your initial post correctly, the device to host copies only reach a rate of 0.8 MB/s. I guess I just read them wrong. If this were really true, there's something very wrong with the memory transfer.

[*] Even if you get the transfer fixed to reach decent rates in the GB/s range, the millionfold increase in your production runs will more than eat that up again. I wonder how you actually manually check the interim results of each iteration, and I doubt that your mental bandwidth can get anywhere near the GB/s the device is capable of.

Thus, whatever post-processing is done on the data for presentation, do it on the card and transfer only the processed result. Or do the whole plot on the card and display it via OpenGL, so the data never needs to be transferred off the device at all.
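
Regarding the shared-memory staging mentioned in the first point, here is a hedged sketch of what I mean (the kernel name, block size and the 18-values-per-thread layout are illustrative; the dynamic shared memory size has to be passed at kernel launch):

#define NUM_VARS 18

__global__ void kernelA_staged(double *vars /*, ... */)
{
    extern __shared__ double tile[];                      // blockDim.x * NUM_VARS doubles
    unsigned base = blockIdx.x * blockDim.x * NUM_VARS;   // this block's slice of vars

    // Coalesced load: consecutive threads read consecutive doubles.
    for (int i = threadIdx.x; i < blockDim.x * NUM_VARS; i += blockDim.x)
        tile[i] = vars[base + i];
    __syncthreads();

    // Each thread now pulls its own 18 values out of fast shared memory.
    double a = tile[threadIdx.x * NUM_VARS + 0];
    double b = tile[threadIdx.x * NUM_VARS + 1];
    // ... maths on a, b, ... as before ...

    // Write back the same way: per-thread into shared memory, then coalesced to global.
    tile[threadIdx.x * NUM_VARS + 0] = a;
    tile[threadIdx.x * NUM_VARS + 1] = b;
    __syncthreads();
    for (int i = threadIdx.x; i < blockDim.x * NUM_VARS; i += blockDim.x)
        vars[base + i] = tile[i];
}

Launched as something like kernelA_staged<<<grid, block, block.x * NUM_VARS * sizeof(double)>>>(vars); with 64 threads per block that is about 9 KB of shared memory, which fits in the 16 KB per block on a C1060. Some shared-memory bank conflicts remain, but shared memory is far faster than uncoalesced global memory access.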

Yeah, those 18 doubles as local variables alone will require 36 registers per thread (assuming the compiler actually decides to hold them all in registers at the same time, which is unlikely).

But if you're not extra careful (the default register limit is 64, I think), this will spill to local memory. You might be able to fit some of these locals into shared memory - it may reduce register pressure quite a bit.

Are you sure that all these state variables need double precision?

Can you give us some stats about shared and local memory usage and registers for Kernel A? Also, what are the block dimensions you're using currently? If you're using CUDA 2.3 you'll find the stats in the .cubin files (when you use the --keep option); on toolkit 3.0 you will have to turn the PTXAS assembler to verbose mode somehow to get the stats. Ah, the CUDA profiler will also report these values.

Christian

In my experience, I don't remember seeing a linear correspondence between the local variables and registers used per thread…

deleted my posting, I misread something

Thank you for your response! :)

How is this usually achieved? With tricky offsets or a different data-layout?

At the moment I’m outputting the state of 1 of those 18 variables, for each thread, every so often into a file. Then, after my CUDA program has completed, I plot this data using gnuplot. In the future I’d like to be able to use OpenGL to render it on the card directly but I don’t know any OpenGL and also I will probably need to output data in some form for more complex rendering (like some neat 3D visualisations in a rendering app like Autodesk 3ds max, for example).

I do like your idea, but I’ll have to learn OpenGL(/[GLU/GLEW]?) first!

YES, thank you! It just ‘clicked’ in my mind - I see it now!! :D

Can you please confirm this is correct:

  • So threads 0-15 will, on the first line, attempt to read/write to array indexes 0*20, 1*20, 2*20, … 15*20 and this will be split into multiple reads/writes because the memory locations are too far apart!

  • The total ‘span’ of these memory locations is what the Programming Guide refers to as ‘the segment’

  • Because I’m dealing with doubles (8 bytes each) my segment size needs to be exactly 128 bytes to coalesce

  • My segment, at the moment, is of size 15*20=300 bytes


Yes. Each half warp (16 threads) will issue a memory load request. Because the width of the request doesn't fit inside a contiguous, 16-word-aligned, 16-word-long segment (with a word size of 32, 64, or 128 bits), there can be no coalesced access, and the request is decomposed into a series of separate transactions to service the load request. In the worst case, this leads to 16 serial transactions, although on recent hardware it can be fewer than that.

Edited. Yes, in the v3.0 programming guide that is what they mean - a segment is the address range of a single half-warp read request.

Almost. It has to be 128 bytes and it has to be aligned to a 128 byte boundary in memory.

No. The 'segment' in that code snippet you quoted is 16*20*sizeof(double) = 2560 bytes.

Nice - thanks!

I don’t follow - how will placing them in shared memory (for the duration of the kernel execution) coalesce global memory access?

The function compiles to use 124 registers. Resultant occupancy is 25%, IIRC. I’ve put the ptxas info in my original post, but it seems hardly any local memory spillage occurs, if any:

kernelA - ptxas info : Used 124 registers, 104+0 bytes lmem, 28+16 bytes smem, 132 bytes cmem[1]

What do you mean here by ‘non-overlapping lifespan’ of variables?

I like the idea of putting variables in shared memory but, unfortunately, shared memory variables, according to the programming guide, only have the lifespan of the kernel execution…

Huh!? Which numbers are you talking about? How did you get 0.8 MB/s?

Sorry, I don’t understand what you’re saying here?

The interim results are not checked, processed or even looked at. They are only used after the CUDA program finishes, for plotting (the output of the program is passed to gnuplot).

I’d love to do this on the card, but I do not know OpenGL… yet! Regardless, even if done on the card, I’d still need to print statistics out every so often in this manner anyway (for manual analysis).

Thank you for your thoughts/ideas. They are much appreciated :)