Transfer-Bound Application: Looking for ideas to speed it up

Update:

I implemented asynchronous copying using a stream (hopefully correctly!?).

I made a cudaStream_t, used cudaMallocHost() for the host array (as opposed to regular malloc()), used cudaMemcpyAsync() (as opposed to plain cudaMemcpy()), and passed the stream to the kernel calls and to cudaMemcpyAsync().
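Roughly, the structure now looks like this (a simplified sketch; NUM_STEPS, F, the launch configuration and the kernel names are placeholders, and error checking is omitted):

cudaStream_t stream;
cudaStreamCreate(&stream);

const int N = 4096;                                     // number of doubles copied back each time
double *h_data, *d_data;
cudaMallocHost((void**)&h_data, N * sizeof(double));    // pinned host memory instead of malloc()
cudaMalloc((void**)&d_data, N * sizeof(double));

dim3 grid(N / 128), block(128);                         // placeholder launch configuration
for (int step = 0; step < NUM_STEPS; ++step) {
    kernelA<<<grid, block, 0, stream>>>(d_data);        // kernels queued on the same stream
    kernelB<<<grid, block, 0, stream>>>(d_data);
    if (step % F == 0)                                  // copy back every F-th step
        cudaMemcpyAsync(h_data, d_data, N * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);                          // make sure the last copy has finished

cudaFreeHost(h_data);
cudaFree(d_data);
cudaStreamDestroy(stream);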

[quote name='SPWorley' date='Apr 6 2010, 01:10 PM' post='1034639']

So you’d set up a stream, and queue up your two kernels, then your ASYNCHRONOUS memory copy back to host, then your “copy done” event, then the two kernels again. Your CPU keeps polling for the mem transfer completion and prints the status as soon as it’s ready… and likely then inserting the next memcopy, event signal, and 2 kernel launches again.

[/quote]

I didn’t have any events, nor did I re-queue the kernels as you suggest here, SPWorley. Have I done this incorrectly?
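For reference, here is roughly how I read that suggestion (a sketch with hypothetical names; not what I currently have):

cudaEvent_t copyDone;
cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

kernelA<<<grid, block, 0, stream>>>(d_data);
kernelB<<<grid, block, 0, stream>>>(d_data);
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(copyDone, stream);            // marks the point at which the copy has finished
kernelA<<<grid, block, 0, stream>>>(d_data);  // next iteration is already queued behind it
kernelB<<<grid, block, 0, stream>>>(d_data);

// CPU polls for completion of the copy and handles the data as soon as it arrives
while (cudaEventQuery(copyDone) == cudaErrorNotReady) {
    /* do other CPU work, e.g. write the previous chunk to disk */
}
// h_data is now valid: print it, then queue the next memcopy, event record and kernel launches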

Meanwhile, here are some results:

Printing every 100th step (F=100):

– non-async = 103s

– with-async = 93s

Performance increase: 10.75%

Printing every 10th step (F=10):

– non-async = 213s

– with-async = 123s

Performance increase: 73.17%

Printing every step (F=1):

– non-async = 567s

– with-async = 516s

Performance increase: 9.88% [Note: HDD capacity reached here so that may have something to do with this odd result]

You are writing the intermediate data to disk? No need to optimize anything in CUDA then. Your disk access is the sole bottleneck! This would also explain the catastrophic transfer rates you were getting.

In case you fix this and still want to optimize the CUDA part:

You can then order the memory accesses so that global memory access is coalesced, and any reordering occurs while accessing the (orders of magnitude faster) shared memory. Like this (assuming a block size of at least 4*NUM_VARS):

#define NUM_VARS 18 // (I was rounding to 20 in previous posts)
#define BLK_VARS 4  // chosen to give no bank conflicts in shared memory but maximize global memory
                    // access width. BLK_VARS==8 may be faster even though bank conflicts occur

// the size of the 'vars' array is always NUM_VARS*NUM_THREADS and it resides in global memory
__global__ void kernelA(double* vars, ...)
{
  unsigned index = (blockIdx.x * blockDim.x + threadIdx.x) * NUM_VARS;
  double a, b, c, d, e, f, g, ...; // ~100 doubles declared
  __shared__ double temp[BLK_VARS * NUM_VARS];

  // Load the current value of each variable through shared memory for coalesced global memory access:
  for (unsigned int thread = 0; thread < blockDim.x; thread += BLK_VARS) {
    if (threadIdx.x < BLK_VARS * NUM_VARS)
      temp[threadIdx.x] = vars[(blockIdx.x * blockDim.x + thread) * NUM_VARS + threadIdx.x];
    __syncthreads();

    if ((threadIdx.x >= thread) && (threadIdx.x < thread + BLK_VARS)) {
      a = temp[(threadIdx.x - thread) * NUM_VARS];
      b = temp[(threadIdx.x - thread) * NUM_VARS + 1];
      ...
      n = temp[(threadIdx.x - thread) * NUM_VARS + 17];
    }
    __syncthreads();
  }

  /*
   * LOTS OF CALCULATIONS/MATH which modify the 18 variables a-n
   */

  // Save the current value of each variable through shared memory for coalesced global memory access:
  for (unsigned int thread = 0; thread < blockDim.x; thread += BLK_VARS) {
    if ((threadIdx.x >= thread) && (threadIdx.x < thread + BLK_VARS)) {
      temp[(threadIdx.x - thread) * NUM_VARS]      = a;
      temp[(threadIdx.x - thread) * NUM_VARS + 1]  = b;
      ...
      temp[(threadIdx.x - thread) * NUM_VARS + 17] = n;
    }
    __syncthreads();

    if (threadIdx.x < BLK_VARS * NUM_VARS)
      vars[(blockIdx.x * blockDim.x + thread) * NUM_VARS + threadIdx.x] = temp[threadIdx.x];
    __syncthreads();
  }
}

int main()
{
  ...
  assert((BLOCKSIZE % BLK_VARS == 0) && (BLOCKSIZE >= BLK_VARS * NUM_VARS)); // assumptions made in kernelA
  ...
}

The important metric is how often the lmem gets accessed, not how many bytes of lmem are used. So you might still incur a significant slowdown through register spills.

If at any time during kernel execution only either variable A or variable B exists, but not both, the compiler can place them in the same register.

Automatic variables also have (at most) the lifespan of the kernel, so there is no loss here.
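For example (a toy sketch, not from the actual application): in the kernel below, 'a' is dead before 'b' is first written, so the compiler is free to keep both in the same physical register.

__global__ void lifetimes(float* out, const float* in)
{
    float a = in[threadIdx.x] * 2.0f;
    out[threadIdx.x] = a;                 // last use of 'a': its live range ends here
    float b = in[threadIdx.x] + 1.0f;     // 'b' can reuse the register that held 'a'
    out[threadIdx.x + blockDim.x] = b;
}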

OK, re-reading the post I arrive at about 6.5 MiB/s. Originally I had assumed the array size to be 4096 bytes, but it seems to be 4096 doubles.

According to your initial post, the difference between writing out the intermediate results in every step and not writing them out at all is 567s - 91s = 476s. This difference is assumed to be caused by the PCIe transfer of the data (note that this assumption is wrong! it is actually caused by your hard disk). During this time 100,000 * 4096 * 8 bytes ≈ 3.05 GiB get transferred, which leads to a transfer rate of about 3.05 GiB / 476 s ≈ 6.5 MiB/s.

6.5 MiB/s actually seems like a reasonable transfer rate for a hard disk, assuming that occasional seeks interrupt the sequential writes (are you on Windows? I’d expect the common Linux filesystems to perform a bit better here…)

If your production runs increase the array size from 4096 to 1024 * 1024 * 1024, they will need to transfer 262,144 times more data (I wonder how you want to store that on your hard disk at all…).

As posted earlier, this is the main problem. How long does it take you to analyze the data?

I guess it is much faster to just recalculate it during analysis than to write it out to disk and read it back.

Thank you very much for that! I blindly implemented it, with BLK_VARS=4, and this was my result for a particular run:

Before: 91s

After: 102s

So it seems to be slower, unfortunately.

However, I don’t actually understand what’s happening / why you’ve done it this way. For example, when loading from gmem to smem:

#define BLK_VARS 4 // Why 4?

__global__ void kernelA(...)
{
  __shared__ double tmp[BLK_VARS * NUM_VARS];  // Shouldn't this be blockDim.x*BLK_VARS*NUM_VARS?

  for (unsigned int thread = 0; thread < blockDim.x; thread += BLK_VARS) {
    if (threadIdx.x < BLK_VARS * NUM_VARS) {
      // I presume that in combination with the loop the compiler will generate code such
      // that accesses are coalesced?
      tmp[threadIdx.x] = cells[(blockIdx.x * blockDim.x + thread) * NUM_VARS + threadIdx.x];
    }
    __syncthreads();

    if ((threadIdx.x >= thread) && (threadIdx.x < thread + BLK_VARS)) {  // ?
      a = tmp[(threadIdx.x - thread) * STATE_SIZE + 0];
      b = tmp[(threadIdx.x - thread) * STATE_SIZE + 1];
      ...
    }
  }
}

Ah, I see, interesting!

Edit: I removed one of the 18 vars and re-compiled and lmem usage dropped from 104 to 88 bytes. So, if I can somehow reduce it by 7 then I should have no lmem spillage? But do you think it’ll make a big difference?

Got it, thanks.

I’m on Ubuntu 9.04

Yes… this will be an interesting challenge for us!

All the data analysis will be done after the program has run and will probably be done manually. I’m now thinking I may run the program without printing to disk, and then if I detect something of interest (or SEE it, if I manage to get OpenGL rendering working) I’ll re-run and output to disk for analysis.

Thank you again for your reply and very useful input and ideas, tera! :)

I’ll have a look at it later. Maybe I made a mistake somewhere. Does it give the same result as the previous version?

Was this a run without any intermediate writeout to harddisk? We should make sure we are not measuring fluctuations in disk access timing… How stable are the times if you rerun for exactly the same parameters?

And have we demonstrated that we are optimizing the right part of the kernel? How much time does one run take if you replace the “LOTS OF CALCULATIONS/MATH which modify the 18 variables a-n” part with something trivial like “a+=1; b+=1; … n+=1;”?

Looking at the numbers again, the initial read and final write of vars aren’t the bottleneck in your calculations. Even with uncoalesced accesses, they only take a fraction of a second.

It doesn’t give the same result, but I don’t understand the code well enough yet to repair the issue.

Yes, these numbers are for runs which do not write to disk, or even memcpy D2H/H2D. Just computation. Here are the results of a run (4096 threads per kernel call [@ 128 TPB], kernel called 100,000 times), before and after applying your smem coalescing strategy:

Compilation info:

ptxas info : Compiling entry function ‘_Z5tT_1DPdS_S_j’

ptxas info : Used 124 registers, 112+0 bytes lmem, 608+16 bytes smem, 136 bytes cmem[1]

Before:

Running benchmark…

1=90536.3 2=90534.6 3=90535.7

Average Time: 90.5355 seconds.

After:

Running benchmark…

1=103065 2=103065 3=103065

Average Time: 103.065 seconds.

With -use_fast_math:

Running benchmark…

1=103065 2=103065 3=103065 // <— … haha!? How can this be so consistent??

Average Time: 103.065 seconds.

The calculations/math involve the 18 ‘vars’ which are read/written to/from gmem and also ~100 local variables for code legibility. I commented out all the local variable operations such that I just had something trivial:

// Using the same code as the 103s run (i.e. your smem coalescing suggestion) for the read/write to/from gmem
__global__ void kernelA(...)
{
  // Read from gmem
  a += 1;
  b += 1;
  ...
  n += 1;
  // Write to gmem
}

Compilation info:

ptxas info : Compiling entry function ‘_Z5tT_1DPdS_S_j’

ptxas info : Used 46 registers, 608+16 bytes smem, 16 bytes cmem[1] // Note: NO lmem usage now…

Running benchmark…

1=10267.8 2=10268.1 3=10267.4

Average Time: 10.2678 seconds.

Then I thought I’d try leaving all the math involving the ~100 local variables in place and just replacing 12 of the vars with ‘varN += 1’. I suppose the compiler optimized away the unused local variables that would ordinarily have been involved in calculating each of those vars:

ptxas info : Compiling entry function ‘_Z5tT_1DPdS_S_j’

ptxas info : Used 105 registers, 608+16 bytes smem, 84 bytes cmem[1]

Running benchmark…

1=28678

Average Time: 28.678 seconds.

And so it seems, amazingly, that the calculations are taking up a huge proportion of execution time!? What do you think??

To give you an idea of the calculation code, there are probably ~150 exp() calls, ~500 multiplications, and a handful of divisions and sqrt()s. Recall that I’m using SM_13 and the double type [out of necessity].

Yes, that’s exactly what I think - I’m not really surprised. I don’t have a reference at hand, but as far as I remember the special function unit is only single precision, so all the exp(), sqrt() and divisions translate to expensive library calls. I don’t know if Fermi’s SFUs are double precision. If they are, it might be worth grabbing one even if GTX470/480 are limited to 1/4 of Fermi’s max double precision throughput.

Problems with many exp() calls often allow for some optimizations - have you taken full advantage of that? E.g. if you evaluate exp() at many equidistant points, you need just two exp() calls and can break all others down to multiplications.

Maybe you can sometimes reduce the need for double precision by operating on small differences from some reference value?
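As a hypothetical illustration (the names and x_ref are made up): if all values cluster around some reference value, the offsets from it are small and can often be carried in single precision, with the one subtraction done in double.

__global__ void toOffsets(const double* x, float* dx, double x_ref, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dx[i] = (float)(x[i] - x_ref);   // subtract in double precision once,
                                         // then continue working on the small offset in float
}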

It’s difficult to give advice here without an idea of what the code looks like.

I have a C2050 on order! ;)

I have not yet resorted to double-specific optimizations, but it seems that I’ll have to do so now. I’m not aware of any techniques. Do you happen to have any links on-hand to the one you describe? If not, don’t go to any trouble to find any; I’ll have to look into it quite extensively it seems.

I understand. I really appreciate your help. Thanks once again :)

You lucky guy! :)

No, I don’t have any references. I was just referring to the decomposition y(n) = exp(x0 + n*d) = exp(x0) * (exp(d))^n = exp(d) * y(n-1).
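In code, a minimal sketch of that recurrence (a hypothetical helper, double precision):

// y[i] = exp(x0 + i*d) for i = 0..n-1, using only two exp() calls
__device__ void exp_equidistant(double x0, double d, double* y, int n)
{
    double step = exp(d);     // first exp()
    double val  = exp(x0);    // second (and last) exp()
    for (int i = 0; i < n; ++i) {
        y[i] = val;
        val *= step;          // exp(x0 + (i+1)*d) = exp(x0 + i*d) * exp(d)
    }
}

(The repeated multiplications accumulate a little rounding error, so it may be worth restarting the recurrence with a fresh exp() every so often.)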

So your kernelB seems to be so complex that it uses more than 124 registers and creates local memory spillage. Would it be possible to split up that long computation into two smaller kernels?
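For illustration, a rough sketch of what such a split might look like (hypothetical kernel names; the real split would have to follow the dependencies between your ODEs):

#define NUM_VARS 18

__global__ void kernelB_part1(double* vars)
{
    unsigned base = (blockIdx.x * blockDim.x + threadIdx.x) * NUM_VARS;
    double a = vars[base], b = vars[base + 1];      // load only what part 1 needs
    /* ... first half of the math ... */
    vars[base]     = a;                             // hand the updated state over via gmem
    vars[base + 1] = b;
}

__global__ void kernelB_part2(double* vars)
{
    unsigned base = (blockIdx.x * blockDim.x + threadIdx.x) * NUM_VARS;
    double c = vars[base + 2], d = vars[base + 3];  // load what part 2 needs
    /* ... second half of the math, possibly also reading part 1's results ... */
    vars[base + 2] = c;
    vars[base + 3] = d;
}

// Launched back-to-back in the same stream, part 2 sees part 1's writes:
//   kernelB_part1<<<grid, block, 0, stream>>>(d_vars);
//   kernelB_part2<<<grid, block, 0, stream>>>(d_vars);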

In some cases you can also reduce register usage slightly (sometimes by 10 to 20%) by declaring local variables volatile, especially index variables and other temporaries that require several steps to compute. By default the compiler inlines such temporaries, in some cases creating redundant computation and using more registers in the process. Making them volatile forces the compiler to compute the temporary right away and hold it in a register. This can be advantageous if the temporary is used several times afterwards, especially in tight inner loops. So try playing with the volatile keyword to check what effect it has on your register count (and local memory use).
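A toy example of what I mean (hypothetical code, not your actual kernel):

__global__ void someKernel(double* vars)
{
    // 'volatile' makes the compiler evaluate the index expression once and keep the
    // result in a register, instead of re-inlining (and recomputing) it at every use.
    volatile unsigned base = (blockIdx.x * blockDim.x + threadIdx.x) * 18;

    double a = vars[base];
    double b = vars[base + 1];
    /* ... many uses of 'base' in the inner loops ... */
    vars[base]     = a + b;
    vars[base + 1] = a - b;
}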

Christian

Yes, this is something I’m looking into. It’ll be difficult though, as there are a lot of ODEs which depend on one another.

Nice tip - thanks Christian!

Update:

I broke my large kernel into 9 kernels, tried many permutations of 2-8 kernels, and ran tests. I found that the single kernel outperformed them in all cases, even though the smaller kernels had more threads per block! The fewer kernels, the better. This also appeared to be the case for large-ish matrices (65K elements; I didn’t try anything bigger because each test took 250 seconds at that size!).

I was surprised that even though occupancy and TPB increased, the overall speed decreased. 64 TPB was about 10% faster than 512 TPB! :o