Significant speedup with vector types - why?

Hi everyone!

I’m working on the performance analysis of a simple Monte Carlo application for financial risk evaluation. I’m observing a significant speed-up when using vector types. I cannot explain this observation, as my GPU (Quadro FX 5600, G80) does not have vector units. Does someone has an idea of how to explain this?

The applications calculates ~100k identical simulations. Each simulation consists of generating random parameters (param) and an analysis of data under these parameters, i.e. a big sum of many sum += f( data[i], param ). I use vectors of different sizes to execute several simulations at once, i.e. sum and param are vectorized while data is not.

Furthermore, I use a blocked single buffer scheme to acces data, and I can vary the number of simulations executed by one work-item (in order to share global memory access). If there are further questions of what the algorithm does, I’ll be glad to answer them. Here are the most important parts of the code:

#if SIZE==1

#define FLOAT float

#define FIXED uint

#define SFIXED int

#define STRIDESHIFT 2

#define SELECT(_a, _b, _c) (_c ? _b : _a)

#define CONVERT_FLOAT convert_float

#endif

#if SIZE==2

#define FLOAT float2

#define FIXED uint2

#define SFIXED int2

#define STRIDESHIFT 3

#define SELECT(_a, _b, _c) bitselect(_a, _b, as_float2(_c))

#define CONVERT_FLOAT convert_float2

#endif

// and so on for SIZE = 4 8 16

__inline FLOAT EVALUATE_POSITION( int   type,

								  int   dayOfMaturity,

								  FLOAT stockIndexValue,

								  FLOAT foreignInterestRate,

								  FLOAT exchangeRate,

								  int   today ) {

	switch( type ) {

		case POSITION_TYPE_STOCK:

			return stockIndexValue;

		case POSITION_TYPE_FBOND:

			return exchangeRate * 100 *

				exp( - ( foreignInterestRate / 100 ) * CONVERT_FLOAT( (SFIXED)( dayOfMaturity - today ) ) );

		default:

			return FLT_MAX;

	}

}

__kernel void

varmc_kernel(					   int	n_simulations,

									int	n_positions,

				   __constant const float *dm_sqrtAddCovar,

				   __constant const float *dm_muPlusParam,

				   __global   const int   *dm_type,

				   __global   const int   *dm_dayOfMaturity,

				   __global   const int   *dm_amount,

				   __global		 FLOAT *dm_answer ) {

	// Calculate simulated market parameters

	FLOAT stockIndexValue[SIM_BLOCK_SIZE], foreignInterestRate[SIM_BLOCK_SIZE], exchangeRate[SIM_BLOCK_SIZE];

	gen_market_params( stockIndexValue, foreignInterestRate, exchangeRate,

					   dm_sqrtAddCovar, dm_muPlusParam );

	float today = dm_muPlusParam[N_MARKET_PARAM];

	// Calculate sum of position values

	__private FLOAT pm_sum[SIM_BLOCK_SIZE];

	for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {

		pm_sum[i_sim] = ZERO;

	}

	

	for( int start = 0; start < n_positions; start += POS_BLOCK_SIZE ) {

		int block_size = min( POS_BLOCK_SIZE, n_positions - start );

		

		// Prefetch data for the next POS_BLOCK_SIZE positions

		__local int sm_type[POS_BLOCK_SIZE], sm_dayOfMaturity[POS_BLOCK_SIZE];

		event_t event;

		event = async_work_group_copy( &(sm_type[0]),		  &(dm_type[start]),		  (size_t)block_size, (event_t)0 );

		event = async_work_group_copy( &(sm_dayOfMaturity[0]), &(dm_dayOfMaturity[start]), (size_t)block_size, event );

		wait_group_events( 1, &event );

		

		// Calculate next POS_BLOCK_SIZE positions

		for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {

			FLOAT sum = ZERO;

			for( int i = 0; i < block_size; i++ ) {

				sum += EVALUATE_POSITION( sm_type[i], sm_dayOfMaturity[i],

										  stockIndexValue[i_sim], foreignInterestRate[i_sim], exchangeRate[i_sim], today );

			}

			pm_sum[i_sim] += sum;

		}

	}

	// Write to global memory

	for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {

		dm_answer[get_global_id(0) * SIM_BLOCK_SIZE + i_sim] = pm_sum[i_sim];

	}

}

Here is a test run with SIM_BLOCK_SIZE = 16, POS_BLOCK_SIZE = 64, n_positions = 2^17 and get_global_size(0) = 1. I use a throughput metric to evaluate the performance.

vwidth 1						

mpps (million positions per second): 1.595055

vwidth 2						

mpps (million positions per second): 2.550010

vwidth 4						

mpps (million positions per second): 3.553236

vwidth 8						

mpps (million positions per second): 3.566150

vwidth 16					   

mpps (million positions per second): 3.209080

As you can see, float4 and float8 perform best – more than twice as fast as float. But I can’t find a coherent explanation why. I don’t get a speed-up because of vector units, as the GPU doesn’t have any! What is it then? Are the calculations of a vector operation spread over multiple SIMT threads? Are the vector types scalarized and just share some overhead (as branches, memory access, common calculations, …)? Any ideas?

Thanks in advance for any pointers :-)

Cheers,

Ingo

Better memory access and loop unrolling. Looks like your writes are not coalesced, so whan you write more data, you get speedup. You need either get Fermi or rewrite data store for best performance.

Hi Lev,

Thanks a lot for your answer! Please view my comments below.

I though of that explanation. But I exclusively work on local and private memory. Coalescing happens for access to global memory, not local or private memory, right?

Furthermore, there is only one work-item. I know coalescing as grouping memory accesses of different work-items. How and when does coalescing work with vectors?

Just to make sure, here is some profile data of openclprof:

# OPENCL_PROFILE_LOG_VERSION 2.0

# OPENCL_PROFILE_CSV 1

# TIMESTAMPFACTOR fffff723820d3920

# OPENCL_DEVICE 0 Quadro FX 5600

gpustarttimestamp,method,gputime,cputime,occupancy,streamID,

memtransfersize,memtransferhostmemtype,branch,divergent_branc

h,instructions,warp_serialize,gld_incoherent,gld_coherent,gst

_incoherent,gst_coherent,local_load,local_store,tex_cache_hit

,tex_cache_miss

# float

11bffd882dfd37c0,varmc_kernel,1.35942e+06,1.35965e+06,0.042,

1,,0,12042257,0,46862558,0,368640,73728,30,4,2162704,131328,0

,0

# float2

11bffd94138258c0,varmc_kernel,1.53843e+06,1.53934e+06,0.042,

1,,0,12058641,0,60576093,0,368640,73728,30,8,2162704,262656,0

,0

# float4

11bffda7e85a0440,varmc_kernel,1.91607e+06,1.91695e+06,0.042,

1,,0,12058641,0,92164577,0,368640,73728,30,16,4325408,525312,

0,0

#float8

11bffdac6b2ba9a0,varmc_kernel,3.22732e+06,3.22824e+06,0.042,

1,,0,12058641,0,156488536,0,368640,73728,60,32,8650816,105062

4,0,0

# float16

11bffdb105f10f80,varmc_kernel,7.11611e+06,7.11757e+06,0.042,

1,,0,12058641,0,276780680,0,368640,73728,120,64,19615875,3313

936,0,0

The number of uncoalesced memory loads is constant (368640), the number of uncoalesced stores is insignificant (<=64).

The same phenomenon happens on my GeForce GTX 480. Here is an example run with the same parameters as above:

vwidth 1

mpps (million positions per second): 3.480973

vwidth 2

mpps (million positions per second): 6.374857

vwidth 4

mpps (million positions per second): 9.637150

vwidth 8

mpps (million positions per second): 12.202376

vwidth 16

mpps (million positions per second): 13.674197

Any other ideas?

Cheers,

Ingo

Need to know what is hot spot of the program. Another idea is better occupancy of GPU.
Btw, how do you change your block an grid size with using vectors?

The hotspot of the program is the loop starting with “for( int start = 0; start < n_positions; start += POS_BLOCK_SIZE ) {”, and more precisely the call to “EVALUATE_POSITION”.

Note that I intentionally have only one work-item, i.e. get_global_size(0) = 1. This allows me to understand better what’s happening inside one work-item. If I use many of these work-items, occupancy will increase to a normal level, but float4 will still be faster than float. My intention here is to understand why that is.

POS_BLOCK_SIZE and SIM_BLOCK_SIZE are macros.

It is bad approach.

Could you elaborate please?

Of course you will get performance speedup if run one workitem and perform more computations with it. GPU works with 1 item as fast as with 1000 items. Ruuning 1 work item is pointless.