Hi everyone!
I’m working on the performance analysis of a simple Monte Carlo application for financial risk evaluation. I’m observing a significant speed-up when using vector types, which I cannot explain: my GPU (a Quadro FX 5600, G80) does not have vector units. Does anyone have an idea of how to explain this?
The application runs ~100k identical simulations. Each simulation consists of generating random parameters (param) and analyzing the data under these parameters, i.e. one big reduction of the form sum += f( data[i], param ). I use vector types of different widths to execute several simulations at once, i.e. sum and param are vectorized while data is not.
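Schematically (not the real code, just the idea; f, f4 and the variable names here are placeholders):

// one simulation per pass over the data
float sum = 0.0f;
for( int i = 0; i < n; i++ )
    sum += f( data[i], param );        // param: one set of simulation parameters

// SIZE simulations per pass, e.g. SIZE == 4: sum and param become float4,
// while data[i] stays scalar and is shared by all four simulations
float4 sum4 = (float4)( 0.0f );
for( int i = 0; i < n; i++ )
    sum4 += f4( data[i], param4 );     // f4: component-wise version of f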
Furthermore, I use a blocked single-buffer scheme to access the data, and I can vary the number of simulations executed by one work-item (in order to share global memory accesses). If there are further questions about what the algorithm does, I’ll be glad to answer them. Here are the most important parts of the code:
#if SIZE==1
#define FLOAT float
#define FIXED uint
#define SFIXED int
#define STRIDESHIFT 2
#define SELECT(_a, _b, _c) (_c ? _b : _a)
#define CONVERT_FLOAT convert_float
#endif

#if SIZE==2
#define FLOAT float2
#define FIXED uint2
#define SFIXED int2
#define STRIDESHIFT 3
#define SELECT(_a, _b, _c) bitselect(_a, _b, as_float2(_c))
#define CONVERT_FLOAT convert_float2
#endif

// and so on for SIZE = 4, 8, 16

__inline FLOAT EVALUATE_POSITION( int type,
                                  int dayOfMaturity,
                                  FLOAT stockIndexValue,
                                  FLOAT foreignInterestRate,
                                  FLOAT exchangeRate,
                                  int today ) {
    switch( type ) {
    case POSITION_TYPE_STOCK:
        return stockIndexValue;
    case POSITION_TYPE_FBOND:
        return exchangeRate * 100 *
               exp( - ( foreignInterestRate / 100 ) * CONVERT_FLOAT( (SFIXED)( dayOfMaturity - today ) ) );
    default:
        return FLT_MAX;
    }
}

__kernel void
varmc_kernel( int n_simulations,
              int n_positions,
              __constant const float *dm_sqrtAddCovar,
              __constant const float *dm_muPlusParam,
              __global const int *dm_type,
              __global const int *dm_dayOfMaturity,
              __global const int *dm_amount,
              __global FLOAT *dm_answer ) {
    // Calculate simulated market parameters
    FLOAT stockIndexValue[SIM_BLOCK_SIZE], foreignInterestRate[SIM_BLOCK_SIZE], exchangeRate[SIM_BLOCK_SIZE];
    gen_market_params( stockIndexValue, foreignInterestRate, exchangeRate,
                       dm_sqrtAddCovar, dm_muPlusParam );
    int today = (int)dm_muPlusParam[N_MARKET_PARAM];
    // Calculate sum of position values
    __private FLOAT pm_sum[SIM_BLOCK_SIZE];
    for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {
        pm_sum[i_sim] = ZERO;
    }

    for( int start = 0; start < n_positions; start += POS_BLOCK_SIZE ) {
        int block_size = min( POS_BLOCK_SIZE, n_positions - start );

        // Prefetch data for the next POS_BLOCK_SIZE positions
        __local int sm_type[POS_BLOCK_SIZE], sm_dayOfMaturity[POS_BLOCK_SIZE];
        event_t event;
        event = async_work_group_copy( &(sm_type[0]), &(dm_type[start]), (size_t)block_size, (event_t)0 );
        event = async_work_group_copy( &(sm_dayOfMaturity[0]), &(dm_dayOfMaturity[start]), (size_t)block_size, event );
        wait_group_events( 1, &event );

        // Calculate next POS_BLOCK_SIZE positions
        for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {
            FLOAT sum = ZERO;
            for( int i = 0; i < block_size; i++ ) {
                sum += EVALUATE_POSITION( sm_type[i], sm_dayOfMaturity[i],
                                          stockIndexValue[i_sim], foreignInterestRate[i_sim], exchangeRate[i_sim], today );
            }
            pm_sum[i_sim] += sum;
        }
    }

    // Write to global memory
    for( int i_sim = 0; i_sim < SIM_BLOCK_SIZE; i_sim++ ) {
        dm_answer[get_global_id(0) * SIM_BLOCK_SIZE + i_sim] = pm_sum[i_sim];
    }
}
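For reference, SIZE (and the block sizes) are fixed when the program is built, e.g. via host-side build options like these (simplified sketch, my actual host code looks a bit different):

// build one program variant per vector width
const char *options = "-D SIZE=4 -D SIM_BLOCK_SIZE=16 -D POS_BLOCK_SIZE=64";
clBuildProgram( program, 1, &device, options, NULL, NULL );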
Here is a test run with SIM_BLOCK_SIZE = 16, POS_BLOCK_SIZE = 64, n_positions = 2^17 and get_global_size(0) = 1. I use a throughput metric to evaluate the performance.
vwidth   mpps (million positions per second)
     1   1.595055
     2   2.550010
     4   3.553236
     8   3.566150
    16   3.209080
As you can see, float4 and float8 perform best, more than twice as fast as plain float. But I can’t find a coherent explanation for why. The speed-up cannot come from vector units, since the GPU doesn’t have any! So what is it? Are the calculations of one vector operation spread over multiple SIMT threads? Are the vector types scalarized, so that the components just share some overhead (branches, memory accesses, common calculations, …)? Any ideas?
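To make the second hypothesis concrete: by “scalarized” I mean that the compiler might expand the float4 inner loop into per-component scalar code, roughly like this (pure speculation on my part; evaluate_scalar stands for a scalar version of EVALUATE_POSITION):

for( int i = 0; i < block_size; i++ ) {
    // the loop counter, the __local loads and the branch on the position type
    // would then be paid once for four simulations
    int t = sm_type[i];
    int d = sm_dayOfMaturity[i];
    sum.s0 += evaluate_scalar( t, d, stockIndexValue[i_sim].s0, foreignInterestRate[i_sim].s0, exchangeRate[i_sim].s0, today );
    sum.s1 += evaluate_scalar( t, d, stockIndexValue[i_sim].s1, foreignInterestRate[i_sim].s1, exchangeRate[i_sim].s1, today );
    sum.s2 += evaluate_scalar( t, d, stockIndexValue[i_sim].s2, foreignInterestRate[i_sim].s2, exchangeRate[i_sim].s2, today );
    sum.s3 += evaluate_scalar( t, d, stockIndexValue[i_sim].s3, foreignInterestRate[i_sim].s3, exchangeRate[i_sim].s3, today );
}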
Thanks in advance for any pointers :-)
Cheers,
Ingo