Hey njuffa,
Finally got around to testing my dot product issue. It's more sinister than I thought:
The dot product itself is not the issue. It performs well enough: roughly 1% of the total execution time of one iteration of my code. What DOES happen is that the rest of my code slows down when I use cuBLAS' dot product instead of my custom kernel and, more importantly, the run time of exactly the same code becomes much, much more variable from one execution to the next.
I do have one question: is repeatedly initializing cuBLAS/cuSPARSE handles not recommended?
Here’s the code I’m using. darray is a simple device array class. My own generalized dotproduct code is commented out, obviously.
darray gendot(const darray &vec1, const darray &vec2)
{
    int m = vec1.dims0;   // length of each column vector
    int n = vec1.dims1;   // number of column pairs to reduce
    darray out(1,n);
    if ((vec1.dims0==vec2.dims0)&&(vec1.dims1==vec2.dims1))
    {
        cublasHandle_t handle;
        CUBLAS(cublasCreate(&handle));
        // Results are written to device memory (out.data), not to the host.
        CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));
        // One sdot per column; check every call like the other cuBLAS calls.
        for (int i=0; i<n; i++)
            CUBLAS(cublasSdot(handle, m, vec1.data+i*m, 1, vec2.data+i*m, 1, out.data+i));
        CUDA(cudaDeviceSynchronize());
        CUBLAS(cublasDestroy(handle));
        //dotprod_run(out.data,vec1.data,vec2.data,m,n);
    } else {
        MSG("Gendot error dims: %d,%d * %d,%d", vec1.dims0,vec1.dims1,vec2.dims0,vec2.dims1);
        throw std::runtime_error("gendot -> dimension mismatch!");
    }
    return out;
}
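For what it's worth, the alternative I have in mind is to create the handle once and pass it into gendot, instead of creating and destroying it on every call. Below is a minimal sketch of that pattern only. The names fake_handle, fake_create, fake_destroy, ScopedHandle and gendot_with are made up so that it builds without the CUDA toolkit. In the real code they would presumably map to cublasHandle_t, cublasCreate and cublasDestroy.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins so the sketch builds without the CUDA toolkit.
// In the real code these would be cublasHandle_t, cublasCreate and cublasDestroy.
struct fake_handle { int id; };
static int g_creates = 0;   // number of fake_create calls so far
static int g_destroys = 0;  // number of fake_destroy calls so far

inline int fake_create(fake_handle **h) {
    *h = new fake_handle{++g_creates};
    return 0;  // 0 plays the role of CUBLAS_STATUS_SUCCESS
}

inline int fake_destroy(fake_handle *h) {
    delete h;
    ++g_destroys;
    return 0;
}

// Creates the handle once and destroys it when the owner goes out of scope.
class ScopedHandle {
public:
    ScopedHandle() {
        if (fake_create(&handle_) != 0)
            throw std::runtime_error("handle creation failed");
    }
    ~ScopedHandle() { fake_destroy(handle_); }
    ScopedHandle(const ScopedHandle &) = delete;
    ScopedHandle &operator=(const ScopedHandle &) = delete;
    fake_handle *get() const { return handle_; }
private:
    fake_handle *handle_ = nullptr;
};

// Placeholder for gendot: takes an existing handle instead of creating one.
// The real version would pass the handle to cublasSdot here.
inline int gendot_with(fake_handle *handle) {
    assert(handle != nullptr);
    return handle->id;
}
```

The owner would live outside the iteration loop (a single `ScopedHandle owner;` at startup), and every call would use `gendot_with(owner.get())`, so the create/destroy pair runs once per program rather than once per call.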
Timings with the cuBLAS version:
Time for iteration 1 : 0.796480 seconds, accuracy so far : 1.163859
Time for iteration 2 : 0.759783 seconds, accuracy so far : 0.497118
Time for iteration 3 : 1.110367 seconds, accuracy so far : 0.391392
Time for iteration 4 : 0.954960 seconds, accuracy so far : 0.154466
Time for iteration 5 : 0.815792 seconds, accuracy so far : 0.093269
Time for iteration 6 : 1.044942 seconds, accuracy so far : 0.084907
Time for iteration 7 : 0.891542 seconds, accuracy so far : 0.036288
Compared to my custom kernel:
Time for iteration 1 : 0.724610 seconds, accuracy so far : 1.163859
Time for iteration 2 : 0.697980 seconds, accuracy so far : 0.497118
Time for iteration 3 : 0.703218 seconds, accuracy so far : 0.391392
Time for iteration 4 : 0.662221 seconds, accuracy so far : 0.154466
Time for iteration 5 : 0.700503 seconds, accuracy so far : 0.093268
Time for iteration 6 : 0.656961 seconds, accuracy so far : 0.084907
Time for iteration 7 : 0.728399 seconds, accuracy so far : 0.036288
Of this, sdot takes up about 15 ms per iteration in version 1, compared to 5 ms or so in version 2 (this includes a cudaDeviceSynchronize() before starting and before stopping the timer). BOTH are negligible, which makes me wonder why the rest of the code gets wobbly with sdot?
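To put a number on "wobbly", here is a small sketch that computes the mean and sample standard deviation of the two sets of iteration times listed above. The helper names (mean, sample_stddev, report, with_cublas, with_custom) are my own, and it assumes the first listing is the cuBLAS run and the second is the custom kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Iteration times in seconds, copied from the two listings above.
static const std::vector<double> with_cublas = {
    0.796480, 0.759783, 1.110367, 0.954960, 0.815792, 1.044942, 0.891542};
static const std::vector<double> with_custom = {
    0.724610, 0.697980, 0.703218, 0.662221, 0.700503, 0.656961, 0.728399};

// Arithmetic mean; expects a non-empty vector.
inline double mean(const std::vector<double> &x) {
    assert(!x.empty());
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Sample standard deviation (n - 1 in the denominator); expects at least two values.
inline double sample_stddev(const std::vector<double> &x) {
    assert(x.size() >= 2);
    const double m = mean(x);
    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return std::sqrt(ss / static_cast<double>(x.size() - 1));
}

// Prints one summary line for a run.
inline void report(const char *label, const std::vector<double> &x) {
    std::printf("%-14s mean %.3f s, stddev %.3f s\n", label, mean(x), sample_stddev(x));
}
```

Calling `report("cuBLAS sdot", with_cublas)` and `report("custom kernel", with_custom)` gives a mean of about 0.91 s with a standard deviation of about 0.13 s for the first run, against about 0.70 s and 0.03 s for the second. So the spread is several times larger with sdot, far more than the 10 ms difference in the dot product itself could explain.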