cutilSafeCall() Runtime API error: Invalid Argument

I’ve been pulling my hair out over the cause of this error in a GMRES CUDA code:

float *reduce_d, *reduce;
cutilSafeCall( cudaHostAlloc( (void**)&reduce, 512 * sizeof(float), cudaHostAllocMapped ) );
for ( int i = 0; i < 512; ++i ) reduce[i] = 0.0f;
cutilSafeCall( cudaHostGetDevicePointer( (void**)&reduce_d, (void*)reduce, 0 ) );

float *r0_d; // r0 = b - Ax
cutilSafeCall( cudaMalloc( (void**)&r0_d, vecSize * sizeof(float) ) );
cutilSafeCall( cudaMemcpyAsync( r0_d + vecSize - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );

float *w_d;
cutilSafeCall( cudaMalloc( (void**)&w_d, vecSize * sizeof(float) ) );
cutilSafeCall( cudaMemcpyAsync( w_d + vecSize - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );

float *v_d;
cutilSafeCall( cudaMalloc( (void**)&v_d, ( m + 1 ) * vecSize * sizeof(float) ) );
for ( int i = 0; i < m + 1; ++i ) {
    cutilSafeCall( cudaMemcpyAsync( v_d + vecSize*i - 2048, val_d + nzSize - 2048, 2048 * sizeof(float), cudaMemcpyDeviceToDevice, 0 ) );
}

The error arises from the for loop at the end of the code above… Would anyone please tell me where I went wrong?

It would error if the destination pointer falls before the start of its allocation:

v_d + vecSize*i - 2048 < v_d

Or if the source pointer does:

val_d + nzSize - 2048 < val_d

Also, you might simplify debugging by replacing cudaMemcpyAsync with cudaMemcpy, since cudaMemcpy is guaranteed to be synchronous with respect to the host.

Actually it will error if

vecSize*i-2048 < 0

This means for i=0, you will get an error.

If v_d is a properly allocated device pointer, v_d - 2048 almost certainly is not.

Thanks, @cheinger, @txbob. Will try and update.

Hi @cheinger and @txbob: Thank you for the tip, your suggestion solves the problem.

On a separate issue, is it normal that my speedup is only about 3x over the CPU-only version?

Thanks!

That can be normal, as long as your GPU code is correct, meaning your CPU and GPU code produce the same results. Comparing against the CPU results is also a good sanity-check strategy to make sure the GPU code works well.

There are many factors determining the speedup between GPU and CPU. The speedup is defined as CPU_timing/GPU_timing. These factors include the GPU and CPU models, the application you are working on (memory-bound or compute-bound), etc.

This thread has some discussion on the topic of CPU and GPU speed comparison.
https://devtalk.nvidia.com/default/topic/953975/sequential-code-is-faster-than-parallel-how-is-it-possible-/