Memory copy to GPU 1 is slower in multi-GPU

My company's servers each have two GTX 295 cards, for a total of 4 GPUs per server, and we have several servers.

We noticed GPU 1 specifically was slow, in comparison to GPUs 0, 2 and 3, so I wrote a little speed test to help find the cause of the problem.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

__global__ void test_kernel(float *d_data) {
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	for (int i=0;i<10000;++i) {
		d_data[tid] = float(i*2.2);
		d_data[tid] += 3.3;
	}
}

int main(int argc, char* argv[])
{
	int deviceCount;
	cudaGetDeviceCount(&deviceCount);
	int device = 0; //SELECT GPU HERE
	cudaSetDevice(device);

	cudaEvent_t start, stop;
	unsigned int num_vals = 200000000;
	float *h_data = new float[num_vals];
	for (unsigned int i=0;i<num_vals;++i) {
		h_data[i] = float(i);
	}

	float *d_data = NULL;

	float malloc_timer;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord( start, 0 );
	cudaMalloc((void**)&d_data, sizeof(float)*num_vals);
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );
	cudaEventElapsedTime( &malloc_timer, start, stop );
	cudaEventDestroy( start );
	cudaEventDestroy( stop );

	float mem_timer;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord( start, 0 );
	cudaMemcpy(d_data, h_data, sizeof(float)*num_vals, cudaMemcpyHostToDevice);
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );
	cudaEventElapsedTime( &mem_timer, start, stop );
	cudaEventDestroy( start );
	cudaEventDestroy( stop );

	float kernel_timer;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord( start, 0 );
	test_kernel<<<1000,256>>>(d_data);
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );
	cudaEventElapsedTime( &kernel_timer, start, stop );
	cudaEventDestroy( start );
	cudaEventDestroy( stop );

	printf("cudaMalloc took %f ms\n", malloc_timer);
	printf("Copy to the GPU took %f ms\n", mem_timer);
	printf("Test Kernel took %f ms\n", kernel_timer);

	cudaMemcpy(h_data, d_data, sizeof(float)*num_vals, cudaMemcpyDeviceToHost);
	cudaFree(d_data);
	delete[] h_data;
	return 0;
}[/codebox]

The results are

GPU0

cudaMalloc took 0.908640 ms

Copy to the GPU took 296.058777 ms

Test Kernel took 326.721283 ms

GPU1

cudaMalloc took 0.913568 ms

Copy to the GPU took 663.182251 ms

Test Kernel took 326.710785 ms

GPU2

cudaMalloc took 0.925600 ms

Copy to the GPU took 296.915039 ms

Test Kernel took 327.127930 ms

GPU3

cudaMalloc took 0.920416 ms

Copy to the GPU took 296.968384 ms

Test Kernel took 327.038696 ms

As you can see, the cudaMemcpy to the GPU takes well over double the time on GPU 1. This is consistent across all our servers; it is always GPU 1 that is slow. Any ideas why that might be? All servers are running Windows XP.

I am not familiar with some of the timing code you are using; maybe it is from a newer library than the ones I have spent time with. I assume you have confirmed that the code follows all of the necessary guidelines for performing timing tests. If not, search the forum for those guidelines.
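For what it's worth, the usual guidelines amount to: do a warm-up transfer first, synchronize before starting the clock, and use pinned host memory if you want to measure the bus itself rather than pageable-memory staging overhead. A sketch along those lines (CUDA 2.x-era API, hence cudaThreadSynchronize; not tested on your exact setup):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 200000000 * sizeof(float); // same 800 MB payload
    float *h_data = NULL, *d_data = NULL;
    cudaMallocHost((void**)&h_data, bytes); // pinned (page-locked) host buffer
    cudaMalloc((void**)&d_data, bytes);

    // Warm-up copy so context/driver setup is not included in the timing.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned H2D copy: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

If the pinned-memory number on GPU 1 is still double the others, that points even more strongly at the link rather than at host-side paging.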

If the problem were on a single server, I would suggest checking that the card itself is OK; I have had a number of cards go bad over the last 1.5 years. But since it is consistent across several machines (at least, that is how I interpret your note), I would double-check your hardware configuration. Is that particular GPU in a 4x or 8x slot while the others are in 16x slots? That could certainly account for the difference.

I would guess this is a BIOS issue, since you said all servers have this problem.
Check the BIOS options and see if something needs to be enabled for the slot in which GPU 1 is placed.
Also check the PCI-e lane width.
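If your driver release includes NVML (later driver packages bundle it; it may not exist for the XP-era driver you are running), the negotiated link width can be read directly per GPU. A hedged sketch, assuming nvml.h and the NVML library are available on your system:

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML not available on this driver\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        unsigned int width = 0;
        // Print the currently negotiated PCI-e link width for each GPU.
        if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
            nvmlDeviceGetCurrPcieLinkWidth(dev, &width) == NVML_SUCCESS)
            printf("GPU %u: x%u PCI-e link\n", i, width);
    }
    nvmlShutdown();
    return 0;
}
```

A x4 or x8 reading for GPU 1 next to x16 for the others would confirm the slot/BIOS theory.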