950M faster than a 970M

I’m currently testing some code for a friend of mine, since I have access to a bunch of different MXM modules, and I got some odd results benchmarking it on a 950M versus a 970M:

/*
 ============================================================================
 Name        : PCIe_Bandwidth.cu
 Author      : Matt
 Version     :
 Copyright   : Your copyright notice
 Description : CUDA compute reciprocals
 ============================================================================
 */

#include <iostream>
#include <numeric>
#include <stdlib.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * CUDA kernel that computes reciprocal values for a given vector
 */
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	if (idx < vectorSize)
		data[idx] = 1.0/data[idx];
}

/**
 * Host function that copies the data and launches the work on GPU
 */
float *gpuReciprocal(float *data, unsigned size)
{
	float *rc = new float[size];	// allocate the whole result array, not a single float
	float *gpuData;

	CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float)*size));
	CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice));
	
	static const int BLOCK_SIZE = 256;
	const int blockCount = (size+BLOCK_SIZE-1)/BLOCK_SIZE;
	reciprocalKernel<<<blockCount, BLOCK_SIZE>>> (gpuData, size);

	CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost));
	CUDA_CHECK_RETURN(cudaFree(gpuData));
	return rc;
}

float *cpuReciprocal(float *data, unsigned size)
{
	float *rc = new float[size];	// allocate the whole result array, not a single float
	for (unsigned cnt = 0; cnt < size; ++cnt) rc[cnt] = 1.0/data[cnt];
	return rc;
}

void initialize(float *data, unsigned size)
{
	for (unsigned i = 0; i < size; ++i)
		data[i] = .5*(i+1);
}

int main(void)
{
	static const int WORK_SIZE = 65530;
	float *data = new float[WORK_SIZE];

	float time;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	for (int i = 0; i < 1000; i++)
	{
		cudaEventRecord(start, 0);

		initialize (data, WORK_SIZE);

		float *recCpu = cpuReciprocal(data, WORK_SIZE);
		float *recGpu = gpuReciprocal(data, WORK_SIZE);
		float cpuSum = std::accumulate (recCpu, recCpu+WORK_SIZE, 0.0);
		float gpuSum = std::accumulate (recGpu, recGpu+WORK_SIZE, 0.0);

		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		cudaEventElapsedTime(&time, start, stop);

		/* Verify the results */
		std::cout << i << ") gpuSum = "<<gpuSum<< " cpuSum = " <<cpuSum<< " time = " <<time<< std::endl;

		delete[] recCpu;
		delete[] recGpu;
	}

	/* Free memory */
	delete[] data;

	return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
	if (err == cudaSuccess)
		return;
	std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
	exit (1);
}

I ran the test 3 times on both cards; the 950M averaged around 7 seconds, while the 970M averaged around 9 seconds. In addition, I ran the benchmark program that is included in the CUDA toolkit and only got 3GB/s host-to-device (H-D) speeds on the 970M vs 10GB/s on the 950M. Can anyone shed some light on this discrepancy? Thanks.

cross posting:

http://stackoverflow.com/questions/37196086/950m-faster-than-a-970m

Sounds like your 970M is not running correctly! You’ve pretty much already identified that yourself. A PCIe Gen3 x16 link should be able to support about 10GB/s H-D bandwidth.

But it’s literally brand new. Why would it be slower?

Perhaps it’s broken. Perhaps you’ve assumed you can successfully swap MXM modules when in reality you cannot.

Do you have some reason to believe you can drop any arbitrary MXM module into any arbitrary MXM slot, and everything will just work?

If you have a properly configured notebook computer, that originally shipped to you with a 970M MXM module, and upon testing that notebook, in the configuration it originally shipped in, you find that the module only supports ~3GB/s bandwidth, you might want to take that up with the manufacturer of the notebook. (Some manufacturers may declare that direct manipulation/removal/replacement of the MXM card violates warranty.) Otherwise, regarding other scenarios (e.g. I took this MXM module and moved it somewhere else) I think you may be making assumptions that are actually not the case.

Note that NVIDIA doesn’t sell MXM GPUs directly to end customers.

Also note the statement here:

http://www.notebookreview.com/howto/how-to-upgrade-nvidia-mxm-notebook-graphics-cards/

“Unless you’re buying the same MXM module to replace one that was already installed (or was offered) in your notebook, there’s no guarantee the module will work in your notebook.”

Even for a manufacturer that offers the “same” notebook configured with either a 950M or 970M MXM GPU, it’s possible that there are configurational differences outside of the MXM module itself (e.g. System BIOS, PSU, etc.) between those two different SKUs of the “same” notebook. Those configurational differences could be significant.

I’m not doing this on a laptop; it’s a COM Express carrier that my company manufactures. We actually sell the 970M along with our carrier board. I’m wondering what else could cause this transfer discrepancy besides the MXM module being damaged.