Jetson TX1 vs. Jetson Nano

I am currently trying to benchmark the Jetson TX1 against the Jetson Nano. According to https://elinux.org/Jetson, both use the Maxwell architecture, with 128 CUDA cores on the Nano and 256 on the TX1. I would therefore expect the Nano to reach roughly half the performance of the TX1.

To test this, I wrote a kernel that performs a single float multiplication per element:

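// Element-wise product: mat1[i] = mat1[i] * mat2[i] for an nx-by-ny array.
__global__ void matrixMultiply(float* mat1, float* mat2, int nx, int ny)
{
	unsigned int ix = threadIdx.x + blockDim.x*blockIdx.x;
	unsigned int iy = threadIdx.y + blockDim.y*blockIdx.y;

	// Skip threads that fall outside the array when the grid overshoots.
	if (ix >= (unsigned int)nx || iy >= (unsigned int)ny) return;

	unsigned int idx = iy*nx + ix;
	mat1[idx] = mat1[idx]*mat2[idx];
}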

Test: multiplying two float arrays of size 15000*15000 took 130 ms on the TX1 and 150 ms on the Jetson Nano. The result seems odd; it looks as if I am not using the second SM of the TX1. I therefore profiled with sm_efficiency (100% on both the TX1 and the Nano) and achieved_occupancy (TX1 = 92%, Nano = 88%).
Am I missing something here, or am I just not using the proper grid and block configuration?
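
As a rough sanity check on these timings, the sketch below estimates the effective memory bandwidth the kernel reaches, assuming each element costs two float loads and one float store and that 1 GB = 1e9 bytes. The helper names (bytes_moved, effective_bandwidth_gbs) are made up for illustration, and it is plain host-side C++, so it should compile with or without nvcc.

```cpp
#include <cstddef>

// Bytes moved by mat1[idx] = mat1[idx] * mat2[idx] over an nx-by-ny array:
// two float loads and one float store per element (assumed, no caching effects).
constexpr double bytes_moved(std::size_t nx, std::size_t ny) {
    return 3.0 * sizeof(float) * static_cast<double>(nx) * static_cast<double>(ny);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured kernel time in ms.
constexpr double effective_bandwidth_gbs(std::size_t nx, std::size_t ny, double time_ms) {
    return bytes_moved(nx, ny) / (time_ms * 1e-3) / 1e9;
}

// Timings reported in the post for the 15000*15000 arrays.
constexpr double tx1_gbs  = effective_bandwidth_gbs(15000, 15000, 130.0);  // about 20.8
constexpr double nano_gbs = effective_bandwidth_gbs(15000, 15000, 150.0);  // about 18.0
```

That works out to roughly 20.8 GB/s on the TX1 and 18.0 GB/s on the Nano. If the commonly quoted 25.6 GB/s peak memory bandwidth applies to both modules (an assumption taken from NVIDIA's published specs, not something I measured), both boards would already be close to the memory limit, which might explain why the extra CUDA cores barely help for this kernel.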

P.S.: I tried all possible configurations, and the best one on both platforms was a block of (256, 1).
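
For reference, this is how the grid works out for a (256, 1) block on the 15000*15000 array, assuming a 2D grid with one thread per element and the usual round-up division. ceil_div, LaunchDims and launch_dims are hypothetical helper names, again in plain host-side C++.

```cpp
// Round-up integer division: smallest number of blocks of size d that cover n.
constexpr unsigned int ceil_div(unsigned int n, unsigned int d) {
    return (n + d - 1) / d;
}

// Block and grid extents for a 2D launch (hypothetical helper type).
struct LaunchDims {
    unsigned int block_x, block_y;
    unsigned int grid_x, grid_y;
};

// One thread per element of an nx-by-ny array, with a bx-by-by block.
constexpr LaunchDims launch_dims(unsigned int nx, unsigned int ny,
                                 unsigned int bx, unsigned int by) {
    return LaunchDims{bx, by, ceil_div(nx, bx), ceil_div(ny, by)};
}

// Configuration from the post: block (256, 1) on a 15000*15000 array.
constexpr LaunchDims dims = launch_dims(15000u, 15000u, 256u, 1u);

// Threads per row that land past the end of the row.
constexpr unsigned int overshoot_x = dims.grid_x * dims.block_x - 15000u;
```

With these numbers the grid is (59, 15000), and since 59*256 = 15104, the last block in each row has 104 threads past the end of the row. The kernel therefore needs an ix < nx guard so those threads do not touch elements of the next row (or go out of bounds on the last one).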


The result is below my expectation.
Note that the Nano is running CUDA 10.0, while the TX1 is still on CUDA 9.0.