performance gap between host allocated halfs and kernel casted halfs

Robert_Crovella · January 22, 2020, 8:35pm

For a complete test case, I would like to see the entire code (all host code and device code needed to build a complete application, without me having to add or change anything: copy, paste, compile, run), along with the platform (OS, GPU) you are running on and the complete compile command.

If any of that is missing, I’m less likely to spend any time on it. For example, I wouldn’t want to waste time analyzing code, only to discover that OP is compiling a debug project instead of a release project and doing performance analysis on debug code.

At first glance, your two codes look different to me: in one case (the fast case) you load the a and b quantities exactly once and store c exactly once, while in the other case you load a and b multiple times (in source code, at least) and potentially store c multiple times. Claiming equivalence between the two presumes things about the compiler that I’m not sure are always true.
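To make that distinction concrete, here is a minimal sketch of the two patterns. This is not OP's code: the kernel names, the `__hfma` loop body, the sizes, and the use of managed memory are all assumptions made for illustration. It assumes a GPU with native half arithmetic and a matching `-arch` setting (for example `nvcc -arch=sm_70`).

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical variant A: load a and b once, accumulate in a register, store c once.
__global__ void fma_registers(const half *a, const half *b, half *c, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    half x = a[i];                      // single global load of a
    half y = b[i];                      // single global load of b
    half acc = __float2half(0.0f);
    for (int k = 0; k < iters; ++k)
        acc = __hfma(x, y, acc);        // register-only work
    c[i] = acc;                         // single global store of c
}

// Hypothetical variant B: the source touches global memory on every iteration.
// Without __restrict__ the compiler has to allow for c aliasing a or b,
// so it may or may not reduce this to variant A.
__global__ void fma_global(const half *a, const half *b, half *c, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < iters; ++k)
        c[i] = __hfma(a[i], b[i], c[i]);
}

int main()
{
    const int n = 1 << 20, iters = 8;   // illustrative sizes only
    half *a, *b, *c1, *c2;
    cudaMallocManaged(&a,  n * sizeof(half));
    cudaMallocManaged(&b,  n * sizeof(half));
    cudaMallocManaged(&c1, n * sizeof(half));
    cudaMallocManaged(&c2, n * sizeof(half));
    for (int i = 0; i < n; ++i) {
        a[i]  = __float2half(1.0f);
        b[i]  = __float2half(0.5f);
        c1[i] = __float2half(0.0f);
        c2[i] = __float2half(0.0f);     // variant B accumulates into c, so it must start at 0
    }

    const int threads = 256, blocks = (n + threads - 1) / threads;
    fma_registers<<<blocks, threads>>>(a, b, c1, n, iters);
    fma_global<<<blocks, threads>>>(a, b, c2, n, iters);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }

    // Compare the two results element by element.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (__half2float(c1[i]) != __half2float(c2[i])) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(a); cudaFree(b); cudaFree(c1); cudaFree(c2);
    return 0;
}
```

The two kernels compute the same values, but only the first one guarantees in source code that global memory is read and written once per thread. Comparing the generated SASS (for example with `cuobjdump -sass`) is one way to check what the compiler actually did with the second one.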

I also don’t like analyzing artificial code, like the loops of 1000. I don’t know what the compiler will be doing under the hood. To analyze performance well, my suggestion would be to just work on large data sets, rather than artificially increasing the work by a factor of 1000. The compiler might discover things about your loop of 1000 when all the data is per-thread local data that it cannot or does not discover when some of the data is global data.
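As a rough sketch of what I mean by a large data set, something like the following times one pass over 2^26 half elements with CUDA events, with no artificial inner loop. The kernel name, the problem size, the `CHECK` macro, and the compile command `nvcc -O3 -arch=sm_70 -o test test.cu` are my assumptions, not anything taken from OP's project.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_fp16.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(e_));                               \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// One load of a, one load of b, one store of c per thread; no inner loop.
__global__ void hadd_once(const half *a, const half *b, half *c, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd(a[i], b[i]);
}

int main()
{
    const size_t n = (size_t)1 << 26;          // large enough that launch overhead is minor
    const size_t bytes = n * sizeof(half);
    half *a, *b, *c;
    CHECK(cudaMalloc(&a, bytes));
    CHECK(cudaMalloc(&b, bytes));
    CHECK(cudaMalloc(&c, bytes));
    CHECK(cudaMemset(a, 0, bytes));            // all-zero bits are 0.0 in fp16
    CHECK(cudaMemset(b, 0, bytes));

    const int threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    hadd_once<<<blocks, threads>>>(a, b, c, n); // warm-up launch, not timed
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    hadd_once<<<blocks, threads>>>(a, b, c, n); // timed launch
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    // Two arrays read and one written, so 3 * bytes of traffic in total.
    printf("time: %.3f ms, effective bandwidth: %.1f GB/s\n",
           ms, 3.0 * (double)bytes / (ms * 1.0e6));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    CHECK(cudaFree(c));
    return 0;
}
```

With a kernel like this the work scales with the data rather than with a loop count, so there is much less room for the compiler to treat the two versions being compared differently.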
