The result is wrong when using a release build.

I am using a Tesla K40 (Compute Capability 3.5) and the latest CUDA toolkit (7.5). I am running Ubuntu.

The following code is broken in a release build but works in a debug build:
To test it, copy and paste the code below into Nvidia Nsight, build, and run. Also make sure C++11 support is enabled for the compiler.

#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;

#define Pindex 0
#define gpuE(ans) { gpuAssert((ans), __FILE__, __LINE__);}
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
	if (code != cudaSuccess){
		std::cout<<"GpuError : "<<cudaGetErrorString(code)<<" in "<<file<<" at : "<<line<<std::endl;
		if (abort) exit(code);
	}
}

typedef struct {
	unsigned short year;
	unsigned char month;
	unsigned char day;
	unsigned char hour;
} fileData;

typedef struct {
	unsigned int packeddateandtime;
} packedfileData;

__device__   __host__ fileData unpackdata(packedfileData pfd){
	fileData fd;
	unsigned int pfdDT = pfd.packeddateandtime;
	fd.hour		=  pfdDT & 31;
	fd.day		= (pfdDT & (31 << 5))	>> 	5;
	fd.month 	= (pfdDT & (15 << 10))	>> 10;
	fd.year  	=  pfdDT >> 14;
	return fd;
}

__device__   __host__  packedfileData  bitpackdata(fileData file_Data){
	unsigned int packeddateandtime = 0;
	packeddateandtime |= file_Data.hour;
	packeddateandtime |= file_Data.day << 5;
	packeddateandtime |= file_Data.month << 10;
	packeddateandtime |= file_Data.year << 14;
	packedfileData pf;
	pf.packeddateandtime = packeddateandtime;
	return pf;
}

__device__ __host__ void extract(unsigned int index, packedfileData* pfd,fileData* unpackeddata) {

	auto i = index;
	if (i == Pindex){
		auto m = pfd->packeddateandtime;
		printf("Got:%d \n", m);
	}
	unpackeddata[i] = unpackdata(pfd[i]);
	if (i == Pindex) {
		auto m = (int) unpackeddata[i].month;
		printf("month:%d \n", m);
	}
}

__global__ void extractGpu(packedfileData* pfd ,fileData* unpackeddata,unsigned int size) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	if (idx < size){
		extract(idx, pfd, unpackeddata);
	}
}

int main(void){
	packedfileData pfd[1];
	fileData fd = {};
	fileData fda[1];
	fd.day = 21;
	fd.month = 4;
	fd.year = 2017;
	pfd[0] = bitpackdata(fd);
	auto H2D = cudaMemcpyHostToDevice;
	auto D2H = cudaMemcpyDeviceToHost;
	packedfileData* d_pfd;
	fileData* d_unpackeddata;
	auto totalsize = 1;
	gpuE(cudaMalloc(&d_pfd, totalsize * sizeof(packedfileData)));
	gpuE(cudaMalloc(&d_unpackeddata, totalsize * sizeof(fileData)));
	gpuE(cudaMemcpy(d_pfd, pfd, totalsize * sizeof(packedfileData), H2D));
	int extractblocks = totalsize / 64 + 1;
	cout << (int) unpackdata(pfd[Pindex]).month << endl;
	cout << pfd[Pindex].packeddateandtime  <<endl;
	extractGpu<<<extractblocks, 64>>>(d_pfd, d_unpackeddata, totalsize);
	gpuE(cudaDeviceSynchronize());
	return 0;
}

What is the expected and what is your actual output? Here is what I get, after building with CUDA 7.5 in both debug and release builds:


What is the exact invocation of nvcc that you used to build the code? FWIW I ran cuda-memcheck on the code and no issues are reported.

I was able to reproduce the issue. I have filed a bug report with NVIDIA.

The output I get in the failing case:

$ nvcc -std=c++11 -o t1184 t1184.cu
warning: variable "D2H" was declared but never referenced
warning: variable "D2H" was declared but never referenced

$ ./t1184

The output I get in the passing case:

$ nvcc -std=c++11 -o t1184 -G t1184.cu
warning: variable "D2H" was declared but never referenced
warning: variable "D2H" was declared but never referenced

$ ./t1184

In the failing case, the “month” output from GPU code is 0 but it should be 4.

I’ve tested this on CUDA 7.5, Fedora 20, Quadro5000 GPU, and also on CUDA 8.0RC, Centos7, Tesla K20X (add -arch=sm_35 to each compile command).

Thanks for the concise report.

Yes. The month ends up being 0 when it should have been 4.

This is the smallest code which reproduces the issue.

I built the code for sm_50 natively, also built PTX for sm_20 and sm_35 then JIT-compiled to sm_50 and don’t see the issue. This seems to suggest that the problem may be local to PTXAS which is responsible for machine-specific code generation from PTX intermediate code.

You may want to try adding -Xptxas -O0 to your nvcc invocation to see whether that makes the problem go away. Obviously this would be just a work-around, and it will cost some performance, but if it works it can tide you over until the compiler is fixed.
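For reference, the work-around invocation would look something like this (source file name assumed; adjust to your own):

```shell
# Work-around: pass -O0 through to PTXAS to disable its optimizations.
# This costs some performance but sidesteps the suspected code-gen bug.
nvcc -std=c++11 -Xptxas -O0 -o t1184 t1184.cu
./t1184
```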

Yes, -Xptxas -O0 makes the problem go away, according to my testing.

It’s interesting that building for sm_50 does not show the issue.

As far as I am aware, PTXAS uses different code generator components for sm_2x, sm_3x, and sm_5x, so a PTXAS bug can certainly be specific to just one of these architecture families.