Pascal and Windows 10: slow random memory access :(

Hi.
I use the CUDA driver API (cuda.lib).
I wrote a small PTX kernel that does random memory accesses…
And I was struck by how slowly it runs!
On Win7 the same code runs 10 times faster.
What could be the problem?
Setup: Windows 10 build 1607, NVIDIA driver 372.90, GTX 1060.

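	mov.u32		round, 0x00;
$LLBfnv1:
	// FNV-style mix of rS and the round counter gives a chunk index
	xor.b32		mem, rS, round;
	mul.lo.u32	mem, mem, 0x01000193;
	xor.b32		mem, mem, mem4;

	// byte address = dagpointer + (index % n_c) * 128
	rem.u32		mem, mem, n_c;
	shl.b32		mem, mem, 7;
	add.u32		mem2, dagpointer, mem;

	// 8-byte load from the computed address (.cg: cached in L2, not L1)
	ld.global.cg.v2.u32	{mem, mem2}, [mem2];

	// fold the two loaded words into the running hashes
	mul.lo.u32	temp, temp, 0x01000193;
	xor.b32		temp, temp, mem;
	mul.lo.u32	temp2, temp2, 0x01000193;
	xor.b32		temp2, temp2, mem2;

	// loop for 64 rounds
	add.u32		round, round, 1;
	setp.lo.u32	p, round, 64;
	@p bra.uni	$LLBfnv1;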

If I remove the random part and access only one memory address, everything works fine.

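	mov.u32		round, 0x00;
$LLBfnv1:
	// same 8-byte load, but always from the fixed address dagpointer
	ld.global.cg.v2.u32	{mem, mem2}, [dagpointer];

	// fold the two loaded words into the running hashes
	mul.lo.u32	temp, temp, 0x01000193;
	xor.b32		temp, temp, mem;
	mul.lo.u32	temp2, temp2, 0x01000193;
	xor.b32		temp2, temp2, mem2;

	// loop for 64 rounds
	add.u32		round, round, 1;
	setp.lo.u32	p, round, 64;
	@p bra.uni	$LLBfnv1;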

Why does random access slow down execution?

Do you have any measurements?

Yes, I have.
Result on Windows 7: 20M executions.
Result on Windows 10: 4M executions.

This isn’t specific to your case, but random accesses in CUDA are usually the kiss of death for performance because of the global load size: a scattered read still fetches a whole memory transaction even when the thread uses only a few bytes of it. There could be small differences between the OSes that only show up with scattered reads.

This is an interesting thread because the goal is usually to avoid as many random accesses in CUDA as possible.

The issue is due to WDDM 1.x vs WDDM 2.x. Win 10 uses WDDM 2.x, while Win 7/8 use WDDM 1.x.

Leave it to MSFT to screw up a good thing. I have heard there is a way to use WDDM 1.x within the confines of Win 10, but I have not tried it myself.

One of the systems at my work runs Win 10 with a Maxwell Titan X under the TCC driver, and that configuration does not have this random access issue.

Look…
There are mining programs that use GPU resources.
Some of them have the same problem (random memory accesses), and when they run, they show 20 million executions. I think they use the CUDA runtime API, while I use the CUDA driver API.

It seems that NVIDIA simply left something unfinished in the drivers for Windows 10.

CudaaduC, that’s good to know. Windows 10 stuff is kind of wonky right now. I left Ubuntu for Windows because I wanted better driver support, and the irony is not lost on me… I curse every OS that I use.