atomicLoad in CUDA through PTX ISA

EternalSaga · August 5, 2017, 11:07pm

The purpose of this question: I need to prevent threads from loading shared data from the caches.
I know that CUDA has no explicit atomicLoad function. Two similar approaches are to use an existing atomic read-modify-write (RMW) function or the volatile keyword. But I don’t want to pay the extra performance overhead of RMW functions, and volatile is not compatible with my current code. As a result, I have decided to use inline PTX.
However, I know nothing about assembly. Could anybody tell me how to implement atomicLoad with PTX? Thank you very much.
Perhaps the prototype looks like this. In addition, the threads that share the data may be in the same warp or in different warps.

unsigned long long int atomicLoad(unsigned long long int* addr){
    asm("
    "
    );
}

Robert_Crovella · August 5, 2017, 11:29pm

perhaps now is the time to learn

You can do a global load (or store) while bypassing the L1 cache using the cache modifier .cg on a PTX ld (or st) instruction:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators
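
As a side note, the same .cg load can usually be reached without writing any assembly: recent CUDA toolkits expose it through the `__ldcg()` intrinsic. The block below is only a minimal sketch, assuming a toolkit and GPU that provide that intrinsic. The kernel `readCg` and the host helper `roundTripCg` are hypothetical names made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Reads one value with the .cg cache operator (cache in L2, bypass L1)
// through the __ldcg() intrinsic instead of inline PTX.
__global__ void readCg(const unsigned int* in, unsigned int* out) {
    *out = __ldcg(in);
}

// Hypothetical host helper: sends one value through the kernel and back.
// Error checking is omitted to keep the sketch short.
unsigned int roundTripCg(unsigned int value) {
    unsigned int* dIn = nullptr;
    unsigned int* dOut = nullptr;
    unsigned int result = 0;
    cudaMalloc(&dIn, sizeof(unsigned int));
    cudaMalloc(&dOut, sizeof(unsigned int));
    cudaMemcpy(dIn, &value, sizeof(value), cudaMemcpyHostToDevice);
    readCg<<<1, 1>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling `roundTripCg(42u)` from `main` should hand back the value that was written, which is enough to confirm that the load path compiles and runs on a given device.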

To build an inline PTX instruction, I suggest reading the inline PTX manual:

http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#axzz4opBHyGs7

There are examples of inline PTX assembly routines in many places if you care to look:

https://stackoverflow.com/questions/37149662/how-to-write-lop3-based-instructions-for-maxwell-and-up-nvidia-architecture

https://stackoverflow.com/questions/28881491/how-can-i-find-out-which-thread-is-getting-executed-on-which-core-of-the-gpu

as well as many examples here on these devtalk forums.

EternalSaga · August 5, 2017, 11:57pm

Thanks, txbob. I will try to solve it by myself.

EternalSaga · August 6, 2017, 1:18am

Hello, txbob,
finally, I think this code suits my case, but it produces a strange compile error and I don’t know why.

__device__ unsigned int ptxLoad(unsigned int* global) {
	unsigned int local;
	asm("ld.global.cg.u32 %0, %1;"
		:"=r"(local) : "r"(global));
	return local;
}

Severity	Code	Description	Project	File	Line	Suppression State
Error		asm operand type size(8) does not match type/size implied by constraint 'r'

Robert_Crovella · August 6, 2017, 3:01am

A pointer (unsigned int *global) is 8 bytes on a 64-bit architecture.

You’re using an incorrect constraint letter (r) for an 8-byte quantity. The constraint letters are listed here:

Inline PTX Assembly :: CUDA Toolkit Documentation
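
For reference, the inline PTX constraint letters map to PTX register types roughly as follows: "h" is a 16-bit integer, "r" a 32-bit integer, "l" a 64-bit integer, "f" a 32-bit float, and "d" a 64-bit float. A minimal sketch of the 32-bit loader with the pointer constraint changed to "l" might look like the block below. It assumes a 64-bit build and a pointer that refers to global memory, and it also wraps the address operand in square brackets, which is how PTX `ld` expects an address. The names `ptxLoad32`, `loadKernel`, and `runLoad32` are hypothetical.

```cuda
#include <cuda_runtime.h>

// "=r" binds the 32-bit result register; "l" binds the 64-bit pointer.
// [%1] tells PTX to treat operand 1 as an address to load from.
// Assumes 'global' points into global memory on a 64-bit build.
__device__ unsigned int ptxLoad32(const unsigned int* global) {
    unsigned int local;
    asm("ld.global.cg.u32 %0, [%1];" : "=r"(local) : "l"(global));
    return local;
}

// Copies one value from 'in' to 'out' using the PTX loader.
__global__ void loadKernel(const unsigned int* in, unsigned int* out) {
    *out = ptxLoad32(in);
}

// Hypothetical host helper; error checking omitted for brevity.
unsigned int runLoad32(unsigned int value) {
    unsigned int* dIn = nullptr;
    unsigned int* dOut = nullptr;
    unsigned int result = 0;
    cudaMalloc(&dIn, sizeof(unsigned int));
    cudaMalloc(&dOut, sizeof(unsigned int));
    cudaMemcpy(dIn, &value, sizeof(value), cudaMemcpyHostToDevice);
    loadKernel<<<1, 1>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Only the pointer operand changes width here: the loaded value is still 32 bits, so the output keeps the "r" constraint.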

EternalSaga · August 7, 2017, 7:19pm

I think the proper solution is like this:

__device__ inline unsigned long long int ptxAtomicLoad(unsigned long long int* global) {
	unsigned long long int local;
	asm("ld.global.cg.u64 %0, [%1];"
		:"=l"(local) : "l"(global));
	return local;
}
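
One possible hardening of this, offered as a sketch rather than a definitive answer: the inline PTX manual notes that the compiler may delete or move an `asm` statement it believes has no side effects, and recommends `asm volatile` together with a "memory" clobber when that matters. The variant below applies both. It assumes a 64-bit build and a pointer into global memory. The names `ptxLoadCg`, `sumKernel`, and `sumViaCgLoads` are hypothetical. Note also that the PTX documentation describes cache operators as performance hints, so this gives an L1-bypassing read rather than an atomic operation in the memory-model sense.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumes a 64-bit build and that 'addr' points into global memory.
__device__ inline unsigned long long ptxLoadCg(const unsigned long long* addr) {
    unsigned long long value;
    // volatile: keep the compiler from deleting or moving this asm.
    // "memory": tell the compiler that memory may have changed, so it
    // does not reuse values it loaded before this statement.
    asm volatile("ld.global.cg.u64 %0, [%1];"
                 : "=l"(value)
                 : "l"(addr)
                 : "memory");
    return value;
}

// Each thread loads one element through the .cg path and adds it to *sum.
__global__ void sumKernel(const unsigned long long* data, int n,
                          unsigned long long* sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, ptxLoadCg(data + i));
    }
}

// Hypothetical host helper: sums 0..n-1 on the device (requires n > 0).
// Error checking is omitted to keep the sketch short.
unsigned long long sumViaCgLoads(int n) {
    std::vector<unsigned long long> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<unsigned long long>(i);

    unsigned long long* dData = nullptr;
    unsigned long long* dSum = nullptr;
    unsigned long long result = 0;
    cudaMalloc(&dData, n * sizeof(unsigned long long));
    cudaMalloc(&dSum, sizeof(unsigned long long));
    cudaMemcpy(dData, host.data(), n * sizeof(unsigned long long),
               cudaMemcpyHostToDevice);
    cudaMemset(dSum, 0, sizeof(unsigned long long));
    sumKernel<<<(n + 127) / 128, 128>>>(dData, n, dSum);
    cudaMemcpy(&result, dSum, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dSum);
    return result;
}
```

For example, `sumViaCgLoads(64)` adds the integers 0 through 63 and should return 2016.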
