Problem with streaming load and store on FERMI

I am trying to implement basic intrinsic function for streaming load/store operation for floating point and integer numbers. These functions are based on the examples given in the Inline PTX in CUDA guide and Cache operator PTX instruction givein Version 2.3, Section, Page 110).

__device__  inline int load_streaming_int(int* addr)


	int res;

	asm(" %0, [%1];" : "=r"(res) : "l"(addr));

	return res;


__device__  inline void store_streaming_int(int* addr, int val)


	asm(" [%0], %1;" : "+l"(addr): "r"(val));


I ran a simple example kernel (given below) and used the NVIDIA profiler for measuring the load/store performance. Strangely, the profiler counters “uncached global load transaction” is always zero. I was expecting a non-zero value for this counter. I checked the generated PTX code and they seem to use the st.cs and ld.cs (cache streaming) instructions.

__device__  int d_global_mem[NUM_VALUES];

	    int h_global_mem[NUM_VALUES];

__global__ void test_streaming()


	int gid=threadIdx.x+blockIdx.x*blockDim.x;

	int nthreads = blockDim.x*gridDim.x;

	for (int i = gid; i < NUM_VALUES; i+= nthreads) {

		int val = load_streaming_int(&d_global_mem[i]);

		val = val*val;

		store_streaming_int(&d_global_mem[i], val);



Any suggestions would be appreciated.

emmm… maybe you could test the performance difference? I personally tend to think of the profiler as a big liar for most of the time.

I have attached the test program in case anyone wants to try out the approach (2.2 KB)

Any specific reason why you think profiler is a big liar ??

The profiler uses hardware based run-time counter that is initialized for each kernel and averaged for multiple runs.

I don’t see a reason to disbelieve the values reported by cuda-profiler.

I remember someone reporting in another thread that the profiler gave some very weird numbers about cache statistics.

It might be possible that the profiler doesn’t count cache streaming as uncached access. Maybe “global uncached access” only refers to the use of 32-byte cacheline, when L1 is disabled. I believe cache streaming still makes use of the default 128-byte cacheline on fermi.