We are developing an adaptive filter algorithm to run on the GPU. It needs to build a Toeplitz matrix from a subsection of a data vector on the device. With the current cuBLAS functions, we have to write our own kernel code to do this efficiently. We have assumed that the device pointer returned by cublasAlloc() can be manipulated in a kernel function just like memory allocated by cudaMalloc(). We tested this assumption as follows:

Use cublasAlloc() and cublasSetMatrix() to set a matrix on the device to a known set of values.

Call our kernel routine to manipulate the matrix on the device.

Use cublasGetMatrix() to copy the matrix back to the host and compare it with our “gold” result.

This test runs successfully on our graphics card, so we assume that the pointer cublasAlloc() returns refers to the same kind of device memory that cudaMalloc() would return. Our concern is that we found nothing in the cuBLAS documentation saying that it is legal to manipulate a cuBLAS matrix pointer this way, so it may stop working in a later cuBLAS version if the underlying allocation scheme changes. Does anyone have any information on this? Thanks!
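
For concreteness, here is a rough sketch of the kind of test described above. It is not our actual code: the names fillToeplitz and runToeplitzCheck, the float element type, the column-major layout, and the indexing T(i,j) = x[start + i - j] are all assumptions made for illustration. It also allocates with cudaMalloc() rather than cublasAlloc(), so it only shows the structure of the test and does not by itself settle the question.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical kernel: fills a rows x cols column-major matrix T (leading
// dimension ld) so that T(i,j) = x[start + i - j], which is constant along
// each diagonal, i.e. Toeplitz.
__global__ void fillToeplitz(const float *x, int start, float *T,
                             int rows, int cols, int ld)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index
    if (i < rows && j < cols)
        T[i + j * ld] = x[start + i - j];
}

// Hypothetical host test: returns the number of mismatching elements,
// or -1 on invalid sizes or any CUDA/cuBLAS error.
int runToeplitzCheck(int n, int rows, int cols)
{
    int start = cols - 1;  // smallest offset that keeps start + i - j >= 0
    if (rows < 1 || cols < 1 || start + rows > n) return -1;

    std::vector<float> hx(n);
    for (int k = 0; k < n; ++k) hx[k] = static_cast<float>(k + 1);
    std::vector<float> hT(rows * cols, -1.0f);  // known initial values

    float *dx = nullptr, *dT = nullptr;
    bool ok = cudaMalloc(&dx, n * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&dT, rows * cols * sizeof(float)) == cudaSuccess;

    // Step 1: copy known values to the device with the cuBLAS helpers.
    ok = ok && cublasSetVector(n, sizeof(float), hx.data(), 1, dx, 1)
                   == CUBLAS_STATUS_SUCCESS;
    ok = ok && cublasSetMatrix(rows, cols, sizeof(float),
                               hT.data(), rows, dT, rows)
                   == CUBLAS_STATUS_SUCCESS;

    // Step 2: let our own kernel write directly through the device pointer.
    if (ok) {
        dim3 block(16, 16);
        dim3 grid((rows + block.x - 1) / block.x,
                  (cols + block.y - 1) / block.y);
        fillToeplitz<<<grid, block>>>(dx, start, dT, rows, cols, rows);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }

    // Step 3: copy the matrix back and compare with a host "gold" result.
    ok = ok && cublasGetMatrix(rows, cols, sizeof(float),
                               dT, rows, hT.data(), rows)
                   == CUBLAS_STATUS_SUCCESS;

    cudaFree(dx);  // cudaFree(nullptr) is a no-op
    cudaFree(dT);
    if (!ok) return -1;

    int mismatches = 0;
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            if (hT[i + j * rows] != hx[start + i - j]) ++mismatches;
    return mismatches;
}
```

Calling runToeplitzCheck(64, 8, 4) from main() should return 0 if the kernel and the cuBLAS copy helpers see the same device memory. Something like `nvcc test.cu -lcublas` should build it, assuming the file is named test.cu.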
–Paul