If my data is an interleaved set of 50k samples, what is the best way to rearrange the samples so that all the even-indexed samples are placed in one region of CUDA device memory and all the odd-indexed samples in another?
You can handle this during the copy from host to device, if you wish, using cudaMemcpy2D.
However, this will be relatively slow.
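As a rough sketch of the cudaMemcpy2D approach: the interleaved host array can be treated as a 2D array whose rows are two samples wide, copying only one sample per row into a densely packed device buffer. The function name deinterleave_on_copy and the float sample type are assumptions made for illustration; they are not part of the question.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (name and float sample type are assumptions).
// Copies the even-indexed and odd-indexed elements of the interleaved host
// array h_in (n floats) into two separate, densely packed device buffers.
// d_even must hold (n + 1) / 2 floats and d_odd must hold n / 2 floats.
cudaError_t deinterleave_on_copy(const float* h_in, size_t n,
                                 float* d_even, float* d_odd)
{
    const size_t elem   = sizeof(float);
    const size_t n_even = (n + 1) / 2;
    const size_t n_odd  = n / 2;
    cudaError_t err = cudaSuccess;

    // Source: rows of two floats (spitch = 2 * elem), of which only the
    // first float is copied (width = elem).
    // Destination: rows of one float (dpitch = elem), i.e. contiguous.
    if (n_even > 0) {
        err = cudaMemcpy2D(d_even, elem, h_in, 2 * elem,
                           elem, n_even, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) return err;
    }

    // Same pattern, starting one element later, picks up the odd samples.
    if (n_odd > 0) {
        err = cudaMemcpy2D(d_odd, elem, h_in + 1, 2 * elem,
                           elem, n_odd, cudaMemcpyHostToDevice);
    }
    return err;
}
```

Each call transfers one sample per "row", which is why this route tends to be slow compared with a single contiguous copy.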
The best way is probably to just copy it to the device as-is, then launch a kernel on the device to reorganize the data however you wish. This pushes the data movement/reorganization to the device, which generally has the highest memory bandwidth in the system.
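A minimal hand-written sketch of that copy-then-reorganize idea might look like the following. The names deinterleave_kernel and deinterleave_on_device, the float sample type, and the block size of 256 are all assumptions chosen for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: one thread per input sample. Even-indexed samples go
// to even[i / 2], odd-indexed samples go to odd[i / 2].
__global__ void deinterleave_kernel(const float* in, size_t n,
                                    float* even, float* odd)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        even[i / 2] = in[i];
    else
        odd[i / 2] = in[i];
}

// Hypothetical host wrapper: copies the interleaved data to the device
// as-is in one contiguous transfer, then splits it on the device.
// d_even must hold (n + 1) / 2 floats and d_odd must hold n / 2 floats.
cudaError_t deinterleave_on_device(const float* h_in, size_t n,
                                   float* d_even, float* d_odd)
{
    if (n == 0) return cudaSuccess;

    float* d_in = nullptr;
    cudaError_t err = cudaMalloc(&d_in, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const unsigned int threads = 256;  // assumed block size
        const unsigned int blocks =
            static_cast<unsigned int>((n + threads - 1) / threads);
        deinterleave_kernel<<<blocks, threads>>>(d_in, n, d_even, d_odd);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }

    cudaFree(d_in);  // the temporary interleaved copy is no longer needed
    return err;
}
```

Here the host-to-device transfer is a single contiguous cudaMemcpy, and the strided access happens only in device memory.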
You can do this pretty easily in Thrust with permutation iterators, or with a construct built on a permutation iterator called a strided range iterator: