Data Structures on the device

I had a debate today with some of my colleagues about what the best data structure would be to load 10k arrays of length 512 into CUDA memory.
The goal is to make these one-dimensional arrays searchable, so they need to be easy to traverse.

We couldn’t agree on one answer, so here is the question for everyone.

What would you use to load 10k arrays of length 512 into CUDA memory?
array[10000][512], linked list, vectors etc… ?

anything?

Allocate a single flat block of memory with cudaMallocPitch() and treat it as a pitched 2D array. Which dimension you make the rows and which the columns depends on how the threads will access the data. You say you want the short arrays searchable, so you probably want each short array to be one row: 10000 rows of 512 elements. A block could then load a row into shared memory and search it from there.
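
Here is a rough sketch of how that could look. It assumes the elements are floats, that each 512-element array is stored as one pitched row (10000 rows in total), and that "search" means finding the first element equal to a target value. None of that is stated in the question. The names searchRows and searchAllRows are made up for illustration, and error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kNumArrays = 10000;  // number of short arrays (one per row)
constexpr int kArrayLen  = 512;    // elements per short array (row width)

// One block per row, one thread per element. The row is staged in shared
// memory, then every thread compares its element against the target.
// result[row] receives the lowest matching index, or -1 if there is none.
__global__ void searchRows(const float* data, size_t pitch,
                           float target, int* result)
{
    __shared__ float row[kArrayLen];
    __shared__ int firstHit;

    const int r = blockIdx.x;
    const int c = threadIdx.x;

    // Rows start `pitch` bytes apart, so the row offset is computed in bytes.
    const float* src = reinterpret_cast<const float*>(
        reinterpret_cast<const char*>(data) + r * pitch);

    row[c] = src[c];                    // consecutive threads read consecutive elements
    if (c == 0) firstHit = kArrayLen;   // sentinel meaning "not found"
    __syncthreads();

    if (row[c] == target) atomicMin(&firstHit, c);
    __syncthreads();

    if (c == 0) result[r] = (firstHit < kArrayLen) ? firstHit : -1;
}

// Hypothetical host helper: copies a row-major host buffer to a pitched
// device allocation, searches every row, and returns one index per row.
std::vector<int> searchAllRows(const std::vector<float>& host, float target)
{
    assert(host.size() == static_cast<size_t>(kNumArrays) * kArrayLen);

    const size_t rowBytes = kArrayLen * sizeof(float);

    float* dData = nullptr;
    size_t pitch = 0;
    cudaMallocPitch(reinterpret_cast<void**>(&dData), &pitch,
                    rowBytes, kNumArrays);
    // The host rows are tightly packed, so the source pitch equals rowBytes.
    cudaMemcpy2D(dData, pitch, host.data(), rowBytes,
                 rowBytes, kNumArrays, cudaMemcpyHostToDevice);

    int* dResult = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dResult), kNumArrays * sizeof(int));

    searchRows<<<kNumArrays, kArrayLen>>>(dData, pitch, target, dResult);

    std::vector<int> result(kNumArrays);
    cudaMemcpy(result.data(), dResult, kNumArrays * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dResult);
    return result;
}
```

For a single pass like this, shared memory buys little because each element is read only once. It starts to pay off when the block reads the same row several times, for example to check many targets against it or to run a binary search on sorted data.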