Right now, the answer seems to be “don’t use linked lists, use arrays instead”.
You can’t dereference a GPU pointer on the CPU, as you do in your second cudaMalloc call, because the pointer refers to device memory that the CPU cannot access. There is no good way of doing this without an allocation call on the video card itself…
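One way to follow the “use arrays” advice is to keep every node in a single array and store the index of the next node rather than a pointer. The host then never has to dereference a device address, and the whole list goes across in one cudaMemcpy. This is only a minimal sketch: the Node layout, the sumList kernel and the sumListOnDevice helper are names made up for illustration, not anything from your code.

```cuda
#include <cuda_runtime.h>

// Hypothetical node type: `next` is an index into the node array, not a
// pointer, so the same value means the same thing on the host and the device.
struct Node {
    int value;
    int next;   // index of the next node, or -1 at the end of the list
};

// Walk the list on the device with a single thread and sum the values.
__global__ void sumList(const Node *nodes, int head, int *result)
{
    int sum = 0;
    for (int i = head; i != -1; i = nodes[i].next)
        sum += nodes[i].value;
    *result = sum;
}

// Build a four-node list on the host, copy it to the device in one
// cudaMemcpy, and sum it there. Returns the sum, or -1 if a CUDA call fails.
int sumListOnDevice()
{
    const int n = 4;
    Node h_nodes[n];
    for (int i = 0; i < n; ++i) {
        h_nodes[i].value = i + 1;                     // values 1, 2, 3, 4
        h_nodes[i].next  = (i + 1 < n) ? i + 1 : -1;  // chain 0 -> 1 -> 2 -> 3
    }

    Node *d_nodes  = nullptr;
    int  *d_result = nullptr;
    int   h_result = -1;

    if (cudaMalloc((void **)&d_nodes, sizeof(h_nodes)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_result, sizeof(int)) != cudaSuccess) {
        cudaFree(d_nodes);
        return -1;
    }

    // The host only ever touches h_nodes; d_nodes is passed around as an
    // opaque address and is never dereferenced on the CPU.
    cudaMemcpy(d_nodes, h_nodes, sizeof(h_nodes), cudaMemcpyHostToDevice);
    sumList<<<1, 1>>>(d_nodes, 0, d_result);

    // This copy waits for the kernel to finish and reports any earlier error.
    if (cudaMemcpy(&h_result, d_result, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_result = -1;

    cudaFree(d_nodes);
    cudaFree(d_result);
    return h_result;
}
```

Calling sumListOnDevice() from main should give the sum of the four values. The design choice is that an index stays valid in both address spaces, which a raw pointer does not.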
There is a way to allocate device memory from the CPU side of things, with cudaMalloc. There is currently no way to allocate memory inside a __global__ or __device__ function without writing your own malloc that uses the memory allocated by cudaMalloc as a source pool.
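To give an idea of what “your own malloc” could look like, here is a rough sketch of a bump allocator: the host reserves a block of nodes with cudaMalloc, and a __device__ function hands out slots from it using atomicAdd on a counter. Everything named here (Pool, poolAlloc, buildAndCount, buildListOnDevice) is hypothetical, and the allocator never frees anything, so treat it as a starting point under those assumptions rather than a finished design.

```cuda
#include <cuda_runtime.h>

struct Node {
    int   value;
    Node *next;   // device pointer, only ever dereferenced in device code
};

// Hypothetical pool descriptor. Both pointers refer to device memory that
// the host obtained from cudaMalloc.
struct Pool {
    Node *nodes;     // backing storage for the nodes
    int   capacity;  // number of nodes that fit in `nodes`
    int  *used;      // counter of slots handed out so far
};

// Minimal "malloc": atomically claim the next free slot.
// Returns nullptr once the pool is exhausted. Nothing is ever freed.
__device__ Node *poolAlloc(Pool pool)
{
    int slot = atomicAdd(pool.used, 1);
    return (slot < pool.capacity) ? &pool.nodes[slot] : nullptr;
}

// Build a list of up to `count` nodes from the pool, then walk it and
// report how many nodes were actually linked in.
__global__ void buildAndCount(Pool pool, int count, int *lengthOut)
{
    Node *head = nullptr;
    for (int i = 0; i < count; ++i) {
        Node *n = poolAlloc(pool);
        if (n == nullptr)
            break;            // pool exhausted, stop growing the list
        n->value = i;
        n->next  = head;      // push onto the front of the list
        head     = n;
    }

    int length = 0;
    for (Node *p = head; p != nullptr; p = p->next)
        ++length;
    *lengthOut = length;
}

// Host side: reserve the pool, run the kernel, and return the list length,
// or -1 if a CUDA call fails.
int buildListOnDevice(int capacity, int count)
{
    Pool pool;
    pool.nodes    = nullptr;
    pool.used     = nullptr;
    pool.capacity = capacity;
    int *d_length = nullptr;
    int  h_length = -1;

    bool ok =
        cudaMalloc((void **)&pool.nodes, capacity * sizeof(Node)) == cudaSuccess &&
        cudaMalloc((void **)&pool.used, sizeof(int)) == cudaSuccess &&
        cudaMalloc((void **)&d_length, sizeof(int)) == cudaSuccess &&
        cudaMemset(pool.used, 0, sizeof(int)) == cudaSuccess;

    if (ok) {
        buildAndCount<<<1, 1>>>(pool, count, d_length);
        if (cudaMemcpy(&h_length, d_length, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            h_length = -1;
    }

    cudaFree(pool.nodes);   // cudaFree accepts a null pointer
    cudaFree(pool.used);
    cudaFree(d_length);
    return h_length;
}
```

For instance, buildListOnDevice(8, 5) asks for five nodes from a pool of eight, while buildListOnDevice(3, 5) runs out after three and the list simply stops growing. The kernel uses a single thread to keep the example short; the atomicAdd is what would let several threads draw from the same pool.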