AFAIK Julia emits PTX through a patched LLVM, which I think every language should do so we converge on a common optimization platform. CuArrays also wraps several higher-level NVIDIA libraries, such as cuBLAS and cuDNN.
The goals look similar to me, so it's worth taking a look at them.
Yes, I was planning to start with what Numba (http://numba.pydata.org/) is doing; they also use the LLVM PTX backend.
There is a really promising new project by Chris Lattner (the original author of LLVM) called MLIR: https://github.com/tensorflow/mlir. That might be the best intermediate representation that all the compilers (Julia, Fortran, ...) could target.