[CCC DEV] NVIDIA CUDA compiler?

Fred Barnes F.R.M.Barnes at kent.ac.uk
Thu Dec 15 11:04:33 GMT 2011


Hi,

Matt Jadud writes:
>
> http://www.phoronix.com/scan.php?page=news_item&px=MTAyODI
> http://www.phoronix.com/scan.php?page=news_item&px=MTAyNjk
>
> I think Carl made some initial explorations in transforming ETC to
> LLVM bytecodes... would a CUDA backend to LLVM be a high-value
> proposition in terms of investing effort/energy in supporting a
> new/improved backend target?

The ETC -> LLVM translator was mostly working, and might still be in full
use on some MacOS targets.  I recall that some changes on the LLVM side of
things broke something or other, but I don't know how badly.

I've got two final-year students currently exploring GPU usage in occam-pi.
At the moment this is done by calling out to chunks of CUDA and OpenCL code
(in the external C calls way) rather than anything fancy like targeting the
GPU from ETC->LLVM.  Whether there is significant benefit in that approach
I don't know -- my gut feeling is that your mileage may vary.  The GPU, like
all SIMD architectures, requires programs to be written in a particular
way to get the most benefit -- i.e. data-parallel.  The process-oriented
approach tends to encourage breaking things down MIMD style.  To get that
back to SIMD we'd probably want to collect all the maths together again
in a single process to feed the GPU, which nullifies the process-
oriented approach somewhat (except for its nice design properties).
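
For the curious, the chunks the students are calling out to are roughly this
shape -- a minimal sketch with made-up names (gpu_scale, scale_kernel), and
the occam-side external-call declaration left out:

    #include <cuda_runtime.h>

    /* trivially data-parallel kernel: one thread per element */
    __global__ void scale_kernel (float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] = data[i] * factor;
        }
    }

    /* plain C entry point that an occam-pi external C call could target */
    extern "C" void gpu_scale (float *data, int n, float factor)
    {
        float *d_data;
        size_t bytes = (size_t) n * sizeof (float);

        cudaMalloc (&d_data, bytes);
        cudaMemcpy (d_data, data, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>> (d_data, n, factor);

        cudaMemcpy (data, d_data, bytes, cudaMemcpyDeviceToHost);
        cudaFree (d_data);
    }

The occam side just sees an ordinary external call that hands an array over
and gets it back filled in.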

I had been thinking about half-way-house abstractions for getting some
performance out of a GPU in a nice process-oriented way.  For instance,
some "sums" PROC which acts as a mathsy server for a bunch of other processes,
and can collect together enough similar work to make GPU execution worthwhile.
Though that would add some latency, it might not be a huge issue if there's
enough work to maintain good throughput.
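
On the CUDA side, the guts of such a "sums" server might be little more than
this -- a rough sketch with hypothetical names (sums_submit, sums_kernel),
queueing requests until there's a batch worth launching:

    #include <cuda_runtime.h>

    #define BATCH_SIZE 1024

    /* one thread per queued request; the body stands in for the real maths */
    __global__ void sums_kernel (const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * in[i];
        }
    }

    static float queued[BATCH_SIZE];
    static int n_queued = 0;

    /* called once per request from the server loop (occam binding elided);
     * returns 1 and fills results[] when a full batch has been flushed */
    extern "C" int sums_submit (float value, float *results)
    {
        queued[n_queued++] = value;
        if (n_queued < BATCH_SIZE) {
            return 0;
        }

        float *d_in, *d_out;
        size_t bytes = BATCH_SIZE * sizeof (float);

        cudaMalloc (&d_in, bytes);
        cudaMalloc (&d_out, bytes);
        cudaMemcpy (d_in, queued, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (BATCH_SIZE + threads - 1) / threads;
        sums_kernel<<<blocks, threads>>> (d_in, d_out, BATCH_SIZE);

        cudaMemcpy (results, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree (d_in);
        cudaFree (d_out);

        n_queued = 0;
        return 1;
    }

The latency cost is visible there: nothing comes back until the batch fills,
but each launch then covers BATCH_SIZE requests' worth of work.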

A more interesting question is what GPU-specific optimisations a CUDA
back-end might employ -- i.e. whether it would try to pull together fragments
of maths in a meaningful way to get SIMD performance, or whether it would
just do a brainless transformation and get little in the way of performance
over a CPU.
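
To make "brainless" concrete: something like (a + b) * s emitted as one
kernel per fragment, versus the fragments fused -- hypothetical kernels:

    /* brainless translation: each fragment becomes its own kernel, so every
     * intermediate value makes a round trip through global memory and each
     * launch pays its own overhead */
    __global__ void add_k (float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = a[i] + b[i];
    }

    __global__ void scale_k (float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = a[i] * s;
    }

    /* what a smarter back-end would want to emit: the fragments fused into
     * one kernel, touching each element once for a single launch */
    __global__ void fused_k (float *a, const float *b, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = (a[i] + b[i]) * s;
    }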

A cheap investigation could be to have some maths-heavy core occam-pi code
compiled into CUDA via LLVM and interfaced appropriately (occam FUNCTIONs
would be the most obvious candidates, as these don't deschedule or do
anything messy like that), and see how performance fares.  For example:

    [128]REAL32 FUNCTION compute.mandelbrot.line (VAL REAL32 x, y, zoom)
      ... sums
    :

    #PRAGMA CUDA compute.mandelbrot.line

Rigging the existing compiler to handle that and drop suitable markers in
the ETC wouldn't be too hard -- Carl's ETC-to-LLVM tool could then scoop
these up and handle them (and leave a suitable call stub for the occam-pi
world).  It might be easier to play with this in the Transterpreter, at
least for initial prodding.
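
For what it's worth, my guess at the shape of what might fall out of the
LLVM/CUDA end for the example above -- purely hypothetical, not what any
tool actually emits:

    #define LINE_WIDTH 128
    #define MAX_ITER   255

    /* one thread per element of the [128]REAL32 result; the occam-facing
     * call stub (copies in and out, the launch itself) is left out */
    __global__ void compute_mandelbrot_line (float *line, float x, float y,
                                             float zoom)
    {
        int i = threadIdx.x;
        if (i < LINE_WIDTH) {
            float cr = x + (((float) i) - (LINE_WIDTH / 2.0f)) / zoom;
            float ci = y;
            float zr = 0.0f, zi = 0.0f;
            int iter = 0;

            while (((zr * zr) + (zi * zi)) < 4.0f && iter < MAX_ITER) {
                float t = (zr * zr) - (zi * zi) + cr;
                zi = (2.0f * zr * zi) + ci;
                zr = t;
                iter++;
            }
            line[i] = (float) iter;
        }
    }

The stub left behind for the occam-pi world would then boil down to the
copies plus a single compute_mandelbrot_line<<<1, 128>>> launch.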

Thinking out loud, feel free to tell me to stop ;)


-- Fred