I forgot to say that I tried using separate threads for the gc. It is a natural fit for this, in that the most compute-expensive part is just spinning through the heap, which can easily be split into N threads. But I got no overall speedup from this. There's a brief comment about it in s7.c, around line 7079.
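To make the "split into N threads" idea concrete, here is a minimal sketch using POSIX threads. It is not the code from s7.c: the cell layout (`cell_t`), the job struct, and the names `sweep_range` and `parallel_sweep` are all made up for illustration, and it assumes the heap is a flat array whose cells were already marked by an earlier mark pass. Each thread sweeps its own slice, so no locking is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cell layout for illustration; s7's real cells differ. */
typedef struct {
    bool marked;   /* set by the mark phase for reachable cells */
    bool in_use;   /* cleared once the sweep has freed the cell */
} cell_t;

/* Work description for one thread: a half-open slice [start, end). */
typedef struct {
    cell_t *heap;
    size_t start, end;
    size_t freed;  /* result: number of cells this thread freed */
} sweep_job_t;

#define MAX_SWEEP_THREADS 16

/* Sweep one slice: free unmarked cells, clear marks for the next gc. */
static void *sweep_range(void *arg)
{
    sweep_job_t *job = arg;
    job->freed = 0;
    for (size_t i = job->start; i < job->end; i++) {
        cell_t *c = &job->heap[i];
        if (c->in_use && !c->marked) {
            c->in_use = false;
            job->freed++;
        }
        c->marked = false;
    }
    return NULL;
}

/* Split the heap into nthreads slices and sweep them concurrently.
   Returns the total number of cells freed. */
size_t parallel_sweep(cell_t *heap, size_t heap_size, size_t nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_SWEEP_THREADS);

    pthread_t threads[MAX_SWEEP_THREADS];
    sweep_job_t jobs[MAX_SWEEP_THREADS];
    bool started[MAX_SWEEP_THREADS];
    size_t chunk = (heap_size + nthreads - 1) / nthreads; /* round up */

    for (size_t t = 0; t < nthreads; t++) {
        size_t start = t * chunk;
        if (start > heap_size) start = heap_size;
        size_t end = start + chunk;
        if (end > heap_size) end = heap_size;

        jobs[t] = (sweep_job_t){ heap, start, end, 0 };
        started[t] =
            pthread_create(&threads[t], NULL, sweep_range, &jobs[t]) == 0;
        if (!started[t])
            sweep_range(&jobs[t]); /* fall back to sweeping inline */
    }

    size_t total = 0;
    for (size_t t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(threads[t], NULL);
        total += jobs[t].freed;
    }
    return total;
}
```

Build with `-pthread`. A call such as `parallel_sweep(heap, heap_size, 4)` sweeps the heap in four slices. Note that this version creates and joins the threads on every collection; a persistent pool of worker threads would be the other obvious design.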