Optimize Colour MLS
Colour MLS is currently by far the slowest component, yet it produces the best results. The cross support version in particular is slow, probably due to poor memory coalescing. The MLS code could be split into multiple kernels, with memory accesses arranged so that neighbouring threads read neighbouring addresses.
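A minimal sketch of what such a split might look like, assuming the cross support can be aggregated separably (a horizontal pass followed by a vertical pass). This is not the existing MLS code: the `CrossArms` struct, the kernel names and the `aggregateOnes` helper are all hypothetical, and a plain sum stands in for the real MLS weighting. In both passes consecutive threads map to consecutive `x`, so each loop iteration reads addresses that are close together within a row.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-pixel cross arm lengths (not the project's actual type).
struct CrossArms { unsigned char left, right, up, down; };

// Pass 1: sum along the horizontal arms of each pixel.
// Threads in a warp cover consecutive x in one row, so their reads of
// in[y * width + i] touch nearby addresses and are largely coalesced.
__global__ void horizontalPass(const float* in, const CrossArms* arms,
                               float* tmp, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    CrossArms a = arms[y * width + x];
    int x0 = max(x - (int)a.left, 0);
    int x1 = min(x + (int)a.right, width - 1);

    float sum = 0.0f;
    for (int i = x0; i <= x1; ++i) sum += in[y * width + i];
    tmp[y * width + x] = sum;
}

// Pass 2: sum the horizontal results along the vertical arms.
// For a given row j, neighbouring threads read tmp[j * width + x] with
// consecutive x, which again keeps accesses within a warp contiguous
// when arm lengths are similar.
__global__ void verticalPass(const float* tmp, const CrossArms* arms,
                             float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    CrossArms a = arms[y * width + x];
    int y0 = max(y - (int)a.up, 0);
    int y1 = min(y + (int)a.down, height - 1);

    float sum = 0.0f;
    for (int j = y0; j <= y1; ++j) sum += tmp[j * width + x];
    out[y * width + x] = sum;
}

// Hypothetical host helper: runs both passes on an image of ones with a
// uniform arm length. Error checking is omitted for brevity.
std::vector<float> aggregateOnes(int width, int height, unsigned char arm) {
    size_t n = (size_t)width * height;
    std::vector<float> hostIn(n, 1.0f), hostOut(n);
    std::vector<CrossArms> hostArms(n, CrossArms{arm, arm, arm, arm});

    float *in, *tmp, *out;
    CrossArms* arms;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&tmp, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&arms, n * sizeof(CrossArms));
    cudaMemcpy(in, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(arms, hostArms.data(), n * sizeof(CrossArms), cudaMemcpyHostToDevice);

    dim3 block(32, 8);  // 32 threads along x: one warp per row segment
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    horizontalPass<<<grid, block>>>(in, arms, tmp, width, height);
    verticalPass<<<grid, block>>>(tmp, arms, out, width, height);

    // Blocking copy on the default stream; waits for both kernels to finish.
    cudaMemcpy(hostOut.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in); cudaFree(tmp); cudaFree(out); cudaFree(arms);
    return hostOut;
}
```

Calling `aggregateOnes(8, 8, 1)` should give 9 for interior pixels (a 3x3 neighbourhood of ones) and 4 at the corners, where the arms are clipped. Whether the real MLS weights separate this cleanly would need checking against the existing implementation, ideally with a profiler to confirm that coalescing is the actual bottleneck.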
Edited by Nicolas Pope