One option would be to render each model against the material's individual leaf nodes into attachments, then use a composition graphics pipeline (in a separate subpass) to create the final image. Different materials can be numbered using an integer attachment, and they can be composed using a method similar to green-screen keying. Looking around, this seems to have a name: "deferred rendering".
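To make the keying idea concrete, here is a minimal CPU-side sketch in Python of what the composition step would do per pixel. It is only an illustration, not actual Vulkan or GLSL code: the function `compose`, the `layers` dictionary, and the convention that ID 0 means "no material" are all assumptions made up for this example.

```python
# Hypothetical CPU-side model of the composition pass; every name here is illustrative.

def compose(material_ids, layers, background=(0.0, 0.0, 0.0)):
    """For each pixel, pick the colour from the layer selected by the integer ID attachment."""
    out = []
    for y, row in enumerate(material_ids):
        out_row = []
        for x, mat_id in enumerate(row):
            layer = layers.get(mat_id)  # None when no material covers this pixel
            out_row.append(layer[y][x] if layer is not None else background)
        out.append(out_row)
    return out


RED = (1.0, 0.0, 0.0)
BLUE = (0.0, 0.0, 1.0)

# 2x2 integer "attachment"; 0 is assumed to mean "no material here".
ids = [[0, 1],
       [2, 1]]

# One colour "attachment" per material ID, each the same size as the ID attachment.
layers = {
    1: [[RED, RED], [RED, RED]],
    2: [[BLUE, BLUE], [BLUE, BLUE]],
}

image = compose(ids, layers)
```

In a real renderer the same selection would presumably happen in the fragment shader of the composition pipeline, reading the ID and colour attachments written by the earlier pass.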
GLSL doesn't support variable-length arrays (VLAs) as C99 does, so doing all of this in shader code would require a lot of custom code. That is not flexible enough for me.
Please do refute me in the comments if the above approach has any shortcomings.