OK, I think I found a good enough balance between performance and pop-ins: in a simple scene it improved my framerate by about 20% compared with doing no occlusion culling at all, and it should help even more in a complex one. Maybe it will help someone, or maybe you can help me improve or correct it with answers in this thread.
Initialization:
- On mesh creation/loading I generate a dedicated AABB mesh for the parent mesh.
- I create a dedicated ID3D11Query object for each parent mesh, using a D3D11_QUERY_OCCLUSION description, and give each parent mesh object a UINT64 variable for the query result.
- I add a UINT m_framesElapsed variable to introduce a delay between issuing the query and reading its result, because reading the result in the same frame obliterates performance.
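The AABB generation from the first step could look roughly like this. It is only a sketch in plain C++ with no D3D11 calls; `Vec3`, `Aabb`, `ComputeAabb` and `AabbCorners` are names made up for illustration, and I assume the mesh positions are available as a simple array on the CPU.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Tightest axis-aligned box around the given positions.
// For an empty input the box stays inverted (min = +inf, max = -inf).
Aabb ComputeAabb(const std::vector<Vec3>& positions) {
    const float inf = std::numeric_limits<float>::infinity();
    Aabb box{{inf, inf, inf}, {-inf, -inf, -inf}};
    for (const Vec3& p : positions) {
        box.min = {std::min(box.min.x, p.x), std::min(box.min.y, p.y), std::min(box.min.z, p.z)};
        box.max = {std::max(box.max.x, p.x), std::max(box.max.y, p.y), std::max(box.max.z, p.z)};
    }
    return box;
}

// The 8 corners of the box, usable as vertices of the proxy mesh.
// Bit 0 of the index selects x, bit 1 selects y, bit 2 selects z.
std::array<Vec3, 8> AabbCorners(const Aabb& b) {
    std::array<Vec3, 8> c{};
    for (std::size_t i = 0; i < 8; ++i) {
        c[i] = {(i & 1) ? b.max.x : b.min.x,
                (i & 2) ? b.max.y : b.min.y,
                (i & 4) ? b.max.z : b.min.z};
    }
    return c;
}
```

The corners would still need an index buffer (12 triangles) before they can be drawn as a box.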
Mesh objects have the class methods BeginQuery(), EndQuery() and GetQueryResult(). BeginQuery() and EndQuery() execute only when m_framesElapsed is 0, and GetQueryResult() reads the result only when m_framesElapsed is 2 (the lowest I could go in my case). When the counter is not 2, GetQueryResult() increments it by one instead; when it is 2, it reads the result and resets the counter to 0.
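To make the counter logic easier to follow, here is a small simulation of it in plain C++. It is not my actual class: `FakeGpuQuery` is a made-up stand-in for the ID3D11Query object, and `OcclusionTracker`, `ShouldIssueQuery` and `EndFrame` are hypothetical names. In real D3D11 code the read would go through ID3D11DeviceContext::GetData, which can return S_FALSE when the result is not ready yet, so the stored result should only be overwritten on S_OK.

```cpp
#include <cstdint>

// Hypothetical stand-in for an occlusion query: it just stores the
// number of samples that passed the depth test.
struct FakeGpuQuery {
    std::uint64_t samplesPassed = 0;
    void Issue(std::uint64_t samples) { samplesPassed = samples; }
};

class OcclusionTracker {
public:
    static constexpr unsigned kReadbackFrame = 2;  // delay described above

    // BeginQuery()/EndQuery() would only run when this returns true.
    bool ShouldIssueQuery() const { return m_framesElapsed == 0; }

    // Mirrors GetQueryResult(): read on frame 2 and reset, otherwise count up.
    // Returns true when a result was read this frame.
    bool EndFrame(const FakeGpuQuery& query) {
        if (m_framesElapsed == kReadbackFrame) {
            m_visibleSamples = query.samplesPassed;
            m_framesElapsed = 0;
            return true;
        }
        ++m_framesElapsed;
        return false;
    }

    bool IsVisible() const { return m_visibleSamples > 0; }

private:
    unsigned m_framesElapsed = 0;
    std::uint64_t m_visibleSamples = 1;  // assumption: visible until proven occluded
};
```

With this counter a query is issued every third frame, and the mesh keeps its previous visibility state until the next result arrives.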
Now in the main rendering pass I first render all high-detail meshes as usual, straight into both the render target and the depth buffer. Then, in the occlusion pass further down the line, I render the AABB meshes against that depth buffer without writing color, and query each one against the high-detail meshes already there: BeginQuery(), DrawCall(), EndQuery(), GetQueryResult(). I write the result into that high-detail mesh's UINT64 variable, so on the next frame it will either be shown or be occluded. A nice side effect is that, because an AABB takes up more space than the high-detail mesh it encloses, it introduces a safety margin that helps with pop-ins and pop-outs: the high-detail mesh is guaranteed to be fully occluded by the time its AABB is fully occluded.
However, because of the 2-frame delay there are still noticeable pop-ins on complex multi-node/multi-mesh models, and right now I don't know how to deal with or eliminate them (maybe cheat by increasing the scale/size of the AABBs?). Still, the performance increase is quite noticeable.
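The "cheat" of enlarging the AABBs could be sketched like this. I have not tested whether it actually removes the pop-ins; `Vec3`, `Aabb` and `InflateAabb` are illustrative names, and the scale factor would have to be tuned by hand.

```cpp
struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Scale the box about its own center. A scale above 1 makes the proxy
// pass the occlusion query earlier, trading some culling for fewer pop-ins.
Aabb InflateAabb(const Aabb& b, float scale) {
    const Vec3 center{(b.min.x + b.max.x) * 0.5f,
                      (b.min.y + b.max.y) * 0.5f,
                      (b.min.z + b.max.z) * 0.5f};
    const Vec3 half{(b.max.x - b.min.x) * 0.5f * scale,
                    (b.max.y - b.min.y) * 0.5f * scale,
                    (b.max.z - b.min.z) * 0.5f * scale};
    return {{center.x - half.x, center.y - half.y, center.z - half.z},
            {center.x + half.x, center.y + half.y, center.z + half.z}};
}
```

For example, a box from (0, 0, 0) to (2, 4, 6) inflated by 1.5 becomes (-0.5, -1, -1.5) to (2.5, 5, 7.5).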
Another thing I'm thinking about: before the AABB occlusion pass, I'd copy the full-res depth buffer, downscale it into a new, much smaller depth buffer, and then run the AABB rendering/occlusion testing pass against that smaller buffer. Maybe that will save even more performance, since I expect the copy/resize to be cheap compared with querying against a buffer that has 4 times the pixels, hmm?
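One detail to watch with the downscale is which depth value each small pixel keeps. A CPU-side sketch of a half-resolution reduction follows (on the GPU this would be a shader pass instead). It assumes the usual convention where 0 is near and 1 is far, and even width and height; `DownsampleDepthMax` is a made-up name. Keeping the farthest value of each 2x2 block means the small buffer never claims an occluder is closer than it really is, so the test should stay conservative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Halve a row-major depth buffer in both dimensions, keeping the
// farthest (largest) depth of every 2x2 block.
// Assumes width and height are even and depth.size() == width * height.
std::vector<float> DownsampleDepthMax(const std::vector<float>& depth,
                                      std::size_t width, std::size_t height) {
    const std::size_t outW = width / 2;
    const std::size_t outH = height / 2;
    std::vector<float> out(outW * outH);
    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            const std::size_t sx = x * 2;
            const std::size_t sy = y * 2;
            const float top = std::max(depth[sy * width + sx],
                                       depth[sy * width + sx + 1]);
            const float bottom = std::max(depth[(sy + 1) * width + sx],
                                          depth[(sy + 1) * width + sx + 1]);
            out[y * outW + x] = std::max(top, bottom);
        }
    }
    return out;
}
```

Halving each dimension gives a buffer with a quarter of the pixels, which matches the "4 times" figure above.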
It's an interesting case study either way.