It would help to know which version of Juju you are running. Our own production controller runs stably at approximately the same size (1100 total machines, 260 models, 6000 total units). When you say 5-10 models and 300 machines, do you mean 300 machines in each of 10 models (10 * 300 = 3000 machines), or 300 machines distributed across the 10 models?
We made several fixes in the 2.5 series that helped with stability for our production controller. (It has been running 2.5.1 stably since that release came out, though we are looking to upgrade to 2.6.8 in the next week.)
Random disconnects sound like they could be caused by something like an MTU mismatch.
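As a rough way to look for that, here is a minimal Python sketch that reads the MTU of each network interface on a Linux host and flags any that differ from the most common value. It assumes Linux sysfs at `/sys/class/net`; the function names are made up for illustration and are not part of Juju. A differing MTU is not necessarily wrong (overlay interfaces are often smaller on purpose), so treat the output as a hint of where to look.

```python
from collections import Counter
from pathlib import Path

# 20-byte IPv4 header + 8-byte ICMP header (assumes IPv4 with no IP options).
IPV4_ICMP_OVERHEAD = 28


def read_interface_mtus(sysfs_root="/sys/class/net"):
    """Return {interface_name: mtu} read from Linux sysfs (empty if unavailable)."""
    mtus = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return mtus
    for iface in sorted(root.iterdir()):
        try:
            mtus[iface.name] = int((iface / "mtu").read_text().strip())
        except (OSError, ValueError):
            # Not an interface directory, or the value was unreadable; skip it.
            continue
    return mtus


def mtu_outliers(mtus, ignore=("lo",)):
    """Return the interfaces whose MTU differs from the most common MTU."""
    considered = {name: mtu for name, mtu in mtus.items() if name not in ignore}
    if not considered:
        return {}
    common_mtu, _ = Counter(considered.values()).most_common(1)[0]
    return {name: mtu for name, mtu in considered.items() if mtu != common_mtu}


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_ICMP_OVERHEAD


if __name__ == "__main__":
    found = read_interface_mtus()
    for name, mtu in mtu_outliers(found).items():
        print(f"{name}: MTU {mtu} differs from the other interfaces")
    for name, mtu in found.items():
        print(f"{name}: mtu={mtu}, max unfragmented ping payload={max_ping_payload(mtu)}")
```

Running it on the controller and on a machine that keeps disconnecting gives you two lists to compare. The payload size it prints is what you would pass to a don't-fragment ping between the two hosts (on Linux, something like `ping -M do -s 1472 <host>` for a 1500-byte MTU) to see whether full-size packets actually make it across the path.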
"Model not found" seems like a much more significant issue, and it may well be version dependent: there could be a known bug in that version of Juju that produces that error.
Another recommended practice if you are going to run juju controllers at scale is to set up Prometheus and Grafana:
That can often give very interesting insights into what the controller is doing while it runs. (Which API calls are taking a long time, which database operations are causing problems, etc.)
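To give a feel for the kind of question those metrics can answer, here is a small Python sketch that takes text in the Prometheus exposition format and ranks label sets by mean request duration, using the `_sum` and `_count` series of a summary or histogram. The metric name `example_api_request_duration_seconds`, its labels, and the sample values are all invented for illustration; use whatever names your controller actually exports. The parser is deliberately simplified (it ignores unlabelled samples and optional timestamps).

```python
import re

# Invented sample data in Prometheus text exposition format.
SAMPLE = """
# TYPE example_api_request_duration_seconds summary
example_api_request_duration_seconds_sum{facade="Client",method="FullStatus"} 120.0
example_api_request_duration_seconds_count{facade="Client",method="FullStatus"} 40
example_api_request_duration_seconds_sum{facade="Uniter",method="WatchUnit"} 5.0
example_api_request_duration_seconds_count{facade="Uniter",method="WatchUnit"} 500
"""

# Matches lines of the form: name{labels} value
SAMPLE_RE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")


def mean_duration_by_labels(text, base_name):
    """Return [(labels, mean_seconds)] sorted slowest first."""
    sums, counts = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # simplified parser: skip anything it does not understand
        name, labels, value = match.groups()
        if name == base_name + "_sum":
            sums[labels] = float(value)
        elif name == base_name + "_count":
            counts[labels] = float(value)
    means = {
        labels: sums[labels] / counts[labels]
        for labels in sums
        if counts.get(labels)  # skip label sets with no or zero count
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for labels, mean in mean_duration_by_labels(SAMPLE, "example_api_request_duration_seconds"):
        print(f"{mean:8.3f}s  {labels}")
```

In practice you would let Prometheus and Grafana do this with a query and a dashboard rather than parsing the text yourself; the sketch only shows the arithmetic behind a "slowest API calls" panel.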
I know we’re a bit busy at a conference this week, but likely we’d be interested to see the production system and see if there is anything we can help with.