COS Lite ingestion limits for 4cpu-8gb-ssd

sed-i · 11 April 2023 05:11

One of the goals for COS Lite is to be able to ingest a considerable amount of data on modest hardware. Load testing is useful for gaining insight into how to size observability clusters appropriately.

Method

No resource limits set (for now).
MicroK8s 1.24, Juju 2.9.34.
Loki workload version was pinned to 2.4.1 (reason).
The latest results were obtained with the following charm versions:

App	Version	Charm	Channel	Rev
alertmanager	0.25.0	alertmanager-k8s	edge	64
cos-config	3.5.0	cos-configuration-k8s	edge	31
grafana	9.2.1	grafana-k8s	edge	76
loki	2.4.1	loki-k8s	edge	80
prometheus	2.42.0	prometheus-k8s	edge	119
scrape-config	n/a	prometheus-scrape-config-k8s	edge	39
scrape-target	n/a	prometheus-scrape-target-k8s	edge	25
traefik	2.9.6	traefik-k8s	edge	124

The load test uses Terraform to provision:

COS Lite VM. Deployed from the cos-lite bundle with an overlay dedicated for the load test.
Avalanche VM. Used as scrape targets for Prometheus. The number of targets and the number of metrics per target are adjustable.
Query VM. Used for “browser testing” with 20 “virtual eyeballs” looking at a Grafana dashboard consisting of 600 log lines and 7200 graph points, refreshing every 5 seconds.
Loki log VM. Used for pushing logs to Loki from multiple virtual targets (“streams”).
Monitoring VM. Used for collecting metrics from the load test’s components for monitoring and producing performance data sheets.
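As a rough illustration of how the two adjustable Avalanche knobs relate to the "scraped data points / min" figure used in the results, here is a minimal sketch. It assumes every series yields exactly one data point per scrape; the scrape interval and the example numbers are hypothetical and are not taken from the load test configuration.

```python
def datapoints_per_minute(targets: int, series_per_target: int,
                          scrape_interval_s: float) -> float:
    """Data points scraped per minute.

    Assumption: each series produces one data point per scrape.
    """
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    scrapes_per_minute = 60.0 / scrape_interval_s
    return targets * series_per_target * scrapes_per_minute


# Hypothetical values, for illustration only.
rate = datapoints_per_minute(targets=20, series_per_target=1000,
                             scrape_interval_s=15)
print(f"{rate:,.0f} data points / min")
```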


Results

	Metrics only	Logs only	Metrics and Logs
Max scraped data points / min	1,710,000	-	1,200,000
Max ingested log lines / min	-	180,000	135,000
Storage (GB/day)	4	54	50

Note that the results above do not leave any leeway. For production you should probably add a margin of more than 10%.
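One way to read the margin advice is to plan for a sustained load that stays below the measured maximum by the chosen fraction. The sketch below applies that reading to the "Metrics and Logs" column; the function name and the 10% default are illustrative assumptions, not part of the load test code.

```python
# Maximum rates from the "Metrics and Logs" column of the results table.
MAX_DATAPOINTS_PER_MIN = 1_200_000
MAX_LOG_LINES_PER_MIN = 135_000


def with_margin(max_rate: float, margin: float = 0.10) -> float:
    """Planning rate that keeps `margin` (a fraction) of headroom
    below the measured maximum."""
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    return max_rate * (1 - margin)


print(f"{with_margin(MAX_DATAPOINTS_PER_MIN):,.0f} data points / min")
print(f"{with_margin(MAX_LOG_LINES_PER_MIN):,.0f} log lines / min")
```

A larger `margin` is the safer choice when charm revisions are not pinned, since the measured maxima varied between runs.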

Discussion

VM resource usage

Load tests that run successfully for over 6h without incidents are marked as “PASSED”, and are used for constructing the datasheet. Passing tests are also used for curve fitting an approximation for resource usage.

In the diagram below:

Numbers in square boxes represent passing tests.
Red “X” marks represent failed tests.
Gray dashed contours are the resource consumption estimation, based on curve fitting a linear function in two variables (y = a1 * x1 + a2 * x2 + b) to all the tests.
Some data points appear inconsistent with the trend. This is most likely due to changing conditions from test to test, such as a new charm release, compaction taking place close to recording the final test data, or timing alignment of the querying browser instances producing a spike. As the charms and Juju stabilize, the numbers are expected to be more consistent in future tests.
Log line ingestion rate is CPU-limited, and metrics ingestion rate is memory-limited.
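The kind of fit described above (a linear function in two variables, y = a1 * x1 + a2 * x2 + b) can be sketched with ordinary least squares. This is a self-contained illustration, not the code used for the datasheet: the function name is hypothetical, and the sample points are synthetic values generated from a known plane rather than rows from the load test CSV.

```python
def fit_plane(samples):
    """Least-squares fit of y = a1*x1 + a2*x2 + b.

    `samples` is a list of (x1, x2, y) tuples. Returns (a1, a2, b).
    """
    # Build the normal equations (A^T A) p = A^T y for rows [x1, x2, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for x1, x2, y in samples:
        row = (x1, x2, 1.0)
        for i in range(3):
            aty[i] += row[i] * y
            for j in range(3):
                ata[i][j] += row[i] * row[j]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique plane")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Synthetic samples lying exactly on y = 2*x1 + 3*x2 + 1 (not measured data).
SYNTHETIC_SAMPLES = [(0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6), (2, 1, 8)]
a1, a2, b = fit_plane(SYNTHETIC_SAMPLES)
print(f"a1={a1:.3f} a2={a2:.3f} b={b:.3f}")
```

With real test data, x1 and x2 would be the two ingestion rates (log lines and data points per minute) and y the resource being estimated. Scaling the rates to comparable magnitudes first, for example units of 100k and 1M per minute, keeps the normal equations well conditioned.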

Per-pod resource usage

In addition to the overall VM resource usage, we also look at the CPU and memory usage per pod (as reported by kubectl top pod).

We need 0.8 CPU and 1.4 Gi memory for every 100,000 log lines per minute ingested by Loki.
We need 0.12 CPU and 2.3 Gi memory for every 1,000,000 metric data points per minute ingested by Prometheus.

Pod	CPU	Memory
Loki	0.75 / 100k	1.3Gi / 100k
Prometheus	0.11 / 1M	2.3Gi / 1M

Table: pod resources needed for per-minute ingestion rates.
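The per-pod figures from the table can be turned into a quick sizing estimate. The sketch below assumes resource usage scales linearly with the ingestion rate and passes through zero, which ignores any baseline usage of an idle pod; the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodCost:
    cpu: float        # CPU cores per `unit` of ingestion
    memory_gi: float  # Gi of memory per `unit` of ingestion
    unit: int         # per-minute rate the two figures refer to


# Values taken from the table above.
LOKI = PodCost(cpu=0.75, memory_gi=1.3, unit=100_000)
PROMETHEUS = PodCost(cpu=0.11, memory_gi=2.3, unit=1_000_000)


def estimate(cost: PodCost, rate_per_min: float) -> tuple[float, float]:
    """Return (cpu, memory_gi) for a per-minute ingestion rate,
    assuming purely linear scaling."""
    scale = rate_per_min / cost.unit
    return cost.cpu * scale, cost.memory_gi * scale


# Example: the "Metrics and Logs" maxima from the results table.
loki_cpu, loki_mem = estimate(LOKI, 135_000)
prom_cpu, prom_mem = estimate(PROMETHEUS, 1_200_000)
print(f"Loki:       {loki_cpu:.2f} CPU, {loki_mem:.2f} Gi")
print(f"Prometheus: {prom_cpu:.2f} CPU, {prom_mem:.2f} Gi")
```

These numbers cover only the two ingesting pods. The rest of the stack (Grafana, Traefik, Alertmanager, and the Kubernetes and Juju components) also consumes part of the VM.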

Conclusions

The Grafana Labs stack that COS Lite is based on can ingest a substantial amount of data on fairly moderate hardware.
Being unable to deploy the same charm revisions (and workload versions) in every load test run introduced some jitter in the results.
More load tests are needed to improve the prediction accuracy of the fitted curves.

Data

https://github.com/canonical/cos-lite-bundle/blob/feature/custom-exporter/tests/load/gcp/var_ssd-4cpu-8gb.csv

Code

https://github.com/canonical/cos-lite-bundle/pull/65

mthaddon · 11 April 2023 10:54

Can you expand on this a bit. What’s the issue here?

sed-i · 11 April 2023 15:40

It’s a general “version pinning” comment.

For example, during the entire load test activity so far:

we added self-monitoring to all the charms, which ended up adding ~12k datapoints per minute
the Grafana image switched to v9 (other workload versions changed too, which may have resulted in performance differences from previous runs)

Since the bundle specifies “channel: edge”, the load test picks up whatever is currently there.

mthaddon · 11 April 2023 15:48

Understood, thanks

gustavosr98 · 29 November 2023 16:25

Is it 4 CPU 8GB RAM total or per kubernetes worker? How many kubernetes workers were used?

sed-i · 29 November 2023 16:59

4cpu8gb vm for the entire cos-lite deployment.

0x12b · 12 December 2023 13:11

well, we’re only using one vm in the tests, aren’t we? in a three node cluster, it would be per node if I’m not completely mistaken?

sed-i · 12 December 2023 14:07

For sure. I may have misunderstood @gustavosr98

gustavosr98 · 12 December 2023 16:38

Got it, thank you both @sed-i @0x12b