[specification] ISD075 GitHub Runner COS integration

Abstract

This specification covers the collection and delivery of metrics about the runners of the GitHub Runners charm and the charm itself. It targets the COS platform for metrics processing and visualisation. The content of this specification is based on the content of the (private) specification ISD069 - Everything about GitHub Runner Metrics.

Rationale

The current implementation of GitHub runners in production lacks metrics and logs, which prevents us from gaining detailed insight into machine utilisation and evaluating the efficiency of our GitHub runner charm.

In this specification, we target long-term monitoring as a goal, meaning that a delay of several minutes before updated metrics become available is acceptable.

Specification

Metrics

We can distinguish between metrics about the runners (e.g. total jobs started) and metrics about the charm (e.g. reconciliation time).

We need to further distinguish between the metrics collected for an instance of a charm and its runners, and the aggregation of these metrics. Aggregation is necessary because there may be multiple GitHub Runner charm applications and units deployed in different locations. The aggregation and presentation of metrics should be configured within the COS deployment.

There are also second-order metrics (metrics calculated from other metrics). These can also be calculated on the COS side and should not be handled by the metric processing code in the charm or its runners.

This does not mean that a Grafana dashboard or Prometheus alert rule definition cannot be added to the GitHub Runner repository; it is just outside the scope of this specification.

Most of the following metrics have already been covered in detail in ISD069 - Everything about GitHub Runner Metrics, but are repeated here for clarity.

Labels

Metrics can be further differentiated by the following labels, upon which aggregation can be applied:

  • workflow: The workflow name which triggered a job
  • repo: The name of the repository on which a job has been triggered
  • event: The name of the triggering event, e.g. push or pull_request
  • flavor: The runner’s flavour, determined by the charm application name, e.g. two-xlarge

Metrics about the Runners and Jobs

Total Runners initialised

This counter metric keeps track of the total number of runners created by the GitHub Runner Charm.

Labels: flavor

Total Started Jobs

This counter metric measures the total number of runners that have started a workflow job.

Labels: workflow, repo, event, flavor

Total Completed Jobs

This counter metric monitors the total number of runners that have normally completed a workflow job (runners which have reached the post-job phase).

Labels: workflow, repo, event, flavor

Total Crashed Runners

This counter metric monitors the total number of crashed runners (runners which have reached the pre-job but not the post-job phase).

Labels: flavor

Idle Runners

This gauge metric counts the number of available runners ready to undertake a job.

Labels: flavor

Runner installation duration

This metric measures the time required to install all the dependencies within a GitHub runner.

Labels: flavor

Job run duration

This metric measures the duration of a job run.

Labels: workflow, repo, event, flavor

Idle duration

This metric determines the amount of idle time a runner experiences before engaging in a job.

Labels: flavor

Job queue duration

This metric measures the time elapsed between a job being created and a runner picking it up.

Labels: flavor

Metrics about the charm and host

Reconciliation duration

This metric measures the time to finish a reconciliation run.

Machine information

These are multiple metrics about the host, such as CPU usage, memory usage, etc.

Second order metrics

Active Runners

This metric indicates the number of runners actively executing a job. It’s computed by subtracting the ‘Total Completed Jobs’ and ‘Total Crashed Runners’ values from the ‘Total Started Jobs’ values.

Machine utilisation

This metric gauges the utilisation of machine hours by the runners. It’s calculated by assessing the percentage of time during which the number of active runners is non-zero.
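As a minimal illustration of the intended calculations for both second-order metrics (the helpers that fetch the first-order values from the COS side are assumed and not shown):

def active_runners(total_started: int, total_completed: int, total_crashed: int) -> int:
    """Number of runners currently executing a job (second-order metric)."""
    return total_started - total_completed - total_crashed


def machine_utilisation(active_runner_samples: list) -> float:
    """Fraction of sampled intervals in which at least one runner was active."""
    if not active_runner_samples:
        return 0.0
    busy_samples = sum(1 for count in active_runner_samples if count > 0)
    return busy_samples / len(active_runner_samples)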

Metric exposure and instrumentation

As mentioned above, the COS stack is the target platform for this specification. Loki, the log aggregation component of the stack, is able to compute metrics on structured log entries using metric queries with LogQL. This option is also mentioned in ISD069 - Everything about GitHub Runner Metrics. Relevant events that contain information to calculate the metrics defined above could therefore be logged by the charm and a corresponding LogQL expression could be defined to compute the metric on the COS side.

For example, a log entry for a runner job started event might be

{
"event": "runner_start",
"log_timestamp": 1694687032,
"timestamp": 1694686432,
"flavor": "small",
"workflow": "test",
"repo": "canonica/test",
"github_event": "push",
"idle": 600
}

and a corresponding LogQL expression to calculate the metric “Total Started Jobs” for the last 5 minutes:

count_over_time({event="runner_start"}[5m])

Access to metrics

In order to collect the metrics, access to information inside the Runner is required (e.g. a timestamp of being in the pre-job phase). For security reasons, the Runner’s access to the outside world should be very limited. ISD069 - Everything about GitHub Runner Metrics has outlined several methods to access/transmit metrics information. The shared file system approach seems appropriate. Care must be taken to follow up on any security flaws discovered in the sharing of file systems between hosts and guests.

Information about job queuing times is only available via the GitHub API.

Implementation proposal

Use Promtail for metric delivery

The charm should install Promtail and use it to collect the metric events, which the charm outputs as log lines, and push them to Loki.

We assume that the log files are structured as JSON in the form

{
"event": "runner-start" /* event triggering metric */,
"log_timestamp": 1694687032 / *unix ts of when log has been emitted */,
...
}

so e.g.

{
"event": "runner_start",
"log_timestamp": 1694687032,
"timestamp": 1694686432,
"flavor": "small",
"workflow": "test",
"repo": "canonica/test",
"github_event": "push",
"idle": 600
}

Here is a sample configuration for Promtail:

server:
  disable: true

clients:
  - url: http://loki:3100/loki/api/v1/push

positions:
  filename: /tmp/positions.yaml

scrape_configs:
  - job_name: metrics
    static_configs:
      - targets:
          - localhost
        labels:
          job: runner-metrics
          __path__: /home/charm/runner-metrics.log
    pipeline_stages:
      - json:
          expressions:
            event: event
            timestamp: log_timestamp
      - timestamp:
          source: timestamp
          format: Unix
      - labels:
          event:
      - labelallow:
          - event
          - job

The above configuration would extract the label event with the value runner_start, which would create a distinct Loki stream (along with the other label “job”).

Loki provides powerful capabilities for filtering the content of the structured logs. For example, to extract the idle time and calculate its average over the last 5 minutes:

avg_over_time({job="runner-metrics", event="runner_start"} | json idle="idle" | unwrap idle [5m])

To calculate the count of started runners for a certain flavour, the following expression could be used:

count_over_time({job="runner-metrics", event="runner_start"} | json flavor="flavor" | flavor="small" [5m])

One could try to extract more labels from the structured logs to speed up querying, but overly high label cardinality is discouraged.

The URL of the Loki server must be obtained through a proper COS integration, e.g. loki_push_api. The charm therefore has to be adapted to support this integration.
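A rough sketch of how the charm could pick up the Loki push URL from such an integration and render it into Promtail’s client configuration. The relation name "logging", the relation data key "endpoint" and the Promtail configuration path are assumptions; the exact schema of the loki_push_api interface has to be checked, and the corresponding charm library could also be used directly instead:

import json

import yaml
from ops.charm import CharmBase, RelationChangedEvent


class GithubRunnerCharm(CharmBase):
    """Excerpt: wire the Loki push endpoint into the Promtail client configuration."""

    def __init__(self, *args):
        super().__init__(*args)
        # "logging" is an assumed relation name for the loki_push_api interface.
        self.framework.observe(self.on["logging"].relation_changed, self._on_logging_changed)

    def _on_logging_changed(self, event: RelationChangedEvent) -> None:
        # Assumed relation data layout: each Loki unit publishes its push endpoint
        # as JSON under an "endpoint" key, e.g. {"url": "http://loki:3100/loki/api/v1/push"}.
        urls = []
        for unit in event.relation.units:
            raw = event.relation.data[unit].get("endpoint")
            if raw:
                urls.append(json.loads(raw)["url"])
        if urls:
            self._update_promtail_clients(urls)

    def _update_promtail_clients(self, urls: list) -> None:
        # Only the clients section is rendered here; in practice it would be merged
        # with the scrape_configs shown above and the Promtail service restarted.
        config = {"clients": [{"url": url} for url in urls]}
        with open("/etc/promtail/promtail.yaml", "w") as config_file:
            yaml.safe_dump(config, config_file)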

Adaptation of the Runner

A runner should output job information during the pre-job and post-job phases to a shared file system created by the charm.

In the pre-job phase, the runner should output a JSON file called pre-job-metrics.json with the following content:

{
"workflow": <The workflow name, sourced from the GITHUB_WORKFLOW environment variable >
"repository": <The name of the repository that made the request, obtained from the GITHUB_REPOSITORY environment variable in the pre-run script.>
"event": <The name of the triggering event, collected from the GITHUB_EVENT_NAME environment variable.>
"timestamp": <current unix timestamp>,
"workflow_run_id": <The name of the run id, sourced from the GITHUB_RUN_ID environment variable>
}

So, e.g.

{
"workflow": "Integration Tests"
"repository": "canonical/upload-charm-docs"
"event": "push"
"timestamp": 1694002895,
"workflow_run_id": 1658821493
}

If the runner aborts in the pre-job phase due to a failed repo-policy check, it should emit a file called post-job-metrics.json with the following content:

{
"timestamp": <current unix timestamp>,
"status": "repo-policy-check-failure"
}

Note that in this case the runner does not reach the post-job phase.

In the post-job phase, the runner should emit a file called post-job-metrics.json with the following content:

{
"timestamp": <current unix timestamp>,
"status": "normal"
}

Adaptation of Charm

Shared file system

The charm should create a new file system for each runner and mount it in the runner VM.

To create a file system, the following commands could be used:

dd if=/dev/zero of=runner-xy-fs.img bs=20M count=1
mkfs.ext4 runner-xy-fs.img
mount -o loop runner-xy-fs.img /path/to/mount
lxc config device add runner-xy-fs-vm home disk source=/path/to/mount path=/metrics-exchange

The charm should place a timestamp in the runner’s file system after the installation is complete, so that the idle time can be calculated.

Metrics calculation and event output

The charm should examine and transmit the metrics after runner installation and during reconciliation. The transmission of metrics follows an event-based approach: structured JSON logs are emitted to a dedicated file, which is picked up by Promtail and sent to Loki (see the section above).

When the charm has finished installing a runner on the host machine, it should issue a log similar to the following:

{
"event": "runner-installed",
"timestamp": 1694690796 /* current unix timestamp */,
"flavor": "small",
"duration": 300 /* installation duration in seconds */
}

During reconciliation, the charm should examine the shared file systems created.

If a pre-job-metrics.json already exists, the charm should parse and validate the data and issue a log like:

{
"event": "runner_start",
"log_timestamp": 1694687032,
"timestamp": 1694686432, /* from pre-job-metrics.json */
"flavor": "small",
"workflow": "test", /* from pre-job-metrics.json */
"repo": "canonica/test", /* from pre-job-metrics.json */
"github_event": "push", /* from pre-job-metrics.json */
"idle": 600 /* difference between timestamp above and the one placed by charm in the fs */
}

If a post-job-metrics.json exists, a log like the following should be emitted:

{
"event": "runner_stop",
"log_timestamp": 1694687032,
"timestamp": 1694686432, /* from post-job-metrics.json */
"flavor": "small",
"workflow": "test", /* from pre-job-metrics.json */
"repo": "canonica/test", /* from pre-job-metrics.json */
"github_event": "push", /* from pre-job-metrics.json */
"status": "normal", /*from pre-job-metrics.json */
"duration": 600 /* difference between ts above and the one from pre-job-metrics.json */
}

Special care must be taken when validating the files from the shared file system (timestamp placed by charm and {pre, post}-job-metrics.json), as a malicious runner may have modified them.

For example, a malicious JSON string can cause the Python stdlib decoder to consume significant CPU and memory resources. In particular, the size of those files should be checked before they are parsed.

The shared file system should already have a limited size, but an explicit size check is safer in case someone later enlarges the shared file systems (e.g. to hold logs) without adding extra code to check the size of these files.

Furthermore, the logs are passed to the COS stack (Loki), and we should limit any possible exploitation of security vulnerabilities by checking that the files conform to the schema defined above. Libraries like jsonschema or pydantic can be used for this.
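A minimal sketch of such a validation using pydantic, including a size check before parsing; the concrete field types and the size limit are assumptions derived from the examples above:

import json
import os

import pydantic

MAX_METRICS_FILE_SIZE = 1024  # bytes; generous for the small files defined above (assumption)


class PreJobMetrics(pydantic.BaseModel):
    """Schema of pre-job-metrics.json as described above."""

    workflow: str
    repository: str
    event: str
    timestamp: int
    workflow_run_id: int


def read_pre_job_metrics(path: str) -> PreJobMetrics:
    """Read and validate a pre-job-metrics.json file from a shared file system."""
    # Reject oversized files before handing them to the JSON decoder.
    if os.path.getsize(path) > MAX_METRICS_FILE_SIZE:
        raise ValueError(f"{path} exceeds {MAX_METRICS_FILE_SIZE} bytes")
    with open(path) as metrics_file:
        data = json.load(metrics_file)
    # pydantic raises a ValidationError if fields are missing or have the wrong type.
    return PreJobMetrics(**data)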

The charm can calculate additional metrics during reconciliation. It can detect crashed runners by looking for unhealthy or offline machines whose metrics file system contains a pre-job-metrics.json but no post-job-metrics.json (meaning that the job was started but never completed).

The charm can also detect all idle runners by looking at all healthy online runners that do not have a pre-job timestamp on their shared file system. The job queue time (see section below) should also be calculated. Finally, the following logs should be issued:

{
"event": "reconciliation",
"log_timestamp": 1694687032,
"flavor": "small",
"crashed_runners": 10,
"idle_runners": 12,
"duration": 600
}

{
"event": "job_queuing",
"log_timestamp": 1694687032,
"flavor": "small",
"duration": 600
}
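For illustration, a small helper the charm could use to emit such events as single-line JSON entries into the file scraped by Promtail. The path matches the sample Promtail configuration above; the helper name is hypothetical:

import json
import time

METRICS_LOG_PATH = "/home/charm/runner-metrics.log"  # __path__ in the Promtail scrape config


def issue_event(event: str, **fields) -> None:
    """Append a structured metric event as one JSON line to the metrics log."""
    entry = {"event": event, "log_timestamp": int(time.time()), **fields}
    with open(METRICS_LOG_PATH, "a") as log_file:
        log_file.write(json.dumps(entry) + "\n")


# e.g. during reconciliation:
# issue_event("reconciliation", flavor="small", crashed_runners=10, idle_runners=12, duration=600)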

Cleanup

For each metrics file system without an associated runner, the charm should clean up the file system. If the above method is used to create a file system, unmounting and then deleting runner-xy-fs.img and /path/to/mount should be sufficient.
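A rough sketch of this cleanup, assuming the dd/mkfs/mount approach shown above (the paths are the hypothetical ones from that example):

import os
import shutil
import subprocess


def cleanup_metrics_filesystem(img_path: str, mount_path: str) -> None:
    """Unmount and remove a metrics file system that no longer has an associated runner."""
    # Unmount the loop-mounted image first; ignore errors if it is already unmounted.
    subprocess.run(["umount", mount_path], check=False)
    shutil.rmtree(mount_path, ignore_errors=True)
    if os.path.exists(img_path):
        os.remove(img_path)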

Job queue time

Job queue time can only be retrieved via the GitHub API.

The appropriate GitHub API endpoints would be https://docs.github.com/en/rest/actions/workflow-jobs?apiVersion=2022-11-28#get-a-job-for-a-workflow-run and https://docs.github.com/en/rest/actions/workflow-jobs?apiVersion=2022-11-28#list-jobs-for-a-workflow-run. The difference between the started_at and created_at timestamps can be used as the queuing time. The workflow run ID is required, and this information is available from the GITHUB_RUN_ID variable inside the Runner.

The charm can perform the calculation after examining the pre-job-metrics.json file of a runner, which contains the required workflow run ID. Once the duration has been calculated, it can be emitted in a structured log as described in the section above.
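As a rough illustration of this calculation, a Python sketch using the requests library against the "list jobs for a workflow run" endpoint referenced above; authentication handling, pagination and matching the job to the runner via runner_name are assumptions to be refined:

import datetime

import requests


def job_queue_duration(token: str, repository: str, run_id: int, runner_name: str) -> float:
    """Return the queue time in seconds for the job picked up by the given runner."""
    response = requests.get(
        f"https://api.github.com/repos/{repository}/actions/runs/{run_id}/jobs",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        timeout=10,
    )
    response.raise_for_status()
    for job in response.json()["jobs"]:
        # Assumption: the job is matched to this runner via its runner_name field.
        if job.get("runner_name") == runner_name:
            created = datetime.datetime.fromisoformat(job["created_at"].replace("Z", "+00:00"))
            started = datetime.datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
            return (started - created).total_seconds()
    raise ValueError(f"No job found for runner {runner_name}")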

Logs and machine information

The Grafana Agent Machine Charm can be used for charm unit and repo compliance service logs and for machine information metrics. The Grafana Agent uses an embedded Promtail to pick up logs, which has to be distinguished from the Promtail instance that delivers the metrics to Loki.

For crashed runners whose VMs are still alive, log retrieval via LXD file commands should be tried.

The job logs of the runner machines can be accessed in the _diag directory of the runner application. One could also try accessing /var/log/syslog for more information.

The logs should be copied to the /var/log directory on the host, as the Grafana Agent Machine charm configures Promtail to automatically pick up logs in this directory.

These log file names should be prefixed with the runner VM name. Promtail should attach the filename as a label, which should allow the different crashed runner VMs to be distinguished.
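A rough sketch of how the charm could retrieve these logs with LXD file commands and place them under /var/log; the _diag path inside the VM and the target file layout are assumptions:

import subprocess

# Assumed location of the runner application's diagnostic logs inside the VM.
RUNNER_DIAG_DIR = "/home/ubuntu/actions-runner/_diag"


def pull_crashed_runner_logs(runner_vm_name: str) -> None:
    """Copy diagnostic logs of a crashed (but still running) runner VM to /var/log."""
    # lxc file pull -r copies the directory recursively from the VM to the host;
    # the target paths are prefixed with the runner VM name.
    subprocess.run(
        ["lxc", "file", "pull", "-r", f"{runner_vm_name}{RUNNER_DIAG_DIR}",
         f"/var/log/{runner_vm_name}-diag"],
        check=True,
    )
    subprocess.run(
        ["lxc", "file", "pull", f"{runner_vm_name}/var/log/syslog",
         f"/var/log/{runner_vm_name}-syslog.log"],
        check=True,
    )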

Care must be taken not to increase the label cardinality too much, e.g. it may be necessary to merge the multiple log files into one to reduce the number of distinct filename labels.

Another approach would be to collect these logs with the Promtail instance already used for metrics delivery, using a separate configuration for them (and dropping the filename label if cardinality becomes an issue).

Promtail security advisories should be followed, as the logs retrieved from the runners could have been manipulated by a malicious runner, and a bug in Promtail could then lead to security issues.

Also, additional cleanup or a logrotate configuration is required, i.e. the log files moved to /var/log should be deleted after a period of time.

If it turns out that many crashed units are no longer alive and are removed because the LXD instance is ephemeral, it may be necessary to change the implementation to redirect the logs to the shared file system for post-mortem examination.

Error handling

Any error in metrics handling should not prevent a runner or the charm from working. Metrics processing should be considered optional. Therefore, any exception related to the processing should be properly caught and logged. Loki alert rules should be defined to alert in case of too many errors.

Further Information

Discarded options

Metrics service

Instead of using Promtail to deliver the logs to Loki or expose the metrics to be scraped by Prometheus, the charm could install a metrics exposure and instrumentation service (metrics_service), similar to the repository compliance check (repo_check_web_service). This metrics service could then be scraped by Prometheus. The Prometheus client library, deployed as part of a Flask application exposing the metrics on a port to be specified, could be used. In addition, an HTTP API could be added to the metrics_service for instrumentation: for each first-order metric defined above, an endpoint could be added. This service could also issue one-time tokens that can be used within the Runner VM to add instrumentation.

A simple example of a metrics service with two instrumentation endpoints:

from flask import Flask, Blueprint, Response
from flask_httpauth import HTTPTokenAuth
from flask_pydantic import validate

from werkzeug.middleware.dispatcher import DispatcherMiddleware
from prometheus_client import make_wsgi_app, Counter, Histogram

from somewhere import charm_app_name, JobInfoRequestBody, JobInfoWithDurationRequestBody

runner_metrics = Blueprint("runner_metrics", __name__)
# Token verification (an auth.verify_token callback) is omitted for brevity.
auth = HTTPTokenAuth(scheme="Bearer")

c = Counter('runner_started', 'Runner Starts', ['workflow', 'repo', 'event', 'flavor'])


@runner_metrics.route('/runners/started', methods=["POST"])
@auth.login_required
@validate()
def runner_started_counter(body: JobInfoRequestBody) -> Response:
    c.labels(body.workflow, body.repo, body.event, charm_app_name).inc()
    return Response(status=200)


h = Histogram('jobrun_duration_seconds', 'Duration of a job run',
              labelnames=['workflow', 'repo', 'event', 'flavor'],
              buckets=[30, 60, 120, 300, 600, 1200, 3000, 6000])


@runner_metrics.route('/job/run', methods=["POST"])
@auth.login_required
@validate()
def job_run_histo(body: JobInfoWithDurationRequestBody) -> Response:
    h.labels(body.workflow, body.repo, body.event, charm_app_name).observe(body.duration)
    return Response(status=200)


# Create the app
app = Flask(__name__)
app.register_blueprint(runner_metrics)

# Add the Prometheus WSGI middleware to route /metrics requests
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})

Implementing the endpoints would be straightforward. This option was rejected as it would require more effort to implement than using Promtail.

Using GitHub API

Many of the above metrics can be calculated using the GitHub API. However, in order to use the API, we need to know the workflow run ID, and this information is only easily accessible from within the Runner (using the GITHUB_RUN_ID environment variable). So a simple implementation would still require access to the Runner. An alternative would be to list all repositories and all self-hosted runners and iterate through all workflows to find a match, but this does not seem to be a good way to go in terms of implementation complexity and performance.

A disadvantage of using the GitHub API would be that the metrics code would rely on the stability of the GitHub API. Currently, the API is supported for at least 24 months after a new API version is released.

Also, it might be difficult to get all the metrics using only the GitHub API (e.g. if the Runner crashed, it might not be easy to distinguish between a normal job failure and a crash).

Using HTTP requests to exchange information between Runner and Metrics Server

An additional method that was considered was to access a metrics server from within the Runners via HTTP requests with one-time tokens, similar to what is currently done when accessing the repo_check_web_service.

The benefits over other methods would be only a slight delay in updating the metrics and the ability to use this approach when running runners on remote LXD clusters.

The use of one-time tokens and message validation aims to ensure that access is highly restricted.

There were concerns about the safety of this approach. Additionally, status tracking on the metrics server side would be required to detect crashed units (since one proposed detection method would be to detect if a runner has reached the post-job stage).

Using LXD file command to exchange information between Runner and Charm

The charm could access files written inside the Runner. This would probably be the safest option from a security point of view, as the Runner machine has no interaction with the charm or other running processes in the unit. This requires that the Runner VMs are still running after the job has finished, because the lxd agent needs to run inside them. However, currently the runner VMs are shut down and removed after termination. Therefore, we would need to delay VM termination at some point after metric extraction, which would reduce overall security.