[specification] ISD077 Jenkins-k8s: Jenkins observability

Abstract

This document describes the approach to observability for the Jenkins-k8s charm. For k8s charms, the observability team provides the COS (Canonical Observability Stack), whose core components include Prometheus, Grafana and Loki.

The observability solution consists of 3 parts:

  1. Jenkins server metrics for Prometheus
  2. Jenkins server logs for Loki
  3. A custom Grafana dashboard for Jenkins monitoring

The monitoring solution is designed for the Jenkins-k8s server charm; observability of the agents is left for later stages.

Rationale

Jenkins provides very simple monitoring metrics natively.

However, it only exposes 4 metrics: online executors, busy executors, queue length and available executors. These are not enough to give visibility into the state of the container and the JVM.

The jenkins-k8s charmed application can only be deployed as a single unit because Jenkins has no cluster mode. This makes Jenkins quite sensitive to load, so observability plays a critical role in maintaining the Jenkins infrastructure.

Furthermore, Jenkins does not expose more detailed metrics by default, and it is hard to see the total number of jobs queued at a given time, which makes managing Jenkins difficult.

Specification

Metrics

The proposed approach is to use the Jenkins Prometheus plugin. The plugin exposes metrics ranging from JVM statistics to the number of registered executors via the /prometheus endpoint. The endpoint is configurable via the PROMETHEUS_ENDPOINT environment variable. See the full list of configurable environment variables on the official plugin page.

The implementation is straightforward: using the prometheus_k8s/v0 library, a Prometheus scrape job can be passed as a dictionary. See the example in the WordPress-k8s charm.
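As an illustration of what such a job dictionary might look like, here is a minimal sketch. The port 8080 and the helper name build_scrape_jobs are assumptions made for this example, not values taken from the charm. The "*" host is the wildcard form that the scrape library is understood to expand to the unit addresses.

```python
from typing import Any, Dict, List

# Assumed values for illustration; not taken from the charm itself.
JENKINS_WEB_PORT = 8080        # assumption: Jenkins listening on its usual HTTP port
METRICS_PATH = "/prometheus"   # endpoint exposed by the Jenkins Prometheus plugin


def build_scrape_jobs(
    port: int = JENKINS_WEB_PORT, path: str = METRICS_PATH
) -> List[Dict[str, Any]]:
    """Return a list holding one Prometheus scrape job described as a dictionary."""
    return [
        {
            "metrics_path": path,
            # "*" stands for "every unit of this application".
            "static_configs": [{"targets": [f"*:{port}"]}],
        }
    ]


# Inside a charm, the result would typically be handed to the library, e.g.
#   MetricsEndpointProvider(self, jobs=build_scrape_jobs())
# with MetricsEndpointProvider imported from
# charms.prometheus_k8s.v0.prometheus_scrape. It is left as a comment here
# because the charm library is only importable inside a charm project.
```

If PROMETHEUS_ENDPOINT is changed on the Jenkins side, the path argument would need to be changed to match.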

Other approaches that were considered are:

  • A custom sidecar web server that uses jenkins-cli to scrape metrics. While this method allows a wide range of customization, it has been discarded due to the complexity of deployment.
  • Nagios Remote Plugin Executor (NRPE). This option has been discarded since the COS is preferred over NRPE.

Logs

By default, Jenkins started from the jenkins.war file writes its logs to stdout. Note that the /var/lib/jenkins/logs path exists; however, it only contains agent node logs and task logs, under the slaves and tasks subdirectories, and not the controller logs.

By setting the Java system property -Djava.util.logging.config.file, a logging.properties configuration file can be passed as an argument to direct the logging output to the desired location. An example of the logging.properties file can be found here.
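A minimal sketch of how the start command could be assembled with this property is shown here. Both file paths and the helper name jenkins_command are hypothetical and chosen only for illustration.

```python
from typing import List

# Hypothetical paths, for illustration only.
LOGGING_CONFIG_PATH = "/var/lib/jenkins/logging.properties"
JENKINS_WAR_PATH = "/srv/jenkins/jenkins.war"


def jenkins_command(
    config_path: str = LOGGING_CONFIG_PATH, war_path: str = JENKINS_WAR_PATH
) -> List[str]:
    """Build the argument list that starts Jenkins with a custom logging config."""
    return [
        "java",
        # JVM system properties must come before -jar; anything after the
        # jar path is passed to Jenkins as a program argument instead.
        f"-Djava.util.logging.config.file={config_path}",
        "-jar",
        war_path,
    ]
```

Returning a list rather than a single string avoids shell quoting issues if the command is later joined or passed to a process launcher.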

The logging.properties file will contain the following configuration values:

  • Default Java logging handler to capture all logs: java.util.logging.ConsoleHandler
  • Log capture level: INFO
  • FileHandler pattern: the pattern used to generate the log file names.

With the configuration settings above, a log file named java.log will be generated in the JENKINS_HOME directory (currently set to /var/lib/jenkins). The file is then ready to be scraped by the Promtail worker spawned by the Loki library.
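The settings listed above could be rendered into a logging.properties file roughly as follows. This is a sketch under assumptions: the helper name is hypothetical, and registering java.util.logging.FileHandler alongside the console handler, together with the SimpleFormatter and append settings, is an assumption about how the file output would be enabled rather than the charm's actual configuration.

```python
from pathlib import PurePosixPath

JENKINS_HOME = PurePosixPath("/var/lib/jenkins")  # value stated in this document
LOG_FILE_NAME = "java.log"                        # file name stated in this document


def render_logging_properties(
    home: PurePosixPath = JENKINS_HOME, level: str = "INFO"
) -> str:
    """Render java.util.logging properties that log to console and to a file."""
    lines = [
        # Assumption: the file handler is registered next to the console handler.
        "handlers=java.util.logging.ConsoleHandler,java.util.logging.FileHandler",
        # Root logger level.
        f".level={level}",
        # Pattern used by FileHandler to name the log file.
        f"java.util.logging.FileHandler.pattern={home / LOG_FILE_NAME}",
        # Assumption: plain-text output; FileHandler defaults to XML otherwise.
        "java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter",
        # Assumption: keep appending across restarts instead of truncating.
        "java.util.logging.FileHandler.append=true",
    ]
    return "\n".join(lines) + "\n"
```

The rendered text would then be written to the path given in -Djava.util.logging.config.file before Jenkins starts.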

Other approaches that were considered are:

  • JENKINS_LOG environment variable: this did not appear to work as expected.
  • OpenTelemetry plugin with an OpenTelemetry collector: this requires integrating another sidecar container, which isn’t quite supported by the Loki library. Furthermore, its main use case seems to center on tracing rather than logging.

Dashboard

The Grafana dashboard can be customized using the Prometheus metrics gathered above. The metrics of interest can be discussed further. For examples of key metrics to collect, refer to the solution suggested by Grafana for Jenkins.

Further Information

In later stages, monitoring must also cover VM-based deployments, so that machine agents are observable as well as k8s agents. This suggests that NRPE may be needed to collect metrics about the machine/host itself. Another way to monitor agents through the server node is a cron-based Groovy script that exports job data to a file.