Monitoring Comet on Kubernetes using Prometheus Operator¶

NOTE: This feature was introduced in version 4.1.0 of the Comet Helm Chart.

Our Helm Chart can optionally create Prometheus Operator custom resources (referred to here as CRDs) that configure Prometheus to monitor the Comet ML application on your cluster.

Probes¶

Probe CRDs tell Prometheus Operator how to configure a prober (usually the blackbox-exporter) to test the health of endpoints.

The Helm Chart can create two types of Probes:

- Comet Liveness
- Frontend Nginx Ingress

The Comet Liveness Probes will test the health of the /isAlive/ping endpoints for various components of the Comet ML application.

The Frontend Nginx Ingress Probes will test the availability of the Ingress for the Comet ML application, if one is used.
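
To make the role of a Probe concrete, here is a rough sketch of the kind of Probe object the Prometheus Operator accepts. It is not the manifest the Helm Chart renders: the object name, namespace, module name, and target hostname are all hypothetical, and the prober address simply mirrors the chart default described under Prober Configuration.

```yaml
# Illustrative only: NOT the manifest generated by the Comet Helm Chart.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: comet-liveness-api-example   # hypothetical name
  namespace: comet                   # hypothetical namespace
spec:
  # How often Prometheus asks the prober to run the check.
  interval: 30s
  # Prober module to run; http_2xx is assumed to be defined in your
  # blackbox-exporter configuration.
  module: http_2xx
  prober:
    url: blackbox-exporter.monitoring.svc.cluster.local:19115
    scheme: https
    path: /probe
  targets:
    staticConfig:
      static:
        # Hypothetical hostname; the path is one of the liveness endpoints.
        - https://comet.example.com/api/isAlive/ping
```

With a blackbox-exporter style prober, each target typically produces a probe_success series (1 for success, 0 for failure) that alerts can be built on.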

PrometheusRules¶

PrometheusRule CRDs tell Prometheus Operator how to configure Prometheus to evaluate given PromQL expressions. Rules can record pre-computed values to a new time series, or trigger Alerts, which is what the Helm Chart uses them for.

The Helm Chart can create two types of Alerts:

- Probe Failure Alerts
  - Frontend Nginx Ingress
  - Comet Liveness
- Replica Health
  - Fires if the desired replica count cannot be reached.
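
As an illustration of the alert types listed above, this is a minimal sketch of a PrometheusRule containing one replica-health style alert. It is not the rule the Helm Chart generates: the object, group, alert, and Deployment names are hypothetical, and the expression assumes kube-state-metrics is installed and scraped by Prometheus.

```yaml
# Illustrative only: NOT the rule generated by the Comet Helm Chart.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: comet-replica-health-example   # hypothetical name
  namespace: comet                     # hypothetical namespace
spec:
  groups:
    - name: comet-replica-health-example
      # How often the rules in this group are evaluated.
      interval: 30s
      rules:
        - alert: ExampleBackendPythonReplicaHealth   # hypothetical alert name
          # Fires when fewer replicas are available than requested.
          # Both metrics come from kube-state-metrics; the Deployment
          # name is an assumption.
          expr: >-
            kube_deployment_status_replicas_available{deployment="comet-backend-python"}
            < kube_deployment_spec_replicas{deployment="comet-backend-python"}
          # The condition must hold this long before the alert fires.
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: Deployment has fewer available replicas than desired.
```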

List of Alerts¶

Some of these alerts may not exist if your deployment doesn't include the corresponding components.

CometMLBackendMpmReplicaHealth¶

CometMLBackendOptimizerReplicaHealth¶

CometMLBackendPostprocessReplicaHealth¶

CometMLBackendPythonReplicaHealth¶

CometMLBackendReactReplicaHealth¶

CometMLFrontendNginxReplicaHealth¶

CassandraReplicaHealth¶

MinioReplicaHealth¶

MysqlReplicaHealth¶

RedisReplicaHealth¶

By default, all ReplicaHealth alerts will fire if the replication object (Deployment, StatefulSet, etc.) has been failing to achieve the desired replica count.

CometMLIngressProbe¶

By default, the IngressProbe alert will fire if the Ingress becomes sufficiently unavailable during the configured period of time.

CometMLLivenessApiIsAlivePingProbe¶

CometMLLivenessClientlibIsAlivePingProbe¶

CometMLLivenessOptimizerIsAlivePingProbe¶

CometMLLivenessPostprocessIsAlivePingProbe¶

By default, the IsAlivePingProbe alerts will fire if their corresponding endpoint becomes sufficiently unavailable during the configured period of time.

Configuration¶

The creation of these CRDs is enabled through the monitoring section of your values overrides. The simplest configuration, which enables only the replica health alerts, would be:

# ...
monitoring:
  promOperatorCRDs:
    alerts:
      replicas:
        enabled: true
# ...

The other sets of CRDs can be enabled like so:

# ...
monitoring:
  promOperatorCRDs:
    probes:
      # Will only do anything if you enable the creation of the ingress
      # in the frontend section.
      ingress:
        enabled: true
      # Requires that comet.cometHost be set correctly.
      cometLiveness:
        enabled: true
    alerts:
      # No dependencies. Will create a corresponding alert for all relevant
      # K8s Deployments (or other replicated objects).
      replicas:
        enabled: true
      # Will only be created if the corresponding probe is enabled.
      ingress:
        enabled: true
      # Will only be created if the corresponding probe is enabled.
      cometLiveness:
        enabled: true
# ...

Prober Configuration¶

The proberSpec section allows you to tell the Probes how to interact with your prober. By default, they assume you are using the blackbox-exporter in the monitoring namespace with a default configuration. However, you can customize the settings to match whatever prober you are using.

monitoring:
  promOperatorCRDs:
    probes:
      proberSpec:
        ## This must be set to the cluster URL for your prober.
        url: "blackbox-exporter.monitoring.svc.cluster.local:19115"
        scheme: https
        ## The endpoint path to use for requesting the prober execute a probe.
        path: "/probe"
        # proxyUrl: ""

Alerts Configuration¶

By default, the alerts use the quantile_over_time function to keep the alerting from flapping or being affected by outliers. You can tune several parameters to adjust the sensitivity and aggressiveness of each alert. You can also override the default expression and set your own custom expression for one or more of the replicated objects. The key for each is its component name (set in the app.kubernetes.io/component label).

monitoring:
  promOperatorCRDs:
    alerts:
      replicas:
        enabled: true
        ## How often to evaluate the rule.
        evaluationInterval: 30s
        ## How long the rule should be violated before alerting.
        failFor: 1m
        ## What percentile to consider when determining deployment health.
        percentile: "0.95"
        ## Over what time period the percentile should be calculated.
        percentileInterval: 5m
        ## Set these keys to override the alert's promQL expression for any of
        ## the replication objects.
        ## This will disable the use of the percentile & percentileInterval
        ## settings.
        # customExpressions:
        #   backend-mpm: ""
        #   backend-optimizer: ""
        #   backend-postprocess: ""
        #   backend-python: ""
        #   backend-react: ""
        #   frontend-nginx: ""
        #   cassandra: ""
        #   minio: ""
        #   mysql: ""
        #   redis: ""
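
For illustration, custom expressions for two of the components might look like the sketch below. These are not the chart's default expressions: the metric names come from kube-state-metrics, which is assumed to be installed, and the StatefulSet and Deployment names are hypothetical and must match the objects in your cluster.

```yaml
monitoring:
  promOperatorCRDs:
    alerts:
      replicas:
        enabled: true
        customExpressions:
          ## Simple comparison: fire when fewer Redis replicas are ready
          ## than desired. "redis" as the StatefulSet name is an assumption.
          redis: >-
            kube_statefulset_status_replicas_ready{statefulset="redis"}
            < kube_statefulset_replicas{statefulset="redis"}
          ## Percentile-style comparison: fire when the ratio of available
          ## to desired replicas was below 1 for roughly 95% or more of
          ## the samples in the last 5 minutes. The Deployment name is an
          ## assumption.
          backend-python: >-
            quantile_over_time(0.95,
            (kube_deployment_status_replicas_available{deployment="comet-backend-python"}
            / kube_deployment_spec_replicas{deployment="comet-backend-python"})[5m:])
            < 1
```

The first form reacts to any shortfall that lasts for the failFor duration, while the second tolerates brief dips, at the cost of reacting more slowly.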

Custom expressions can also be set for one or more of the Comet Liveness alerts. Here customExpressions is a dictionary keyed by endpoint path:

monitoring:
  promOperatorCRDs:
    alerts:
      cometLiveness:
        # ...
        customExpressions:
          /api/isAlive/ping: ""
          /clientlib/isAlive/ping: ""
          /postprocess/isAlive/ping: ""
          /optimizer/isAlive/ping: ""

Feb. 9, 2024