How to migrate logs stored in one Loki instance to a fresh new one
A few days ago @selcem came to the Observability team with a simple question: “How do we migrate the logs stored in one Loki deployment to another one?”
Since we had never done it before, we started looking for ideas online… and after some research we did not find any mature solution: a blog post sketching an idea, a not-well-tested tool, a GitHub issue, and not much else.
After a few hours of trial and error, @selcem, @sed-i and I came up with a solution to this problem.
Keep in mind that this solution worked well in our specific environment and may not work in yours.
For the migration process, use the same Loki charm channel/revision on both instances.
At a glance, the steps we followed to migrate logs from one Loki to another are (a condensed command sketch follows this list):
- Stop original Loki instance.
- Deploy a new Loki instance in the same model.
- Stop the new Loki instance.
- Verify Loki’s config files between instances.
- Backup the chunks and the index from the original instance.
- Restore the backup into the new Loki instance.
- Fix files and directories ownership.
- Start the new Loki instance.
- Verify that logs are stored in the new Loki instance.
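For the impatient, here is a condensed sketch of the whole procedure as shell commands. It assumes the same application names used throughout this post (loki for the original instance, loki-new for the new one); each step is explained in detail below.

# Stop the original Loki workload
juju ssh --container loki loki/0 /charm/bin/pebble stop loki

# Deploy the new instance with the same storage sizes, wait for the unit to
# come up, then stop its workload too
juju deploy loki-k8s loki-new --channel edge --storage active-index-directory=2GB --storage loki-chunks=5GB --trust
juju ssh --container loki loki-new/0 /charm/bin/pebble stop loki

# Back up chunks and index from the original unit
mkdir loki-data
juju scp loki/0:/var/lib/juju/storage/ loki-data

# Clear the new WAL, restore the backup and fix file ownership
juju ssh loki-new/0 "rm -rf /var/lib/juju/storage/loki-chunks/0/wal/*"
juju scp loki-data/loki-chunks/0/. loki-new/0:/var/lib/juju/storage/loki-chunks/0/
juju scp loki-data/active-index-directory/0/. loki-new/0:/var/lib/juju/storage/active-index-directory/0
juju ssh loki-new/0 "find /var/lib/juju/storage/ -type f -exec chown root:root {} \;"

# Start the new workload
juju ssh --container loki loki-new/0 /charm/bin/pebble start loki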
Reported Scenario
In a customer deployment, a cos model running on a Juju 3.1.6 controller was upgraded to 3.4.3 following the Juju controller upgrade procedure. After the migration of the cos model, the loki charm degraded to Blocked state. Since recovery of the loki charm's Juju unit was not successful (refresh, remove-unit), the data restore procedure described below was applied.
Initial situation
Let’s imagine we have Loki deployed using the loki-k8s charmed operator in a model named loki, like this one:
Model Controller Cloud/Region Version SLA Timestamp
loki microk8s microk8s/localhost 3.5.1 unsupported 16:27:46-03:00
App Version Status Scale Charm Channel Rev Address Exposed Message
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.151 no
Unit Workload Agent Address Ports Message
loki/0* active idle 10.1.9.236
Offer Application Charm Rev Connected Endpoint Interface Role
loki loki loki-k8s 160 0/0 logging loki_push_api provider
Integration provider Requirer Interface Type Message
loki:replicas loki:replicas loki_replica peer
Stop original Loki instance.
Before stopping the Loki instance, we need to check that the workload is running and that the charm itself is healthy.
Let’s verify the Loki workload is running:
$ juju ssh --container loki loki/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled active today at 20:14 UTC
And to verify charm status:
$ juju status | grep loki | grep active
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.125 no
loki/0* active idle 10.1.9.249
So, we have verified the Loki workload is running and the Loki charm is active/idle. Now, let’s stop the Pebble service named loki:
$ juju ssh --container loki loki/0 /charm/bin/pebble stop loki
And verify that it is actually stopped:
$ juju ssh --container loki loki/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled inactive today at 20:14 UTC
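If you are scripting this, here is a small sketch to wait until Pebble actually reports the service as inactive before moving on (the service name loki matches the one shown above):

until juju ssh --container loki loki/0 /charm/bin/pebble services loki | grep -q inactive; do
  sleep 2
done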
As we know, the Loki charm only restarts the workload if the services defined in the Pebble layer change; since our service definition does not change, we can be sure we don’t need to do anything else.
If that is not the case, we can prevent Loki from being restarted by commenting out the last line in the charm.py file and adding a pass statement (remember to revert this change once the migration is done):
if __name__ == "__main__":
# main(LokiOperatorCharm)
pass
Deploy a new Loki instance in the same model.
The Loki charm defines two storages: loki-chunks and active-index-directory. We deployed the original Loki charm with 2GB for active-index-directory and 5GB for loki-chunks, so let’s deploy the new one the same way.
$ juju deploy loki-k8s loki-new --channel edge --storage active-index-directory=2GB --storage loki-chunks=5GB --trust
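As an optional sanity check, we can confirm that the new application got storage of the expected sizes (a sketch; the exact column layout depends on your Juju version):

$ juju storage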
Now let’s create an offer for the new charm, so charms in other models may send logs:
$ juju offer loki-new:logging
Application "loki-new" endpoints [logging] available at "admin/loki.loki-new"
After this, our model looks like this:
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
loki microk8s microk8s/localhost 3.5.1 unsupported 17:58:17-03:00
App Version Status Scale Charm Channel Rev Address Exposed Message
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.151 no
loki-new 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.76 no
Unit Workload Agent Address Ports Message
loki-new/0* active idle 10.1.9.240
loki/0* active idle 10.1.9.236
Offer Application Charm Rev Connected Endpoint Interface Role
loki loki loki-k8s 160 0/0 logging loki_push_api provider
loki-new loki-new loki-k8s 160 0/0 logging loki_push_api provider
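Later on, client charms in other models can be pointed at the new offer. Here is a hedged sketch from a consumer model; my-client-model and the my-app application are placeholders, not part of this deployment:

$ juju switch my-client-model
$ juju consume admin/loki.loki-new
$ juju integrate my-app loki-new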
Stop the new Loki instance.
As we are going to copy data from one instance to another, we have to make sure the new Loki workload is not running, the same way we did before:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble stop loki
Verify Loki’s config files between instances.
It’s important to verify that there is no difference between the config files of the two instances.
First of all, let’s download both files:
$ juju ssh --container loki loki-new/0 cat /etc/loki/loki-local-config.yaml > loki-new-conf.yaml
$ juju ssh --container loki loki/0 cat /etc/loki/loki-local-config.yaml > loki-conf.yaml
and now, a simple diff:
$ diff loki-conf.yaml loki-new-conf.yaml
10c10
< instance_addr: loki-0.loki-endpoints.loki.svc.cluster.local
---
> instance_addr: loki-new-0.loki-new-endpoints.loki.svc.cluster.local
47c47
< external_url: http://loki-0.loki-endpoints.loki.svc.cluster.local:3100
---
> external_url: http://loki-new-0.loki-new-endpoints.loki.svc.cluster.local:3100
As we can see, the only differences are the ones related to instance_addr and external_url. We are OK!
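To double-check that those are indeed the only differences, we can filter the two expected lines out and diff again; this should produce no output:

$ diff <(grep -vE 'instance_addr|external_url' loki-conf.yaml) <(grep -vE 'instance_addr|external_url' loki-new-conf.yaml)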
There is another important aspect to take into account: boltdb-shipper vs tsdb. The schema_config section in the config file looks like this:
schema_config:
  configs:
  - from: '2020-10-24'
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
  - from: '2024-06-28'
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v12
    store: tsdb
In order to avoid data corruption, keep in mind:
- If you just refreshed your old Loki, then “tomorrow” is the start date (the from key) for using tsdb.
- If you migrate data to a new Loki (our situation), then “now” is the start date (the from key) for using tsdb.
Backup the chunks and the index from the original instance.
First, let’s create a directory to store the backup:
$ mkdir loki-data
The Loki charm stores the chunks and the index in /var/lib/juju/storage/:
$ juju ssh loki/0 ls -l /var/lib/juju/storage/
total 8
drwxr-xr-x 3 root root 4096 Jun 28 18:26 active-index-directory
drwxr-xr-x 3 root root 4096 Jun 28 18:26 loki-chunks
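Before copying anything, it can help to check how much data will be transferred, since the duration of the next step depends on it:

$ juju ssh loki/0 du -sh /var/lib/juju/storage/loki-chunks /var/lib/juju/storage/active-index-directory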
We need to back up these two directories. Please note that this step may take several minutes depending on the amount of logs stored by Loki.
$ juju scp -v loki/0:/var/lib/juju/storage/ loki-data
Once this copy ends, we can verify we have the data backed up:
$ tree loki-data
loki-data
├── active-index-directory
│ └── 0
│ ├── tsdb-index
│ │ ├── multitenant
│ │ │ └── index_19902
│ │ │ ├── 1719603929-loki-0-1719599249308398412.tsdb
│ │ │ └── 1719604829-loki-0-1719599249308398412.tsdb
│ │ ├── per_tenant
│ │ ├── scratch
│ │ ├── uploader
│ │ │ └── name
│ │ └── wal
│ │ └── filesystem_2024-06-28
│ └── uploader
│ └── name
└── loki-chunks
└── 0
├── fake
│ ├── 1079a3bd097b9c20
│ │ └── MTkwNjAxYTc2MmY6MTkwNjAxYTkwZmQ6OTc0NjU2ZTc=
│ ├── 1952b92b2124347d
│ │ └── MTkwNjA1MTI0ZDg6MTkwNjA1MTI1ZDA6MWFhNzY2NmM=
│ ├── 1ece4d233b9fb8f3
│ │ └── MTkwNjA1MTI0NWQ6MTkwNjA1MTI0NWU6OGZjYzJhMGY=
│ ├── 25da34e9fbed6e35
│ │ └── MTkwNjAxYTc2OGU6MTkwNjAxYTkxYmM6NjVmODIxZWM=
...
├── index
│ └── index_19902
│ ├── 1719603929-loki-0-1719599249308398412.tsdb.gz
│ ├── 1719604829-loki-0-1719599249308398412.tsdb.gz
│ └── fake
│ └── 1719605254-compactor-1719599254219-1719602852841-b96f0369.tsdb.gz
├── loki-local-config.yaml.bak
├── loki_cluster_seed.json
└── wal
├── 00000045
├── 00000046
└── checkpoint.000044
└── 00000000
79 directories, 2793 files
Restore the backup into the new Loki instance.
A required action before restoring our backup is to delete the files inside /var/lib/juju/storage/loki-chunks/0/wal, so let’s do that:
$ juju ssh loki-new/0 "rm -rf /var/lib/juju/storage/loki-chunks/0/wal/*"
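We can quickly confirm the WAL directory is now empty:

$ juju ssh loki-new/0 ls -la /var/lib/juju/storage/loki-chunks/0/wal/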
Now, let’s restore the backup in two simple steps:
$ juju scp loki-data/loki-chunks/0/. loki-new/0:/var/lib/juju/storage/loki-chunks/0/
$ juju scp loki-data/active-index-directory/0/. loki-new/0:/var/lib/juju/storage/active-index-directory/0
Now a simple verification can be done to check whether the files are there or not:
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/loki-chunks/0
total 20
drwxr-xr-x 78 root root 4096 Jun 28 21:30 fake
drwxr-xr-x 3 root root 4096 Jun 28 21:01 index
-rw-rw-r-- 1 1000 1000 1842 Jun 28 21:30 loki-local-config.yaml.bak
-rw-rw-r-- 1 1000 1000 280 Jun 28 21:30 loki_cluster_seed.json
drwxr-xr-x 4 root root 4096 Jun 28 21:30 wal
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/active-index-directory/0
total 8
drwxr-xr-x 7 root root 4096 Jun 28 20:25 tsdb-index
drwxr-xr-x 2 root root 4096 Jun 28 21:30 uploader
Fix files and directories ownership.
Maybe you did not notice, but the ownership of the files and directories was not preserved. In all cases the owner should be root, and if we look carefully at the output of the previous step we can see 1000. So let’s fix it!
$ juju ssh loki-new/0 "find /var/lib/juju/storage/loki-chunks/0/ -type f -exec chown root:root {} \;"
$ juju ssh loki-new/0 "find /var/lib/juju/storage/active-index-directory/0/ -type f -exec chown root:root {} \;"
A quick check that everything is owned by “root”:
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/loki-chunks/0
total 20
drwxr-xr-x 78 root root 4096 Jun 28 21:30 fake
drwxr-xr-x 3 root root 4096 Jun 28 21:01 index
-rw-rw-r-- 1 root root 1842 Jun 28 21:30 loki-local-config.yaml.bak
-rw-rw-r-- 1 root root 280 Jun 28 21:30 loki_cluster_seed.json
drwxr-xr-x 4 root root 4096 Jun 28 21:30 wal
Start the new Loki instance
Well, we are almost there. The final step in the migration process is to start the loki service again in the new instance:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble start loki
And a quick verification:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled active today at 22:05 UTC
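Before querying, we can also hit Loki’s readiness endpoint (10.1.9.240 is the loki-new/0 unit address from the juju status output above); once Loki is fully up it should answer with “ready”:

$ curl -s 10.1.9.240:3100/ready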
Verify that logs are stored in the new Loki instance
At this point the new Loki instance is running and holds all the logs stored in the original instance. But can we verify that? Of course we can! Let’s run a simple query:
$ curl -sG 10.1.9.240:3100/loki/api/v1/query --data-urlencode 'query=rate({job=~".+"}[3h])' | jq '.data.result'
[
{
"metric": {
"container": "workload",
"filename": "/bin/fake.log",
"job": "juju_loki_650186a5_flog",
"juju_application": "flog",
"juju_charm": "flog-k8s",
"juju_model": "loki",
"juju_model_uuid": "650186a5-a17b-4b06-8c62-d70461c57895",
"juju_unit": "flog/0"
},
"value": [
1719615455.899,
"37.051203703703706"
]
}
]
and voilà!!
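As an extra sanity check, we can also list the labels the new Loki knows about and compare them with the original instance:

$ curl -sG 10.1.9.240:3100/loki/api/v1/labels | jq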