How to migrate logs stored in one Loki instance to a fresh new one
A few days ago @selcem came to the Observability team with a simple question: “How do we migrate the logs stored in one Loki deployment to another one?”
Since we had never done it before, we started looking for ideas online… and after some research we did not find any mature solution: a blog post sketching an idea, a not-well-tested tool, a GitHub issue, and not much else.
After a few hours of trial and error, @selcem, @sed-i and I came up with a solution to this problem.
Keep in mind that this solution worked well in our specific environment and may not work in yours.
For the migration process, use the same Loki charm channel/revision on both instances.
At a glance, the steps we followed to migrate logs from one Loki to another are (a condensed command sketch follows this list):
- Stop original Loki instance.
- Deploy a new Loki instance in the same model.
- Stop the new Loki instance.
- Verify Loki’s config files between instances.
- Backup the chunks and the index from the original instance.
- Restore the backup into the new Loki instance.
- Fix files and directories ownership.
- Start the new Loki instance.
- Verify that logs are stored in the new Loki instance.
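For the impatient, here is a condensed sketch of the whole procedure as shell commands. It assumes the same application names used throughout this post (loki for the original instance, loki-new for the new one); each step is explained in detail below.

# Stop the original Loki workload
juju ssh --container loki loki/0 /charm/bin/pebble stop loki

# Deploy the new instance with the same storage sizes, wait for the unit to
# come up, then stop its workload too
juju deploy loki-k8s loki-new --channel edge --storage active-index-directory=2GB --storage loki-chunks=5GB --trust
juju ssh --container loki loki-new/0 /charm/bin/pebble stop loki

# Back up chunks and index from the original unit
mkdir loki-data
juju scp loki/0:/var/lib/juju/storage/ loki-data

# Clear the new WAL, restore the backup and fix file ownership
juju ssh loki-new/0 "rm -rf /var/lib/juju/storage/loki-chunks/0/wal/*"
juju scp loki-data/loki-chunks/0/. loki-new/0:/var/lib/juju/storage/loki-chunks/0/
juju scp loki-data/active-index-directory/0/. loki-new/0:/var/lib/juju/storage/active-index-directory/0
juju ssh loki-new/0 "find /var/lib/juju/storage/ -type f -exec chown root:root {} \;"

# Start the new workload
juju ssh --container loki loki-new/0 /charm/bin/pebble start loki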
Reported Scenario
In a customer deployment, a cos model running on a Juju 3.1.6 controller was upgraded to 3.4.3 following the Juju controller upgrade procedure. After the migration of the cos model, the loki charm degraded to Blocked state. Since recovery of the loki charm's Juju unit was not successful (refresh, remove-unit), the data restore procedure described below was applied.
Initial situation
Let’s imagine we have Loki deployed using the loki-k8s charmed operator in a model named loki, like this one:
Model Controller Cloud/Region Version SLA Timestamp
loki microk8s microk8s/localhost 3.5.1 unsupported 16:27:46-03:00
App Version Status Scale Charm Channel Rev Address Exposed Message
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.151 no
Unit Workload Agent Address Ports Message
loki/0* active idle 10.1.9.236
Offer Application Charm Rev Connected Endpoint Interface Role
loki loki loki-k8s 160 0/0 logging loki_push_api provider
Integration provider Requirer Interface Type Message
loki:replicas loki:replicas loki_replica peer
Stop original Loki instance.
Before stopping the Loki instance, we need to check that the workload is running and that the charm itself is healthy.
Let’s verify the Loki workload is running:
$ juju ssh --container loki loki/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled active today at 20:14 UTC
And to verify charm status:
$ juju status | grep loki | grep active
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.125 no
loki/0* active idle 10.1.9.249
So, we have verified the Loki workload is running and the Loki charm is active/idle. Now, let’s stop the Pebble service named loki:
$ juju ssh --container loki loki/0 /charm/bin/pebble stop loki
And verify that it is actually stopped:
$ juju ssh --container loki loki/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled inactive today at 20:14 UTC
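If you are scripting this, here is a small sketch to wait until Pebble actually reports the service as inactive before moving on (the service name loki matches the one shown above):

until juju ssh --container loki loki/0 /charm/bin/pebble services loki | grep -q inactive; do
  sleep 2
done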
As we know, the Loki charm only restarts the workload if the services defined in the Pebble layer change; since our service definition does not change, we can be sure we don’t need to do anything else.
If that is not the case, we can prevent Loki from being restarted by commenting out the last line in the charm.py file and adding a pass statement (remember to revert this change once the migration is done):
if __name__ == "__main__":
# main(LokiOperatorCharm)
pass
Deploy a new Loki instance in the same model.
The Loki charm defines two storages: loki-chunks and active-index-directory. We deployed the original Loki charm with 2GB for active-index-directory and 5GB for loki-chunks, so let’s deploy the new one the same way.
$ juju deploy loki-k8s loki-new --channel edge --storage active-index-directory=2GB --storage loki-chunks=5GB --trust
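As an optional sanity check, we can confirm that the new application got storage of the expected sizes (a sketch; the exact column layout depends on your Juju version):

$ juju storage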
Now let’s create an offer for the new charm, so charms in other models may send logs:
$ juju offer loki-new:logging
Application "loki-new" endpoints [logging] available at "admin/loki.loki-new"
After this, our model looks like this:
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
loki microk8s microk8s/localhost 3.5.1 unsupported 17:58:17-03:00
App Version Status Scale Charm Channel Rev Address Exposed Message
loki 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.151 no
loki-new 2.9.6 active 1 loki-k8s latest/edge 160 10.152.183.76 no
Unit Workload Agent Address Ports Message
loki-new/0* active idle 10.1.9.240
loki/0* active idle 10.1.9.236
Offer Application Charm Rev Connected Endpoint Interface Role
loki loki loki-k8s 160 0/0 logging loki_push_api provider
loki-new loki-new loki-k8s 160 0/0 logging loki_push_api provider
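Later on, client charms in other models can be pointed at the new offer. Here is a hedged sketch from a consumer model; my-client-model and the my-app application are placeholders, not part of this deployment:

$ juju switch my-client-model
$ juju consume admin/loki.loki-new
$ juju integrate my-app loki-new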
Stop the new Loki instance.
As we are going to copy data from one instance to another, we have to make sure the new Loki workload is not running, the same way we did before:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble stop loki
Verify Loki’s config files between instances.
It’s important to verify that there is no difference between the config files of the two instances.
First of all, let’s download both files:
$ juju ssh --container loki loki-new/0 cat /etc/loki/loki-local-config.yaml > loki-new-conf.yaml
$ juju ssh --container loki loki/0 cat /etc/loki/loki-local-config.yaml > loki-conf.yaml
and now, a simple diff:
$ diff loki-conf.yaml loki-new-conf.yaml
10c10
< instance_addr: loki-0.loki-endpoints.loki.svc.cluster.local
---
> instance_addr: loki-new-0.loki-new-endpoints.loki.svc.cluster.local
47c47
< external_url: http://loki-0.loki-endpoints.loki.svc.cluster.local:3100
---
> external_url: http://loki-new-0.loki-new-endpoints.loki.svc.cluster.local:3100
As we can see, the only differences are the ones related to instance_addr and external_url. We are OK!
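To double-check that those are indeed the only differences, we can filter the two expected lines out and diff again; this should produce no output:

$ diff <(grep -vE 'instance_addr|external_url' loki-conf.yaml) <(grep -vE 'instance_addr|external_url' loki-new-conf.yaml)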
There is another important aspect to take into account: boltdb-shipper vs tsdb. The schema_config section in the config file looks like this:
schema_config:
  configs:
  - from: '2020-10-24'
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
  - from: '2024-06-28'
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v12
    store: tsdb
In order to avoid data corruption, keep in mind:
- If you just refreshed your old Loki, then “tomorrow” is the start date (the from key) for using tsdb.
- If you migrate data to a new Loki (our situation), then “now” is the start date (the from key) for using tsdb.
Backup the chunks and the index from the original instance.
First, let’s create a directory to store the backup:
$ mkdir loki-data
The Loki charm stores the chunks and the index in /var/lib/juju/storage/:
$ juju ssh loki/0 ls -l /var/lib/juju/storage/
total 8
drwxr-xr-x 3 root root 4096 Jun 28 18:26 active-index-directory
drwxr-xr-x 3 root root 4096 Jun 28 18:26 loki-chunks
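Before copying anything, it can help to check how much data will be transferred, since the duration of the next step depends on it:

$ juju ssh loki/0 du -sh /var/lib/juju/storage/loki-chunks /var/lib/juju/storage/active-index-directory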
We need to back up these two directories. Please note that this step may take several minutes depending on the amount of logs stored by Loki.
$ juju scp -v loki/0:/var/lib/juju/storage/ loki-data
Once this copy ends, we can verify we have the data backed up:
$ tree loki-data
loki-data
├── active-index-directory
│ └── 0
│ ├── tsdb-index
│ │ ├── multitenant
│ │ │ └── index_19902
│ │ │ ├── 1719603929-loki-0-1719599249308398412.tsdb
│ │ │ └── 1719604829-loki-0-1719599249308398412.tsdb
│ │ ├── per_tenant
│ │ ├── scratch
│ │ ├── uploader
│ │ │ └── name
│ │ └── wal
│ │ └── filesystem_2024-06-28
│ └── uploader
│ └── name
└── loki-chunks
└── 0
├── fake
│ ├── 1079a3bd097b9c20
│ │ └── MTkwNjAxYTc2MmY6MTkwNjAxYTkwZmQ6OTc0NjU2ZTc=
│ ├── 1952b92b2124347d
│ │ └── MTkwNjA1MTI0ZDg6MTkwNjA1MTI1ZDA6MWFhNzY2NmM=
│ ├── 1ece4d233b9fb8f3
│ │ └── MTkwNjA1MTI0NWQ6MTkwNjA1MTI0NWU6OGZjYzJhMGY=
│ ├── 25da34e9fbed6e35
│ │ └── MTkwNjAxYTc2OGU6MTkwNjAxYTkxYmM6NjVmODIxZWM=
...
├── index
│ └── index_19902
│ ├── 1719603929-loki-0-1719599249308398412.tsdb.gz
│ ├── 1719604829-loki-0-1719599249308398412.tsdb.gz
│ └── fake
│ └── 1719605254-compactor-1719599254219-1719602852841-b96f0369.tsdb.gz
├── loki-local-config.yaml.bak
├── loki_cluster_seed.json
└── wal
├── 00000045
├── 00000046
└── checkpoint.000044
└── 00000000
79 directories, 2793 files
Restore the backup into the new Loki instance.
A required action before restoring our backup is to delete the files inside /var/lib/juju/storage/loki-chunks/0/wal, so let’s do that:
$ juju ssh loki-new/0 "rm -rf /var/lib/juju/storage/loki-chunks/0/wal/*"
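We can quickly confirm the WAL directory is now empty:

$ juju ssh loki-new/0 ls -la /var/lib/juju/storage/loki-chunks/0/wal/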
Now, let’s restore the backup in two simple steps:
$ juju scp loki-data/loki-chunks/0/. loki-new/0:/var/lib/juju/storage/loki-chunks/0/
$ juju scp loki-data/active-index-directory/0/. loki-new/0:/var/lib/juju/storage/active-index-directory/0
Now a simple verification can be done to check whether the files are there or not:
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/loki-chunks/0
total 20
drwxr-xr-x 78 root root 4096 Jun 28 21:30 fake
drwxr-xr-x 3 root root 4096 Jun 28 21:01 index
-rw-rw-r-- 1 1000 1000 1842 Jun 28 21:30 loki-local-config.yaml.bak
-rw-rw-r-- 1 1000 1000 280 Jun 28 21:30 loki_cluster_seed.json
drwxr-xr-x 4 root root 4096 Jun 28 21:30 wal
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/active-index-directory/0
total 8
drwxr-xr-x 7 root root 4096 Jun 28 20:25 tsdb-index
drwxr-xr-x 2 root root 4096 Jun 28 21:30 uploader
Fix files and directories ownership.
Maybe you did not notice, but the ownership of the files and directories was not preserved. In all cases the owner should be root, and if we look carefully at the output of the previous step we can see 1000. So let’s fix it!
$ juju ssh loki-new/0 "find /var/lib/juju/storage/loki-chunks/0/ -type f -exec chown root:root {} \;"
$ juju ssh loki-new/0 "find /var/lib/juju/storage/active-index-directory/0/ -type f -exec chown root:root {} \;"
A quick check that everything is owned by “root”:
$ juju ssh loki-new/0 ls -l /var/lib/juju/storage/loki-chunks/0
total 20
drwxr-xr-x 78 root root 4096 Jun 28 21:30 fake
drwxr-xr-x 3 root root 4096 Jun 28 21:01 index
-rw-rw-r-- 1 root root 1842 Jun 28 21:30 loki-local-config.yaml.bak
-rw-rw-r-- 1 root root 280 Jun 28 21:30 loki_cluster_seed.json
drwxr-xr-x 4 root root 4096 Jun 28 21:30 wal
Start the new Loki instance
Well, we are almost there. The final step in the migration process is to start the loki service again in the new instance:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble start loki
And a quick verification:
$ juju ssh --container loki loki-new/0 /charm/bin/pebble services
Service Startup Current Since
loki disabled active today at 22:05 UTC
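Before querying, we can also hit Loki’s readiness endpoint (10.1.9.240 is the loki-new/0 unit address from the juju status output above); once Loki is fully up it should answer with “ready”:

$ curl -s 10.1.9.240:3100/ready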
Verify that logs are stored in the new Loki instance
At this point the new Loki instance is running and holds all the logs stored in the original instance. But can we verify that? Of course we can! Let’s run a simple query:
$ curl -sG 10.1.9.240:3100/loki/api/v1/query --data-urlencode 'query=rate({job=~".+"}[3h])' | jq '.data.result'
[
{
"metric": {
"container": "workload",
"filename": "/bin/fake.log",
"job": "juju_loki_650186a5_flog",
"juju_application": "flog",
"juju_charm": "flog-k8s",
"juju_model": "loki",
"juju_model_uuid": "650186a5-a17b-4b06-8c62-d70461c57895",
"juju_unit": "flog/0"
},
"value": [
1719615455.899,
"37.051203703703706"
]
}
]
and voilà!!
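As an extra sanity check, we can also list the labels the new Loki knows about and compare them with the original instance:

$ curl -sG 10.1.9.240:3100/loki/api/v1/labels | jq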