Dashboards
There are many ways of visualising the metrics Reaper exposes. Here is just a few examples to get you started.
We are going to use Grafana to visualize metrics fetched from a Prometheus backed. We will also assume the metrics were relabeled.
Repair progress
One of the simplest things to visualize is repair progress. We can do that using a simple Gauge
, where we use use the following query:
io_cassandrareaper_service_RepairRunner_repairProgress
To make the panel more pretty, we can enter the following in the Legend
field:
{{cluster}} - {{keyspace}}
With two keyspaces doing repairs, we should see two dials, one for each keyspace:
We can now switch the Gauge
visualisation for a Stat
one, which will give us the repair progress, but also a small graph showing the progress over time:
To monitor the repair progress in terms of the number of repaired segments, we can use the following query:
io_cassandrareaper_service_RepairRunner_segmentsDone
Again, we put this query into the Stat
panel, but we also select Last
for the Calc
field in the Display
part of the Visualization
tab. We should get a chart that looks like this:
Segment duration
When monitoring repairs, we would like to know how much time it takes to repair a segment. As a rule of thumb, we would like each segment to take 10 to 15 minutes. This indicates the segments are big enough to not waste the overhead of a repair session and minimise over-streaming, while not being too large to risk streaming timeouts.
To check segment durations, we will use the SegmentRunner_runRepair
metric. This metric is a timer covering the duration from the moment a SegmentRunner
wakes up to repair a segment until the segment finishes. In this case, we will use a Graph
and feed it the following querry:
io_cassandrareaper_service_SegmentRunner_repairing{quantile="0.5"}
Unlike previous examples, here we add a filter to explicitly pick the 0.5
th percentile because we are interested in the majority of the segments, not just the longest ones. Reaper already gives us this duration in seconds, so we pick Seconds (s)
as a Unit
for the Left Y
axis in the Axes
section of the Visualization tab. We should end up with a graph that looks like this:
We see our segments are taking ~30 seconds, which is way below the desired ~10 minutes. We should tweak our repairs to use less segments.
Segments per hour
For the last two graphs, we’ll try something harder. We’ll try to plot the number of segments repaired per hour.
We create another Graph
panel and feed it a query like this:
(increase(io_cassandrareaper_service_RepairRunner_segmentsDone[1h]))
The above simply means we ask Prometheus to give us the increase of the given metric in a 1 hour window.
Next, we enter 1h
as a Min step
in the query tab (it’s next to the Legend
field), and we set 24h
as a Relative time
. In the Visualization
tab, we select to draw Bars
and turn on stacking. We should end up with a graph like this:
Note that the panel now says it tracks the last 24 hours in the top right corner - this is because of the relative time we selected. Also, even though it’s not obvious from the graph, Grafana will draw a bar only after that hour passes. In other words, you might need to wait a bit before you start to see your bars.
Segments in the last hour
Finally, we can also plot the number of segments repaired in the last hour only. We use the same query, min step
and Relative time
as before, but this time we put it into a Bar Gauge
with a vertical orientation. We should see this: