Introduction
Coveo’s data platform team is responsible for ingesting analytics data and making it available to internal teams as well as to customers. Over the last few years, our practices have matured: we’ve added many more tests, resiliency to transient errors, and monitoring to our data pipelines. As a next logical step, we wanted to measure and visualize the rate at which we meet or break our service-level objectives. This article covers the importance of service-level objectives and stability, as well as the technical details of how we measured them.
Service-level objectives
We define service-level objectives (SLOs) for our data projects with a freshness service-level indicator (SLI) that varies by project, along with a target percentage of minutes (95%) for which the indicator must be met. For example, one project might require processing data every 15 minutes, while for another it could be every 24 hours. We then measure for what percentage of the day the threshold was achieved and mark any day below 95% as falling outside the objective (95% of a day leaves an allowance of about 72 minutes outside the threshold). Here are some of the reasons why we provide service-level objectives to our internal users:
Expectations management: we want our users to know what to expect from each data source so they can evaluate whether it fits their use case. If they need something that refreshes more frequently or offers stronger guarantees, they may want to look for a different data source, or work with us so we can bring the source in line with their needs.
Performance monitoring: defining SLOs makes it possible for us to identify performance or reliability issues from a customer point of view. We are also able to adjust pipelines to run on less compute (and thus more slowly) for cost optimization, as long as it doesn’t impact the SLOs.
Resource allocation: by having agreed-upon SLOs, we can prioritize efforts on the projects that don’t meet them and justify the work needed to improve reliability.
Operational context
To update data in a cost-effective manner, our data platform processes data incrementally in batch intervals that are tracked. The scheduler for the batches uses a table that contains the start of an interval, its end, and the time at which it was completed. For example, it could look like this:
Start | End | Completion |
---|---|---|
3:05 | 3:10 | 3:15 |
3:10 | 3:15 | 3:20 |
3:15 | 3:20 | 3:30 |
3:20 | 3:25 | 4:05 |
In this example, we’re running intervals of 5 minutes that usually complete 5 minutes after the previous run completed. Let’s say we set our service-level indicator threshold at 15 minutes. The interval 3:15 to 3:20 completed at 3:30, 10 minutes after the previous completion at 3:20, which is within the SLI. However, the interval 3:20 to 3:25 completed at 4:05, 35 minutes after the previous completion at 3:30, which doesn’t meet the SLI. If more than 5% of the minutes in the day are not acceptable according to the SLI, then the SLO is not met.
We’re interested in two measurements: how many completions were late, as in the example above, and what percentage of the minutes of the day the data was within the threshold.
Note that the SLI threshold isn’t the same as the run frequency, and that’s on purpose: we want runs to happen more frequently than the threshold so there is room to recover when transient failures occur.
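For reference, here’s a minimal sketch of what such a completion-tracking table could look like in Snowflake; the table name is illustrative, and the column names mirror the ones used in the SQL examples below.

CREATE TABLE IF NOT EXISTS pipeline_interval_completions (
    interval_start TIMESTAMP_TZ, -- start of the processed data interval
    interval_end   TIMESTAMP_TZ, -- end of the processed data interval
    completion     TIMESTAMP_TZ  -- when the batch finished processing the interval
);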
Some SQL
To measure the SLO, we will use SQL queries against a table recording the completion of intervals, like the one in the example. This lets us compute, for each minute of the day, what the freshness of the data was, without having to run a query at that minute to make an observation. The SQL code provided is valid in Snowflake, but should run in other databases with minimal changes.
Measuring how many runs are late
We’ll calculate how many pipeline completions were late by taking advantage of the LAG window function to compute the time between completions.
Let’s first create some synthetic rows, so we can run the code without any tables or views.
SELECT to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion;
# | INTERVAL_START | INTERVAL_END | COMPLETION |
---|---|---|---|
1 | 2024-03-01 01:00:00 +0000 | 2024-03-01 01:05:00 +0000 | 2024-03-01 01:10:00 +0000 |
2 | 2024-03-01 01:05:00 +0000 | 2024-03-01 01:10:00 +0000 | 2024-03-01 01:15:00 +0000 |
3 | 2024-03-01 01:10:00 +0000 | 2024-03-01 01:15:00 +0000 | 2024-03-01 01:25:00 +0000 |
4 | 2024-03-01 01:20:00 +0000 | 2024-03-01 01:25:00 +0000 | 2024-03-01 02:10:00 +0000 |
WITH completions AS (
SELECT
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
),
calculated_minutes_between_completions AS (
SELECT
completion,
-- completion time of the previous run, in completion order
LAG(completion) OVER (ORDER BY completion) AS previous_completion,
-- Snowflake allows referencing the previous_completion alias later in the same SELECT
datediff(minute, previous_completion, completion) AS minutes_between_completions
FROM
completions
)
SELECT
to_date(completion) AS completion_day,
-- count completions that arrived more than 15 minutes after the previous one
SUM(to_number(minutes_between_completions > 15)) AS nb_events
FROM calculated_minutes_between_completions
GROUP BY all;
COMPLETION_DAY | NB_EVENTS |
---|---|
2024-03-01 | 1 |
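As an aside, Snowflake also has a COUNT_IF aggregate that expresses the same condition without the boolean-to-number conversion; here’s a sketch of an equivalent final SELECT for the query above:

SELECT
    to_date(completion) AS completion_day,
    -- count completions that exceeded the 15-minute threshold
    COUNT_IF(minutes_between_completions > 15) AS nb_events
FROM calculated_minutes_between_completions
GROUP BY all;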
As we have multiple pipelines, we improve on the late-runs query by creating a table or view containing the objectives for each one. First, we add the pipeline name to our completion data.
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 02:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 02:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 03:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 03:15:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 03:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 04:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 04:55:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 04:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 05:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 09:10:00 +0000') AS completion;
PIPELINE_NAME | INTERVAL_START | INTERVAL_END | COMPLETION |
---|---|---|---|
A | 2024-03-01 01:00:00 +0000 | 2024-03-01 01:05:00 +0000 | 2024-03-01 01:10:00 +0000 |
A | 2024-03-01 01:05:00 +0000 | 2024-03-01 01:10:00 +0000 | 2024-03-01 01:15:00 +0000 |
A | 2024-03-01 01:10:00 +0000 | 2024-03-01 01:15:00 +0000 | 2024-03-01 01:25:00 +0000 |
A | 2024-03-01 01:20:00 +0000 | 2024-03-01 01:25:00 +0000 | 2024-03-01 02:10:00 +0000 |
B | 2024-03-01 01:00:00 +0000 | 2024-03-01 02:00:00 +0000 | 2024-03-01 02:10:00 +0000 |
B | 2024-03-01 02:00:00 +0000 | 2024-03-01 03:00:00 +0000 | 2024-03-01 03:15:00 +0000 |
B | 2024-03-01 03:00:00 +0000 | 2024-03-01 04:05:00 +0000 | 2024-03-01 04:55:00 +0000 |
B | 2024-03-01 04:00:00 +0000 | 2024-03-01 05:00:00 +0000 | 2024-03-01 09:10:00 +0000 |
Then we can add the service-level indicator thresholds, for example in another CTE.
SELECT
'A' AS pipeline_name,
15 AS freshness_threshold_in_minutes
UNION ALL
SELECT
'B' AS pipeline_name,
2 * 60 + 30 AS freshness_threshold_in_minutes;
PIPELINE_NAME | FRESHNESS_THRESHOLD_IN_MINUTES |
---|---|
A | 15 |
B | 150 |
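In practice, we keep these thresholds in an actual table or view rather than hardcoding them in a CTE; here’s a minimal sketch of what that could look like (names are illustrative). For the examples in this article, everything stays inline as CTEs so the queries remain self-contained.

CREATE TABLE IF NOT EXISTS sli_thresholds (
    pipeline_name                  VARCHAR,
    freshness_threshold_in_minutes NUMBER
);

INSERT INTO sli_thresholds VALUES
    ('A', 15),
    ('B', 150); -- 2 hours and 30 minutes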
The query would then change to this:
WITH completions AS (
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT
'A' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 02:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 02:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 03:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 03:15:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 03:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 04:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 04:55:00 +0000') AS completion
UNION ALL
SELECT
'B' AS pipeline_name,
to_timestamp_tz('2024-03-01 04:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 05:00:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 09:10:00 +0000') AS completion
),
sli_thresholds AS (
SELECT
'A' AS pipeline_name,
15 AS freshness_threshold_in_minutes
UNION ALL
SELECT
'B' AS pipeline_name,
2 * 60 + 30 AS freshness_threshold_in_minutes
),
calculated_minutes_between_completions AS (
SELECT
pipeline_name,
completion,
LAG(completion) OVER (PARTITION BY pipeline_name ORDER BY completion) AS previous_completion,
datediff(minute, previous_completion, completion) AS minutes_between_completions
FROM
completions
)
SELECT
pipeline_name,
to_date(completion) AS completion_day,
SUM(to_number(minutes_between_completions > freshness_threshold_in_minutes)) AS nb_events
FROM calculated_minutes_between_completions
INNER JOIN sli_thresholds USING (pipeline_name)
GROUP BY all;
PIPELINE_NAME | COMPLETION_DAY | NB_EVENTS |
---|---|---|
A | 2024-03-01 | 1 |
B | 2024-03-01 | 1 |
With this change, we get, for each pipeline, the number of completions where the time since the previous completion exceeded the threshold defined in the SLI. This lets us know how many times a day we had issues, but not how long it took to recover. That’s where the next part, measuring the percentage of minutes within the SLI, comes in.
Measuring the percentage of minutes within the SLI
We could measure the percentage of minutes within the SLI by running a script every minute to observe the freshness of the pipeline and record it. However, writing and maintaining such a script is work we’d rather avoid, so we looked for a more efficient way to verify that our data pipelines meet our SLOs for freshness. Instead, we leverage the completion data we already have from our scheduler, along with some less commonly used SQL features, to simulate these observations. We’ll first use the GENERATOR table function to create a row for each minute, and then use an ASOF JOIN (recently added to Snowflake) to find which completion was the latest for each minute.
Generating rows for each minute
Let’s start by looking at a query we can use to generate a row for each minute over a period.
SELECT
convert_timezone(
'UTC',
DATEADD(
MINUTE,
SEQ4(),
dateadd(
day,
-30 * 4,
date_trunc('minute', current_timestamp())
)
)
) AS timestamp
FROM
TABLE(GENERATOR(ROWCOUNT => 60 * 24 * 30 * 4)) -- Generates timestamps for every minute for the last 4 months
Let’s first look at what the generator and seq4() functions are doing.
SELECT seq4()
FROM TABLE(GENERATOR(ROWCOUNT => 5));
seq4() |
---|
0 |
1 |
2 |
3 |
4 |
We can now better understand the first query: it starts from the minute 120 days (30 * 4) before the current time and then moves forward 1 minute for each generated row.
For example, here are five of the rows:
TIMESTAMP |
---|
2023-12-27 20:07:00.000 +0000 |
2023-12-27 20:08:00.000 +0000 |
2023-12-27 20:09:00.000 +0000 |
2023-12-27 20:10:00.000 +0000 |
2023-12-27 20:11:00.000 +0000 |
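One caveat worth noting: Snowflake’s documentation warns that SEQ4 is not guaranteed to produce a gap-free sequence. If exact minute spacing matters to you, a common workaround is to number the generated rows with ROW_NUMBER instead; a minimal sketch of that variant:

SELECT
    DATEADD(
        MINUTE,
        -- ROW_NUMBER gives a dense 0, 1, 2, ... sequence even if SEQ4 has gaps
        ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1,
        DATEADD(DAY, -30 * 4, DATE_TRUNC('minute', CURRENT_TIMESTAMP()))
    ) AS timestamp
FROM
    TABLE(GENERATOR(ROWCOUNT => 60 * 24 * 30 * 4));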
We will then find the completion that corresponds to each minute. Let’s generate the minutes from 1:10 to 2:10 and match each of them to the latest completion at or before it.
WITH completions AS (
SELECT
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
),
minutes AS (
SELECT
convert_timezone(
'UTC',
DATEADD(
MINUTE,
SEQ4(),
(
SELECT
MIN(completion)
FROM
completions
)
)
) AS timestamp
FROM
TABLE(GENERATOR(ROWCOUNT => 100))
WHERE
timestamp BETWEEN (
SELECT
MIN(completion)
FROM
completions
)
AND (
SELECT
MAX(completion)
FROM
completions
)
)
SELECT
*
FROM
minutes
ASOF JOIN completions
MATCH_CONDITION(
minutes.timestamp >= completions.completion
);
# | TIMESTAMP | INTERVAL_START | INTERVAL_END | COMPLETION |
---|---|---|---|---|
1 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:00:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 |
2 | 2024-03-01 01:11:00.000 +0000 | 2024-03-01 01:00:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 |
3 | 2024-03-01 01:12:00.000 +0000 | 2024-03-01 01:00:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 |
4 | 2024-03-01 01:13:00.000 +0000 | 2024-03-01 01:00:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 |
5 | 2024-03-01 01:14:00.000 +0000 | 2024-03-01 01:00:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 |
6 | 2024-03-01 01:15:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
7 | 2024-03-01 01:16:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
8 | 2024-03-01 01:17:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
9 | 2024-03-01 01:18:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
10 | 2024-03-01 01:19:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
11 | 2024-03-01 01:20:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
12 | 2024-03-01 01:21:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
13 | 2024-03-01 01:22:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
14 | 2024-03-01 01:23:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
15 | 2024-03-01 01:24:00.000 +0000 | 2024-03-01 01:05:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 |
16 | 2024-03-01 01:25:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 | 2024-03-01 01:25:00.000 +0000 |
17 | 2024-03-01 01:26:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 | 2024-03-01 01:25:00.000 +0000 |
18 | 2024-03-01 01:27:00.000 +0000 | 2024-03-01 01:10:00.000 +0000 | 2024-03-01 01:15:00.000 +0000 | 2024-03-01 01:25:00.000 +0000 |
We can see from the result that each minute was matched to the latest completion at or before it.
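As an aside, if you’re on a database without an ASOF join, the final SELECT of the query above can be approximated with a regular join plus a window function; here’s a sketch using Snowflake’s QUALIFY clause:

SELECT
    minutes.timestamp,
    completions.interval_start,
    completions.interval_end,
    completions.completion
FROM
    minutes
    LEFT JOIN completions
        ON completions.completion <= minutes.timestamp
-- keep only the most recent completion at or before each minute
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY minutes.timestamp
    ORDER BY completions.completion DESC
) = 1;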
We’ll now extend the final SELECT of the ASOF JOIN query to calculate how stale the data was at each observed minute.
SELECT
*,
datediff(minute, completion, timestamp) AS processing_delay_minutes
We can visualize the result with a quick chart in Snowsight, with the observation time on the x-axis and, on the y-axis, how many minutes ago the last refresh happened. We can see the freshness delay climb at a rate of 1 minute per minute and then drop back to 0 once a run completes.
Now that we have a freshness value for each minute, we can compare it to our objective and calculate the percentage of minutes that meets the objective.
We’ll wrap what was previously the final SELECT in a CTE and add another step to calculate the percentage, which gives us our final query. For our dashboard, we also add another query that counts the number of days with a percentage below 95%; a sketch of that query appears after the result below.
WITH completions AS (
SELECT
to_timestamp_tz('2024-03-01 01:00:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:05:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:10:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:15:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS completion
UNION ALL
SELECT
to_timestamp_tz('2024-03-01 01:20:00 +0000') AS interval_start,
to_timestamp_tz('2024-03-01 01:25:00 +0000') AS interval_end,
to_timestamp_tz('2024-03-01 02:10:00 +0000') AS completion
),
minutes AS (
SELECT
convert_timezone(
'UTC',
DATEADD(
MINUTE,
SEQ4(),
(SELECT MIN(completion) FROM completions)
)
) AS timestamp
FROM
TABLE(GENERATOR(ROWCOUNT => 100))
WHERE
timestamp BETWEEN (SELECT MIN(completion) FROM completions)
AND (SELECT MAX(completion) FROM completions)
),
freshness_per_minute AS (
SELECT
*,
datediff(minute, completion, timestamp) AS processing_delay_minutes
FROM
minutes
ASOF JOIN completions
MATCH_CONDITION(
minutes.timestamp >= completions.completion
)
)
SELECT
to_date(timestamp) AS completion_day,
-- fraction of observed minutes where the data was less than 15 minutes old
SUM(to_number(processing_delay_minutes < 15)) / COUNT(1) AS percentage_minutes_within_minutes_processing_delay_threshold
FROM
freshness_per_minute
GROUP BY ALL;
COMPLETION_DAY | PERCENTAGE_MINUTES_WITHIN_MINUTES_PROCESSING_DELAY_THRESHOLD |
---|---|
2024-03-01 | 50.8% |
Note that the percentage formatting was done for the article; running the query will return a decimal number. I never liked multiplying decimals by 100 to produce percentages, as I believe that’s purely a formatting detail, but feel free to do so in your version of the code.
Our example day didn’t go well, but rest assured that our pipelines are a lot more stable than this. I just thought the example would be boring without a good delay.
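As mentioned earlier, our dashboard also counts the number of days below the 95% target. Assuming the daily percentage query above is saved as a view (the name daily_freshness_percentages is illustrative, not our actual object name), a sketch of that count could look like this:

-- daily_freshness_percentages is a hypothetical view wrapping the
-- daily percentage query shown above
SELECT
    COUNT_IF(percentage_minutes_within_minutes_processing_delay_threshold < 0.95) AS nb_days_outside_slo
FROM daily_freshness_percentages;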
The big picture
We can now build visualizations to look at our SLOs using the completion tables in a single Snowflake account. If we stopped here, we would be in a painful situation, as we have Snowflake accounts in multiple regions. To have a centralized dashboard for our SLOs, we use Snowflake replication to replicate this data to a centralized account. Following the ideas shown in the late-runs example (adding the extra columns to the relevant SELECTs, PARTITION BY clauses, and joins), we add dimensions for the account, region, and pipeline name so we can present visualizations at different levels of granularity.
Using the queries explained in this article, we built the dashboard below in Metabase, dissecting our SLO performance from a few different angles. On the left, we have the percentage of minutes within the freshness objective per day for our three main pipelines, and just below it, the same graph broken down by Snowflake account. On the right, we have the number of late runs per day, which helps us diagnose problems and focus our efforts on error recovery or avoidance (making sure runs don’t fail vs. starting new runs when they do). The bottom part shows the bigger picture, counting the number of days when less than 95% of minutes are within the SLI.
Conclusion
In this article, we discussed the importance of providing SLOs so that other teams know what to expect from the data sources we provide, and so that we can prioritize our efforts to make sure we meet these objectives. I then showed you how to leverage a table of interval completions and SQL functionality such as LAG, GENERATOR, and ASOF JOIN to measure our SLOs. In our case, we built the SLO dashboard at a time when we were already satisfied with the stability of our pipelines, but I encourage you to set it up earlier, as the measurements help motivate and justify the effort and resources invested in reliability. We will definitely be doing so for our future pipelines.
If you’re passionate about software engineering, and you would like to work with other developers who are passionate about their work, make sure to check out our careers page and apply to join the team!