Dealing with Lambda Cold Starts in the Real-time Data Pipeline
In our previous post, we introduced the new architecture of the streaming data pipeline at Coveo, showing how it helps us achieve higher data quality, extensibility, scalability, and resilience. We also mentioned some of the challenges we faced. One of the most critical ones is cold starts in AWS Lambda. In this post, we will dive deep into this issue, share our approach to overcoming it, and showcase the results we’ve achieved along the way.
Building a resilient and high performance real-time data pipeline using AWS Serverless Technologies Part 1
At Coveo, we track how end-users interact with search interfaces by capturing client-side and server-side signals from our customers’ implementations. Initially, we only collected client-side events through the Usage Analytics Write API, which implementers can use to log Click, View, Search, and Custom Events. Coveo Machine Learning models use these events to provide relevant and personalized experiences for end-users. Implementers also use them to build reports and dashboards that give insight into user behavior and support informed decisions when optimizing Coveo solutions. The diagram below shows the real-time data pipeline that receives and processes client-side events.
Original real-time data pipeline
The Curious Case of a Service-level Objective
The context
The site reliability engineering (SRE) team at Coveo is currently hard at work implementing tools and processes with a lofty goal in mind: moving our existing monitoring culture in R&D toward the systematic use of service-level objectives (SLO). Writing blogs about SLOs or announcing products making use of them is pretty common nowadays, and understandably so. Yet I’m finding that most of the discourse around this topic is limited to the same kind of examples and use cases. In this blog post, I will tell the convoluted story of a definitely unconventional SLO.
Keeping our data pipelines under watch and on good behavior
Introduction
Coveo’s data platform team is responsible for ingesting analytics data and making it available to internal
teams as well as to customers. Over the last few years, we’ve matured in our practices, adding a lot more tests,
resiliency to transient errors, and monitoring to our data pipelines. As a next logical step, we wanted to measure
and visualize the rate at which we meet or break our service-level objectives. This article will cover the importance
of service-level objectives and stability as well as the technical aspects of how we were able to measure them.
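To make the idea of measuring the rate at which an objective is met more concrete, here is a minimal sketch. It assumes a hypothetical freshness objective (data available within a target delay) and invented names such as `PipelineRun` and `slo_attainment`; it is not the measurement approach described in the article itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """One pipeline execution. The fields are hypothetical, for illustration only."""
    name: str
    delay_minutes: float  # time between data arrival and data availability


def slo_attainment(runs, target_minutes):
    """Return the fraction of runs whose delay met the target (1.0 when there are no runs)."""
    if not runs:
        return 1.0
    met = sum(1 for run in runs if run.delay_minutes <= target_minutes)
    return met / len(runs)


def slo_is_met(runs, target_minutes, objective):
    """Return True when the attainment reaches the objective, e.g. 0.99 for 99%."""
    return slo_attainment(runs, target_minutes) >= objective


# Made-up sample data: three of the four runs finish within 30 minutes.
runs = [
    PipelineRun("ingest", 12.0),
    PipelineRun("ingest", 45.0),
    PipelineRun("ingest", 20.0),
    PipelineRun("ingest", 8.0),
]
print(f"attainment: {slo_attainment(runs, target_minutes=30.0):.0%}")
print("objective met:", slo_is_met(runs, target_minutes=30.0, objective=0.99))
```

Tracking attainment as a ratio rather than a pass/fail flag makes it possible to plot the trend over time and to see how close a pipeline is to breaking its objective.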
Coveo Blitz, where you have to code at lightning speed
Last January, we held the 14th edition of Coveo Blitz, our annual programming competition for students. Those familiar with the event will recognize the formula of recent years: we present a game of our own making, and participants then have 10 hours to program a bot that can play it and win in matches of 2 or 4 teams.
This year, our challenge had a space theme: each team commands the crew of a ship that must face other teams and be the last one standing. Our designers came up with a game that was, in my view, one of the most sophisticated, but also one of the most complex, of recent editions: there was a wide variety of possible actions, and therefore of strategies to explore, not to mention some technical challenges on the implementation side. I would like to explore some of the strategies used during the last competition and, more importantly, how a player can tackle this kind of challenge.
Patterns for project failure
Introduction
My name is Nicolas Juneau and I am Coveo’s CFO (Chief Failure Officer). As the blog review team
has yet to unpack a huge backlog of articles to review, I took this opportunity to write on this
blog about a subject nobody wants me to talk about: patterns for project failure.
We’ve all heard the talks, we’ve all read the articles: we know how to ensure a project’s success.
After all, software engineering is a tried and true discipline as old as civil engineering. Julius
Caesar successfully designed, wrote, and deployed
his very own cipher back in the Roman Empire, so we
should have this figured out by now. Today, let’s take a break from articles teaching us what to do
and let’s focus on something more entertaining: striving for failure.
Let me do what we always try to do on Star Trek: hopefully entertain you, perhaps even make you
laugh a couple of times. And when your guard is down, slip in a heavy idea or two…
– Gene Roddenberry, “Inside Star Trek”, 1979
Creating Dungeons & Dragons GPT With Coveo GenAI
As of December 15, 2023, GenAI is generally available (GA) with Coveo. Many of the steps I describe in this blog post are no longer needed, as the flow is now much simpler.
You can find more information on Relevance Generative Answering (RGA) in our documentation.
As you may have heard, Coveo recently released its Generative Answering solution (also called “GenAI”). It’s been all the hype internally at Coveo, as well as externally with multiple customers and partners approaching us to play with this new product and implement it on their end.
Similarly, there’s recently been a lot of hype about the release of Baldur’s Gate 3, the Larian Studios video game based on Wizards of the Coast’s classic TTRPG Dungeons and Dragons.
As a big fan of D&D and a big fan of new tech, I thought it would be a great idea (and a great way to sink my teeth into a new Coveo product) to create a Coveo GenAI-powered bot that can answer questions about Dungeons and Dragons.
How we got to an Active-Active production environment in the US
In the past few years, our cloud service provider, AWS, has been overall pretty reliable.
But nothing in life is perfect, and as Werner Vogels (CTO of Amazon) has repeated many times, “Everything fails all the time.”
Over the years, we have seen regional outages affecting a subset of the services that we leverage.
When those outages occurred, we often relied on another AWS region to quickly spin something up that allowed us to continue delivering our services.
About four years ago, we delivered our multi-region feature to reduce latency, and just recently we worked towards leveraging those regions in an active-active way for our search infrastructure.
The main driver was to improve resiliency and to handle those outages the same as any given Tuesday.
Error Handling Tradeoffs and Crashing in Production
There are only two hard things in Computer Science: memory problems, error handling, and of course off-by-1 errors.
For years, I’ve felt uncertain about what to do when something unexpected happens in a program I wrote.
Should I return an error code, crash, crash in debug builds only, throw an exception…
This uncertainty sparked my curiosity, and slowly, as I accumulated years of experience, I became more aware of the tradeoffs behind each strategy.
Because, of course, the answer is, as always, it depends.
Temporary privileges as a service, a nice engineering challenge
The Coveo infrastructure is constantly growing. DevOps engineers add new regions and services, which leads to more systems that can break, more complex access management, and more complex audit logging. If I tell stakeholders that the entire R&D department needs always-on access to all the services they deploy and own in a production environment, some of those stakeholders will tell me that the risks are too high and that it is not acceptable. On the other hand, if only a handful of people can help when there is an incident in production, the on-call access management person will have to be woken up every time an engineer needs access to a specific resource. This makes access management unhappy, and increases the time to resolution, potentially even causing a breach of our service level agreement. Leadership won’t like that.
This is why Coveo needed a good middle ground. The R&D department needed a system that allowed selected employees to gain privileged access to systems they own for a short period of time, fix the incident, and follow up with a post-mortem. Back in 2020, Coveo adopted strongDM to manage privileged access rights. While it already supported granting temporary privileges, it lacked a way for employees to quickly request a temporary privilege without waking up the strongDM administrator at 3 AM. Using the strongDM APIs, the R&D Defense team built that system.
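The core idea, a self-service grant that is limited to resource owners and expires on its own, can be sketched as follows. This is only an illustration under stated assumptions: `TemporaryGrant`, `request_grant`, and the `owners` mapping are hypothetical names, and the sketch does not use or represent the strongDM APIs or the system the team actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TemporaryGrant:
    """A time-boxed privilege on one resource (hypothetical model)."""
    user: str
    resource: str
    expires_at: datetime

    def is_active(self, now):
        """A grant is usable only until its expiry time."""
        return now < self.expires_at


def request_grant(user, resource, owners, now, duration=timedelta(hours=1)):
    """Grant access without human approval, but only to owners of the resource."""
    if user not in owners.get(resource, set()):
        raise PermissionError(f"{user} does not own {resource}")
    return TemporaryGrant(user, resource, now + duration)


# Hypothetical ownership data: which users own which production resource.
owners = {"search-db": {"alice"}}
now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)

grant = request_grant("alice", "search-db", owners, now)
print(grant.is_active(now + timedelta(minutes=30)))  # still inside the window
print(grant.is_active(now + timedelta(hours=2)))     # window has elapsed
```

Restricting requests to owners and expiring grants automatically is what lets such a system skip the 3 AM approval step while keeping the access auditable and short-lived.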