Tech Blog

The Curious Case of a Service-level Objective

Written By
Jean-François Smith
The context
The site reliability engineering (SRE) team at Coveo is currently hard at work implementing tools and processes with a lofty goal in mind: moving our existing monitoring culture in R&D toward the systematic use of service-level objectives (SLOs). Writing blog posts about SLOs or announcing products that make use of them is pretty common nowadays, and understandably so. Yet I’m finding that most of the discourse around this topic is limited to the same kinds of examples and use cases. In this blog post, I will tell the convoluted story of a definitely unconventional SLO.

Keeping our data pipelines under watch and on good behavior

Written By
Alexis Chicoine
Introduction
Coveo’s data platform team is responsible for ingesting analytics data and making it available to internal teams as well as to customers. Over the last few years, we’ve matured in our practices, adding a lot more tests, resiliency to transient errors, and monitoring to our data pipelines. As a next logical step, we wanted to measure and visualize the rate at which we meet or break our service-level objectives. This article will cover the importance of service-level objectives and stability as well as the technical aspects of how we were able to measure them.

Coveo Blitz, where you have to develop at lightning speed

Written By
Marie Beaulieu
Last January, the 14th edition of Coveo Blitz, our annual programming competition for students, took place. Those familiar with the event will recognize the formula of recent years: we present a game of our own making, then participants have 10 hours to program a bot that can play it and win matches between 2 or 4 teams. This year, our challenge had a space theme: each team commands the crew of a ship that must face other teams in order to be the last one surviving. Our designers came up with a game that was, in my opinion, one of the most sophisticated, but also one of the most complex, of recent editions: there was a wide variety of possible actions, and therefore of strategies to explore, not to mention certain technical challenges in the implementation. I would therefore like to explore some of the strategies used in the latest competition and, more importantly, how a player can tackle this kind of challenge.

Patterns for project failure

Written By
Nicolas Juneau
Introduction
My name is Nicolas Juneau and I am Coveo’s CFO (Chief Failure Officer). As the blog review team has yet to unpack a huge backlog of articles to review, I took this opportunity to write on this blog about a subject nobody wants me to talk about: patterns for project failure. We have all heard the conference talks and read the articles: we know how to ensure a project’s success. After all, software engineering is a tried and true discipline as old as civil engineering. Julius Caesar successfully designed, wrote, and deployed his very own cipher back in the Roman Empire, so we should have this figured out by now. Today, let’s take a break from articles teaching us what to do and let’s focus on something more entertaining: striving for failure.
“Let me do what we always try to do on Star Trek: hopefully entertain you, perhaps even make you laugh a couple of times. And when your guard is down, slip in a heavy idea or two…” – Gene Roddenberry, “Inside Star Trek”, 1979

Creating Dungeons & Dragons GPT With Coveo GenAI

Written By
Alexandre Moreau
Update: As of December 15th, 2023, GenAI is now GA with Coveo. Many of the steps I describe in this blog post are no longer needed, as the flow is now much simpler. You can find more information on Relevance Generative Answering (RGA) in our documentation.
As you may have heard, Coveo recently released its Generative Answering solution (also called “GenAI”). It’s been all the hype internally at Coveo, as well as externally, with multiple customers and partners approaching us to play with this new product and implement it on their end. Similarly, there’s recently been a lot of hype about the release of Baldur’s Gate 3, the Larian Studios video game based on Wizards of the Coast’s classic TTRPG Dungeons & Dragons. As a big fan of D&D and a big fan of new tech, I thought it would be a great idea (and a great way to sink my teeth into a new Coveo product) to create a Coveo GenAI-powered bot that can answer questions about Dungeons & Dragons.

How we got to an Active-Active production environment in the US

Written By
Kevin Larose
In the past few years, our cloud service provider, AWS, has been pretty reliable overall. But nothing in life is perfect, and as Werner Vogels (CTO of Amazon) has repeated many times, “Everything fails all the time.” Over the years, we have seen regional outages affecting a subset of the services that we leverage. When those outages occurred, we often relied on another AWS region to quickly spin something up that allowed us to continue delivering our services. About 4 years ago, we delivered our multi-region feature to reduce latency, and just recently we worked toward leveraging those regions in an active-active way for our search infrastructure. The main driver was to improve resiliency and to handle those outages the same as any given Tuesday.

Error Handling Tradeoffs and Crashing in Production

Written By
Kevin Lalumiere
There are only two hard things in Computer Science: memory problems, error handling, and of course off-by-1 errors. For years, I’ve felt uncertain about what to do when something unexpected happens in a program I wrote. Should I return an error code, crash, crash in debug builds only, throw an exception…? This uncertainty sparked my curiosity, and slowly, as I accumulated years of experience, I became more aware of the tradeoffs behind each strategy. Because, of course, the answer is, as always: it depends.

Temporary privileges as a service, a nice engineering challenge

Written By
Jean-Philippe Lachance
The Coveo infrastructure is constantly growing. DevOps engineers add new regions and services, which leads to more systems that can break, more complex access management, and more complex audit logging. If I tell stakeholders that the entire R&D department needs always-on access to all the services they deploy and own in a production environment, some of those stakeholders will tell me that the risks are too high and that it is not acceptable. On the other hand, if only a handful of people can help when there is an incident in production, the on-call access management person will have to be woken up every time an engineer needs access to a specific resource. This makes access management unhappy and increases the time to resolution, potentially even causing a breach of our service-level agreement. Leadership won’t like that. This is why Coveo needed a good middle ground. The R&D department needed a system that allowed selected employees to gain privileged access to systems they own for a short period of time, fix the incident, and follow up with a post-mortem. Back in 2020, Coveo adopted strongDM to manage privileged access rights. While it already supported granting temporary privileges, it lacked a way for employees to quickly request a temporary privilege without waking up the strongDM administrator at 3 AM. Using the strongDM APIs, the R&D Defense team built that system.

Accelerate your Maven CI builds with distributed named locks using Redis

Written By
Jacques-Etienne Beaudet
Everybody loves a fast, responsive continuous integration (CI) setup. While it’s fun to battle each other on rolling chairs, having quick feedback on pull request builds and fast deployments is an important part of a good development environment, and it is also something we constantly have to invest time in. In this post, we will show you a neat way to safely use your local Maven repository across multiple build processes using the Named Locks feature of Maven Artifact Resolver. This will speed up your concurrent builds by reducing the time required to download dependencies and will also help maximize your CI instance usage.

Improving download speed of an S3 proxy in Go

Written By
Andy Emond
There is an old adage that says, “Hardware eventually fails. Software eventually works.” As the Coveo Blitz competition is approaching, we are ramping up the platform to receive more than 135 participants who will produce software bots that fight against one another for the prestigious Coveo Cup. One of the cool features of the platform, which we’ve created and improved over the years, is the ability to visualize past fights by downloading a replay file. But this year, like every year, was different. This year, the replays were up to 50 MiB in size, and the download time was suffering from it. By suffering I mean 50 seconds to download 40 MiB; waiting for an MP3 to download in the 2000s kind of suffering. The replays are stored in S3 and proxied by our servers written in Go, so let’s see what is going on in the Go code, and maybe it will eventually work!