2024-08-02

A minor release resulted in a cascade of errors that disrupted platform stability and customer service. How it went wrong, and what we've learnt.

On Friday, August 2nd, 2024, Clever Cloud's platform became unstable for many customers. The impact affected both the API itself and some services running as add-ons or runtimes.

These events occurred because:

  • A planned minor maintenance of the Messaging component locked the cluster, causing an incident (www.clevercloudstatus.com/incident/880) between 2:00 PM CEST and 2:29 PM CEST
  • As an indirect consequence, the infrastructure experienced I/O and memory pressure between 3:40 PM CEST and 4:16 PM CEST, causing some hypervisors to crash.

Impacts were observed for customers relying on the EU-FR-1 (PAR) region, and on remote zones depending on the EU-FR-1 control plane (OVHcloud, Scaleway, and Oracle). Private and on-premises zones weren't impacted.

Product | Impact
Runtimes | Customers with applications running on impacted hypervisors may have seen performance degradation. Customer application traffic served through impacted load balancers may also have suffered from traffic issues.
Runtimes deployment | The Deployment API was unreachable during the incident, so new deployments and upscale/downscale operations could not succeed.
Databases (MySQL, PostgreSQL, Redis, MongoDB) | Customers with databases running on impacted hypervisors may have seen performance degradation or even unavailability. No data was lost during the incident.
Cluster services (Elastic, Couchbase, Cassandra, …) | Depending on both the infrastructure pressure and the running technology, some nodes may have survived the incident, so the provided service may have remained available. We know for sure that some clusters had issues during the incident while others didn't.
VPN / IPSec | Most of the VPN endpoints were unavailable during the incident.
Observability | The service was unavailable but no data was lost.

Timeline

Timeline | Component | Description
2024-08-02 10:06 CEST | Messaging | Clever Cloud data team starts a minor rolling maintenance upgrade of the Apache Pulsar component to release 3.3.1. This maintenance also enables a new balancer algorithm for bundle placement optimization.
10:08 CEST | Messaging | Monitoring shows connection issues to the messaging platform.
10:09 CEST | Messaging | Engineers confirm connection issues when producing data into the event bus.
10:10 CEST | Messaging | Messaging issues impact agents on hypervisors, log forwarders from load balancers and deployments — Impact starts —
10:11 CEST | Messaging | Monitoring graphs show the backlog ramping up and no new producers / consumers succeeding (a backlog means consumption isn't keeping up with production).
10:13 CEST | Messaging | Monitoring starts to show a recovery state - reads happening again - cluster convergence.
10:18 CEST | Messaging | Monitoring shows no writes anymore. Backlogs are stalled and there are no reads from the event bus.
10:20 CEST | Messaging | The cluster is considered defective.
10:58 CEST | Messaging | The engineering team understands that the new parameter conflicts with a transparent rolling upgrade. They roll back the parameters.
11:10 CEST | Deployments | Customers can no longer deploy their applications.
11:36 CEST | Messaging | Messaging brokers fail to restart. Conflicting metadata blocks the reload. The engineering team shuts down client proxies to release brokers.
11:36 CEST | Messaging | Monitoring shows the "Lookup request" queue is full on all brokers (50k pending requests).
11:46 CEST | Messaging | An unassigned topic fails to load (the unloaded state prevents it from being manipulated with admin commands, like deletion).
11:48 CEST | Deployments | The engineering team behind deployments looks for a countermeasure to overcome the messaging issues.
11:50 CEST | Deployments | As a workaround, they switch notification events over to RabbitMQ.
12:22 CEST | Messaging | All brokers have been shut down. The incriminated topics are successfully loaded and deleted. All brokers are now being started progressively.
12:28 CEST | Messaging | All brokers have been started.
12:30 CEST | Infrastructure | Many hypervisors see their CPU load drastically increase. Almost all hypervisors in all regions are impacted, some more than others.
12:31 CEST | Deployments | The Deployments workaround is running successfully. Deployments are OK in theory, but fail to find enough resources to consume the queues.
12:35 CEST | Deployments | Load balancers can't sync with the Deployments API due to Messaging issues. This leads to unsynced load balancer configurations for moving application deployments (scale up/down, new deployments, etc.).
12:35 CEST | Infrastructure | Monitoring alerts "HV Agent has too many tasks" start to fire. From this point, multiple hypervisors may randomly kill VM management processes.
12:36 CEST | Load Balancers | Some load balancers are among the killed VMs; watchdog monitoring components reload every impacted service exposed through the load balancers.
12:38 CEST | Core APIs | The main API components crash due to out-of-memory (OOM) errors.
12:39 CEST | Infrastructure | The watchdog is stopped to prevent it from reloading all failed components. Engineers prefer to understand what's happening and start components one by one.
12:42 CEST | Network | Some private networks are flapping; some services like the observability platform become unavailable for a couple of minutes.
12:47 CEST | Infrastructure | One hypervisor reboots (kernel panic - hung VM tasks while interacting with the KVM API).
12:48 CEST | Infrastructure | Three hypervisors reboot (kernel panic - hung VM tasks while interacting with the KVM API or the ext4 filesystem).
12:50 CEST | Observability | Metrics API ingestion is disabled to preserve the underlying storage cluster. Buffering ensures ingestion availability.
12:51 CEST | Infrastructure | One hypervisor reboots (kernel panic - hung VM tasks while writing to ext4 disks).
12:54 CEST | Messaging | Messaging proxies are restarted. Connections are recovering and production / consumption of messages is OK. — Messaging OK —
13:05 CEST | Load Balancers | The Clever Cloud API is down, and the admin tools as well.
13:10 CEST | Core APIs | The main API starts to consume events from the workaround queue.
13:30 CEST | Infrastructure | Clever Cloud engineers identify too much CPU load on hypervisors, caused mostly by small VMs (pico/nano).
13:30 CEST | Observability | A few critical cluster nodes are not reachable, reporting "the database may have data loss and availability loss"; the cluster is in a locked state. Engineers decide to shut down the whole cluster for a proper restart later.
13:35 CEST | Observability | A global maintenance is set on all monitoring instances in all regions to avoid them sending too many monitoring events and overloading orchestrator actions. No more monitoring actions are sent (upscale, downscale, unreachable, disk, etc.).
13:35 CEST | Infrastructure | Clever Cloud engineers notice too many disk reads on hypervisors, mostly from small VMs (pico/nano). They think it might be related to Vector not being able to push VM logs to Pulsar, causing small VMs to swap.
13:44 CEST | Core APIs | The Clever Cloud API is reading and acknowledging deployment events; the backlog drops from 100k at 12:50 CEST to 8.5k.
13:49 CEST | Core APIs | The Clever Cloud API has finished processing deployment events.
13:50 CEST | Infrastructure | Some hypervisors (12) are flagged to drain some of their workload.
13:50 CEST | Load Balancers | A shutdown is performed on some load balancers to free CPU and memory, as other instances still need resources.
14:00 CEST | Core APIs | Core services are being recovered and deployments are being reloaded. This will re-synchronize load balancers for customers' applications trying to reach their new deployments.
14:08 CEST | Core APIs | Some zombie VMs are cleaned up to accelerate the recovery process.
14:38 CEST | Load Balancers | We gather information from load balancer configurations to verify the VM states in the back office admin interface.
14:39 CEST | Core APIs | Critical core services are OK. Deployments are being rolled out.
14:50 CEST | Core APIs | The admin interface has been moderated instead of stopped, by human error. Quickly fixed.
15:03 CEST | Load Balancers | The back office admin interface is up and running.
15:07 CEST | Deployments | Orchestrator instruction queues are still difficult to process. We keep receiving a lot of Monitoring actions even while the monitoring instances are in maintenance mode.
15:40 CEST | Infrastructure | Hypervisors experienced reboots around 12:48 CEST. The recovery process is ongoing and will take a couple of minutes.
15:45 CEST | Observability | We clear the monitoring instance alert queue to drop monitoring events that were still queued, to avoid flooding the orchestrator with instructions.
15:56 CEST | Core APIs | Some private Wireguard networking meshes still seem to be experiencing network issues.
16:00 CEST | Deployments | We observe that some orchestrator actions take multiple seconds (+30s in some cases), slowing down orchestration operations.
16:05 CEST | Deployments | A RabbitMQ node is on an overloaded hypervisor and is thus overloaded too. It slows down the cluster used by the orchestrator.
16:05 CEST | Infrastructure | Infrastructure engineers deliver new hypervisors in the EU-FR-1 region that had been drained and parked for upgrade, adding capacity to relieve other overloaded hypervisors.
16:15 CEST | Deployments | The RabbitMQ node is removed from the cluster because it is overloaded and disturbs orchestration. The orchestrator is now working correctly.
16:16 CEST | Core APIs | Customer applications are being deployed to scale up deployments.
16:19 CEST | Load Balancers | One ADC has a kernel panic and is unable to boot.
16:19 CEST | Load Balancers | Admin ADCs are running and the configuration is synchronized. A rule is removed on load balancers to prevent customers from deploying through the console.
16:24 CEST | Core APIs | In the meantime, we continue to identify noisy VMs impacted by the Pulsar outage.
16:33 CEST | Observability | Attempt to restart the Metrics API - a Wireguard mesh networking issue is encountered due to stale conntrack state.
17:05 CEST | Infrastructure | Previously flagged HVs are back to scheduling runtime VMs.
17:06 CEST | Load Balancers | Previously shut down ADCs are now up and running.
17:06 CEST | Load Balancers | Every regional L7 load balancer has been rolling-restarted.
17:07 CEST | Load Balancers | The rolling restart of every L4 load balancer begins.
17:14 CEST | Deployments | Some load balancer virtual machines are still unreachable. The HV agent is restarted, and the load balancers as well.
17:28 CEST | Observability | The FoundationDB cluster powering the Metrics API is now healthy.
17:30 CEST | Load Balancers | Every L4 load balancer has been rolling-restarted.
17:30 CEST | Load Balancers | Every dedicated L7 load balancer has been rolling-restarted.
17:32 CEST | Observability | Metrics API directories are started and warming up.
17:33 CEST | Load Balancers | Some customer endpoints may be unresponsive; engineering and support teams are investigating.
17:49 CEST | Load Balancers | These endpoints are now back online.
18:00 CEST | Observability | Cleanup of duplicate hosts on the EU-FR-1 region Monitoring instance.
18:13 CEST | Observability | Monitoring for the PAR6 AZ is back in production and sending monitoring events.
18:19 CEST | Observability | Monitoring for the PAR7 AZ is back in production and sending monitoring events.
18:20 CEST | Core APIs | All deployments are now being rolled out. — Deployments OK —
18:22 CEST | Load Balancers | Access log production is OK; the processing pipeline is starting.
18:24 CEST | Observability | Monitoring for the PAR8 AZ is back in production and sending monitoring events.
18:30 CEST | Infrastructure | A few hundred VMs are still consuming very high CPU and are being cleaned up.
18:30 CEST | Customers | We verify that premium customers don't have applications left without VM instances.
18:35 CEST | Core APIs | We estimate approximately 1h30 until all application deployments are fully recovered.
18:36 CEST | Infrastructure | All hypervisors have recovered to a nominal load.
18:36 CEST | Observability | Maintenance is removed on Monitoring; we start receiving on-call alerts again.
18:39 CEST | Customers | We verify that non-premium customers don't have applications left without VM instances.
18:40 CEST | Deployments | We restart all HV agents on the infrastructure, as we suspect communication errors between agents on hypervisors and the control plane.
18:40 CEST | Core APIs | We restart the add-on provider APIs, which seem unable to communicate with the control plane event bus.
18:53 CEST | Deployments | All dangling virtual machines in stopping state have been destroyed and applications are deploying.
19:05 CEST | VPN - IPSec | All IPSec links should be back online.
20:45 CEST | Observability | Metrics API directories are ready to serve.
20:50 CEST | Observability | Metrics API queries are enabled.
20:50 CEST | Observability | Telemetry confirms that we are processing access logs.
20:51 CEST | Observability | Metrics API ingest is enabled with a ramp-up QoS policy.
22:53 CEST | Observability | The Metrics API ingestion policy is disabled; the cluster is fully functional.
22:53 CEST | | All impacts resolved — IMPACT END —

Analysis

Messaging Maintenance (Apache Pulsar)

Apache Pulsar is used as the fundamental messaging layer. This component has interesting multi-tenancy capabilities and supports a very high cardinality of namespace/topic creation. It's not exempt from bugs, and it regularly receives a fair amount of bug fixes. Minor releases are applied on a rolling upgrade basis. When that happens, topics are handed over from one broker to another. Since the architecture of Apache Pulsar strongly separates the messaging layer from the storage layer, brokers are stateless and the topic handover is a seamless operation that doesn't affect connected clients: they receive a notice that the topic has moved to another broker, and connect to the new one. This is an expected behavior implemented by clients and constantly happening for broker load management, resizing, etc. In our topology, there are additional proxies between customers' clients and brokers, so only the proxies see this administrative rebalancing.

For this minor maintenance, we opted in to an optimized balancer named TransferShedder instead of the native OverloadShedder.

loadManagerClassName=org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.extensions.scheduler.TransferShedder

Pulsar scalability is achieved through the concept of bundles, which are a kind of scalability group used to perform administrative tasks. This change also implies moving from the modular bundle unloading strategy to the extensible one. These changes bring many improvements by removing storage on the metadata layer and by pre-assigning the next broker when unloading a topic, which shortens the switch from one broker to another.
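
As an illustration, bundle handovers can be observed or triggered with the standard pulsar-admin CLI; this is only a sketch, and the tenant/namespace name below is hypothetical:

# List the bundle ranges of a namespace (namespace name is an example)
pulsar-admin namespaces bundles orga_example/logs
# Hand a bundle over to another broker; connected clients reconnect transparently
pulsar-admin namespaces unload --bundle 0x80000000_0xc0000000 orga_example/logs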

This new extensible unload strategy relies on non-persistent real-time replication among brokers: when the first broker is upgraded, it initializes the system topic used to synchronize load metadata with the other brokers. Unfortunately, the extension failed to initialize properly and didn't create the system topic, without crashing the broker, which our monitoring therefore still reported as up:

2024-08-02T08:48:53,344+0000 [CompletableFutureDelayScheduler] WARN  org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - clevercloud-pulsar-broker-c3-n1:8080 failed to wait for owner for serviceUnit:orga_79bb9459-98d7-48bb-91ea-cd64204d8140/logs/0x80000000_0xc0000000; Trying to return the current owner:Optional[clevercloud-pulsar-broker-c3-n1:8443] java.util.concurrent.TimeoutException: null
2024-08-02T08:48:53,344+0000 [CompletableFutureDelayScheduler] ERROR org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - clevercloud-pulsar-broker-c3-n1:8080 failed to get active owner broker. serviceUnit:orga_79bb9459-98d7-48bb-91ea-cd64204d8140/logs/0x80000000_0xc0000000, state:Owned, owner:Optional[clevercloud-pulsar-broker-c3-n1:8443]
java.util.concurrent.CompletionException: java.lang.IllegalStateException: The new owner clevercloud-pulsar-broker-c3-n1:8443 is inactive. Caused by: java.lang.IllegalStateException: The new owner clevercloud-pulsar-broker-c3-n1:8443 is inactive.

Since the monitoring didn't catch it and the broker was up, the rolling process continued. The more brokers restarted, the more they tried to reach the topic metadata, filling the Lookup requests queue:

2024-08-02T08:49:12,448+0000 [pulsar-io-3-5] WARN  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - Failed to get partitions for topic to determine if new partitions are added
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TooManyRequestsException: {"errorMsg":"Failed due to too many pending lookup requests","reqId":1946266099531305237, "remote":"clevercloud-pulsar-broker-c3-n9/192.168.2.9:6650", "local":"/192.168.2.9:58654"} Caused by: org.apache.pulsar.client.api.PulsarClientException$TooManyRequestsException: {"errorMsg":"Failed due to too many pending lookup requests","reqId":1946266099531305237, "remote":"clevercloud-pulsar-broker-c3-n9/192.168.2.9:6650", "local":"/192.168.2.9:58654"}

Ultimately, the topic was never properly created, and brokers ended up in a state where no topics were loaded:

2024-08-02T09:16:23,828+0000 [pulsar-io-3-7] WARN  org.apache.pulsar.client.impl.BinaryProtoLookupService - [non-persistent://pulsar/system/loadbalancer-broker-load-data] failed to get Partitioned metadata : {"errorMsg":"Failed to get partition metadata","reqId":2658063197115238188, "remote":"clevercloud-pulsar-broker-c3-n1/192.168.2.1:6650", "local":"/192.168.2.1:34522"} org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: {"errorMsg":"Failed to get partition metadata","reqId":2658063197115238188, "remote":"clevercloud-pulsar-broker-c3-n1/192.168.2.1:6650", "local":"/192.168.2.1:34522"}
2024-08-02T09:16:24,346+0000 [broker-client-shared-scheduled-executor-7-1] WARN  org.apache.pulsar.client.impl.PulsarClientImpl - [topic: non-persistent://pulsar/system/loadbalancer-broker-load-data] Could not get connection while getPartitionedTopicMetadata -- Will try again in 795 ms 
2024-08-02T09:16:24,348+0000 [pulsar-io-3-15] ERROR org.apache.pulsar.broker.service.ServerCnx - non-persistent://pulsar/system/loadbalancer-broker-load-data [id: 0xa5782645, L:/192.168.2.1:6650 - R:/192.168.2.1:34522] [SR:192.168.2.1, state:Connected] Failed to get partition metadata java.util.concurrent.CompletionException: java.lang.IllegalStateException: The registry already closed.Caused by: java.lang.IllegalStateException: The registry already closed.

When we noticed the drop in bandwidth on the platform, we understood that no topics were loaded. We rolled back the change to the unload strategy but decided to stick with the 3.3.1 release.

Switching back to the old policy is a quick move. Configuration revisions are versioned, and this one in particular was pretty simple, yet we faced a new problem. The new unload strategy had created a system topic that brokers couldn't load because of a metadata conflict. The topic wasn't assigned to any broker, and it was a showstopper at initialization: brokers wouldn't start. We ended up stopping all brokers and starting only one, preventing metadata conflicts between them; that broker succeeded in loading the topic, so we could finally delete it, unblocking the start process for all brokers. After that, all brokers were back to a nominal state in less than 10 minutes, so we started the proxies.
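
For reference, the rollback essentially amounted to reverting the two broker.conf lines shown earlier. A minimal sketch of the reverted values, using the class names shipped with Pulsar (the exact values previously used on our clusters may differ):

# broker.conf - revert to the modular load manager and its native shedder (illustrative)
loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.OverloadShedder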

It's worth mentioning that for non-minor releases, we use a simulation process that emulates a full environment in which we inject changes. This validates upgrades and changes by observing the infrastructure's behavior before going to production. This process wasn't used for minor maintenance upgrades or (apparently) small changes in the configuration profile.

Hypervisor IO/CPU pressure

The unavailability of the Apache Pulsar messaging layer led to buffering by the telemetry agents on the Virtual Machines (VMs) running on hypervisors (servers). Customer VMs embed an observability agent named Vector (https://vector.dev/), which we chose for being very efficient and safe (it's written in safe Rust). These agents started to buffer in memory while the Messaging service endpoints weren't available. Even though the buffer format is rather efficient, on small VMs it reached a memory limit, which triggered a lot of memory pressure on the kernel. When that happens, the kernel tries to flush all possible memory, wipes all disk caches, and ultimately removes process pages from memory, reloading them from disk instruction by instruction.

At this stage, this generated a lot of disk I/O pressure on the underlying hypervisors (we noticed ~25 Gbps on disks). These small VMs are spread across all our Availability Zones, so we ended up with an overloaded server infrastructure.

Clever Cloud also runs on top of Clever Cloud, which usually has many benefits. In this case, it turned out to be a chicken-and-egg situation: we needed the scheduler to perform some massive actions and honor deployments, but the control plane was performing suboptimally and the infrastructure struggled to provide available resources.

Hypervisor crashes — Kernel Panics

A few minutes after the hypervisors' load increased, we started to see kernel panics on some of our host hypervisors. Those kernel panics were triggered by the khungtaskd kernel thread. This watchdog thread, living inside the Linux kernel, is responsible for watching all tasks the kernel is handling. It iterates over all tasks and checks those in the TASK_UNINTERRUPTIBLE state.

The TASK_UNINTERRUPTIBLE state means that a task can't be interrupted. This is the D state you can see in ps or htop. For example, you can get this state by reading from or writing to the disk, but other syscalls can also put your process in this state.
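
As an illustration, such tasks can be spotted from a shell on the host; a minimal sketch using standard ps output formatting:

# List processes currently in uninterruptible sleep (state D)
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'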

If a task stays in this state for too long (the default is 120s), khungtaskd dumps a trace of the task (which program caused it, its current call stack, etc.) in the kernel ring buffer. In our case, we also chose to panic when we have programs in this state: it usually means the kernel either has a bug or something else is going wrong. This watchdog has already helped us multiple times in the past to crash a host before it becomes completely unreachable and unpredictable.
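
This behavior is driven by standard Linux sysctls; a minimal sketch of the relevant knobs (120s is the kernel default timeout, and our exact production values may differ):

# /etc/sysctl.d/hung_task.conf - panic when a task stays in D state for too long
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1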

Some of our hypervisors had processes stuck in TASK_UNINTERRUPTIBLE for too long, for various reasons: writing to disk, KVM syscalls, etc. An analysis of the source code (https://elixir.bootlin.com/linux/v6.10.3/source/kernel/hung_task.c#L91) confirms our understanding of the observed behavior.

Stuck processes leading to a Kernel Panic

Monitoring down → upscale

During the event, the monitoring infrastructure triggered a lot of events for both infrastructure and customer services. We observed almost 3x the usual number of deployments that day. Multiple reasons can explain this observation:

  • The loss of hypervisors due to kernel panics made a lot of instances unreachable, leading to lots of deployments to honor the contract (e.g. number of minimum instances)
  • VMs that were under memory pressure were usually unresponsive and so they were (correctly) seen as unreachable
  • VMs that were on overloaded hypervisors might also have been slowed down due to the high CPU usage of the hypervisor. Those VMs had less CPU time to do their usual operations, making their load average increase. Our monitoring interpreted that as an increase in CPU usage and ordered them to upscale if they were configured to do so
  • Some of our monitoring instances took more time to do their checks because they were overloaded as well. Because monitoring data was missing due to late checks, it was interpreted as a lack of data and thus as unreachable VMs, triggering more Monitoring/Unreachable events
  • Customers seeing their services unavailable tried to trigger redeployments of their applications, which added more scheduling tasks to the infrastructure

By observing many key metrics (CPU load, CPU steal, availability, health checks, etc.), our monitoring infrastructure triggered a significantly higher rate of deployments, which at the time helped neither the control plane, already processing as fast as it could, nor the overall infrastructure, already overloaded.

Soothing the control plane

Once we understood that some Virtual Machines (VMs) were inducing an abnormally high load on hypervisors (HVs), we started to identify machines we could shut down to free resources. All non-critical services were shut down, such as the Observability platform: we stopped the long-term storage for Observability metrics data (our Warp10/FoundationDB cluster). Critical services benefit from in-memory Observability (Warp10 in-memory standalones), so this was an acceptable way to preserve all available resources.

Overloaded VMs were collected and sorted, and it appeared they shared a common pattern: they were small VMs with few memory resources, for the reason covered a few lines above. Once identified, the SRE team killed them one by one. Nothing fancy here, but we soon started to regain control over all hypervisors.

Lessons & enhancements

We've learnt many lessons this Friday and have already identified many enhancements we will bring to the platform. This kind of incident occurring during the summer holiday period is tricky, but we escalated the problem quickly, calling people back from holiday as early as 13:00 CEST. We identified two groups of enhancements. The first one is measurable and tied to key infrastructure management; some of the measurements are:

  • Pulsar Broker health
  • Pulsar Cluster health
  • Hypervisor load
  • Hypervisor I/O
  • Hypervisor reboot (Kernel Panic)
  • Zombie VMs
  • RabbitMQ Broker health
  • RabbitMQ Cluster health
  • Deployment queue consumption

These measurements give us KPIs to assess where we've been operationally good, and where we lacked visibility or reaction. We qualified them with four inputs:

  • Time of event
  • Time of detection
  • Time of decision
  • Time of resolution

This factual approach helped us challenge ourselves on established rules and showed we can improve or rethink some parts of the infrastructure.

Other topics raised but without KPI analysis:

  • Dealing with too many monitoring alerts
  • Backup Bastion — a regional bastion is vulnerable
  • Control Plane recovery process
  • Correlation between monitoring and Orchestration
  • Fault domain separation vs multi tenancy
  • Rethink for runtime images and the role of the hypervisor
  • Isolation for waved release upgrades (Pulsar)

All of the above questions and KPIs have been converted into actionable items for our teams, which you may ask us about at any time.

Actions in progress

Following the immediate actions, we created a lot of issues to avoid the same behavior in the future:

  • Runtime image redesign / better hypervisor integration
  • Regionalized observability for redundancy
  • Better profiling for small VMs (see the configuration sketch after this list):
    • Vector configuration
      • buffer.max_events / buffer.max_size (must be at least 256 MB if type=disk)
      • buffer.type memory (default) → switch to disk
      • buffer.when_full block (default) → drop_newest
    • Journald verbosity and SystemMaxUse
    • Cgroups restrictions for better process control
    • I/O restrictions for VM manager
  • Better zombie VM detection
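
A minimal sketch of what such a Vector buffer section could look like, assuming a generic sink named "out" (names and sizes are illustrative, not our production values):

# vector.toml - switch the sink buffer from in-memory to disk and stop blocking
[sinks.out.buffer]
type = "disk"
max_size = 268435488      # bytes; Vector requires at least ~256 MB for disk buffers
when_full = "drop_newest" # drop new events instead of back-pressuring the source VM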

We don't face a lot of outages and big issues because, by design, we strive to do a good job of running production applications for our customers. As we grow, the challenges grow too, and we're committed to achieving a great production SLA because it's our DNA, our mission: running your production workloads.

This experience will help us become even more reliable and we thank you for your help during this difficult time.
