<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Reliability on Jamal Yusuf</title><link>https://jamal.dev/tags/reliability/</link><description>Recent content in Reliability on Jamal Yusuf</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://jamal.dev/tags/reliability/index.xml" rel="self" type="application/rss+xml"/><item><title>GOMAXPROCS, CPU Limits, and the Kubernetes Trap That Silently Kills Go Throughput</title><link>https://jamal.dev/writing/gomaxprocs-kubernetes-cpu-limits/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://jamal.dev/writing/gomaxprocs-kubernetes-cpu-limits/</guid><description>&lt;p&gt;There is a particular kind of production mystery that appears only in containerized Go services.&lt;/p&gt;
&lt;p&gt;Your service handles load fine on a development machine. You deploy it to Kubernetes with a sensible 500m CPU request and limit. The pod schedules happily on a 16-core node. Under moderate traffic everything looks green in your dashboards.&lt;/p&gt;
&lt;p&gt;Then real traffic arrives. Latency climbs. Throughput plateaus well below what the node should be able to deliver. CPU utilization inside the pod hovers around 40-60% of the limit, yet the process feels starved. You add more replicas. The problem follows the pods.&lt;/p&gt;</description></item><item><title>Reliability Engineering for Generative AI Platforms</title><link>https://jamal.dev/writing/reliability-engineering-generative-ai/</link><pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate><guid>https://jamal.dev/writing/reliability-engineering-generative-ai/</guid><description>&lt;p&gt;The first time an AI agent I helped put into production caused a visible incident, it did not fail dramatically.&lt;/p&gt;
&lt;p&gt;It failed quietly.&lt;/p&gt;
&lt;p&gt;A claims adjustment agent, under load, began making decisions based on slightly stale eligibility data. The downstream payment system accepted the decisions. Finance noticed the variance three days later. By then we had processed thousands of incorrect adjustments.&lt;/p&gt;
&lt;p&gt;There was no stack trace. No obvious error-rate spike. Just a slow, silent drift in the quality of the context the agent was operating on.&lt;/p&gt;</description></item></channel></rss>