What is a cold start in serverless inference and why does it impact real-time performance?
A cold start is the delay incurred when the platform has no warm instance available and must provision a new runtime container to handle an incoming request. This involves allocating the container, downloading your code, loading the ML model into memory, and initializing the inference engine. For ML workloads the full sequence typically takes 2 to 10 seconds, which far exceeds real-time latency budgets of 100-500 milliseconds. The problem is worst during traffic spikes, when incoming requests outnumber pre-warmed instances and many requests hit cold containers at once.
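One common mitigation is to cache the model at module scope so the expensive load happens only once per container: cold invocations pay the load cost, while subsequent warm invocations reuse the in-memory model. A minimal sketch of that pattern, using a hypothetical `load_model` stand-in for the real deserialization step:

```python
import time

_MODEL = None  # survives across warm invocations in the same container


def load_model():
    """Hypothetical stand-in for an expensive model load; a real handler
    would deserialize weights from disk or object storage here."""
    time.sleep(0.05)  # simulate the one-time load cost
    return lambda x: x * 2  # trivial placeholder "model"


def handler(event):
    """Serverless entry point: pay the load cost only on a cold start."""
    global _MODEL
    if _MODEL is None:          # cold start: model not yet in memory
        _MODEL = load_model()   # expensive; runs once per container
    return _MODEL(event["x"])   # warm path: inference only
```

This pattern shrinks warm-request latency but does nothing for the first request a container serves; platforms address that with features like pre-provisioned (always-warm) instances, at extra cost.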