MLOps · Intermediate · 6h

Model serving.

Wrapping a model in an API and serving predictions.

What is model serving?

Model serving is exposing a trained model behind an API so applications can send inputs and get predictions. A model sitting in a notebook helps nobody; serving turns it into a usable service with an endpoint, request handling, and a response format.
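
As a rough illustration of "request handling and a response format", the sketch below turns a JSON request body into a JSON response body. It is a sketch under assumptions, not a prescribed design: DummyModel, MODEL, handle_predict, and the "features" and "prediction" field names are all hypothetical, and the model simply sums its inputs.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a trained model: predicts the sum of the features."""

    def predict(self, features):
        return sum(features)


# Assumption: the model object already exists in memory when requests arrive.
MODEL = DummyModel()


def handle_predict(body: str) -> str:
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(body)                        # request handling: parse the input
    prediction = MODEL.predict(payload["features"])   # run the model
    return json.dumps({"prediction": prediction})     # response format: JSON out


response = handle_predict('{"features": [1.0, 2.0, 3.5]}')
print(response)  # {"prediction": 6.5}
```

A web framework would call a function like this for each request to the endpoint; the function itself does not depend on any particular framework.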

Why it matters

The whole point of a model is to make predictions in the real world, and serving is the step that delivers that value. It is also where ML meets backend engineering: latency, scaling, and reliability suddenly matter. Being able to ship a model as a service is what makes you useful beyond research.

What to learn

  • Wrapping a model in a web framework like FastAPI
  • Loading the model once at startup, not per request
  • Input validation and output formatting
  • Batch versus real-time inference
  • Latency and throughput basics
  • Health checks and readiness
  • Versioning the served model
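
To make the batch versus real-time and latency versus throughput items concrete, here is a small timing sketch. It assumes a hypothetical predict_batch function with a simulated fixed overhead per call (the 2 ms sleep is invented for illustration); real models will have different cost profiles.

```python
import time


def predict_batch(rows):
    """Hypothetical model call: fixed per-call overhead plus a tiny per-row cost."""
    time.sleep(0.002)  # simulated per-call overhead (assumption, not a measurement)
    return [sum(row) for row in rows]


rows = [[float(i), 1.0] for i in range(50)]

# Real-time style: one model call per input, so each caller gets an answer quickly.
start = time.perf_counter()
one_by_one = [predict_batch([row])[0] for row in rows]
realtime_seconds = time.perf_counter() - start

# Batch style: one model call for all inputs, so the overhead is paid once.
start = time.perf_counter()
batched = predict_batch(rows)
batch_seconds = time.perf_counter() - start

# Latency is time per request; throughput is requests completed per unit of time.
print(f"one at a time: {realtime_seconds / len(rows) * 1000:.2f} ms per row")
print(f"batched:       {len(rows) / batch_seconds:.0f} rows per second")
```

Both paths return the same predictions. The difference is when results arrive and how much total time the work takes, which is the trade-off to weigh for a given use case.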

Common pitfall

A common mistake is loading the model from disk inside the request handler, so every prediction pays the load cost again. The model should be loaded once when the service starts and reused across requests. Loading per request can turn a millisecond prediction into a multi-second one and crush throughput.
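
One way to get the load-once behaviour is to cache the loader, sketched below. The names load_model_from_disk, get_model, handle_request, and LOAD_COUNT are hypothetical, and the loader only builds a dummy object; a real loader would deserialize weights from a file.

```python
import functools

LOAD_COUNT = 0  # counts how many times the (pretend) slow load happens


class DummyModel:
    """Hypothetical stand-in for a trained model."""

    def predict(self, features):
        return sum(features)


def load_model_from_disk():
    """Hypothetical loader; a real one would read and deserialize a model file."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return DummyModel()


@functools.lru_cache(maxsize=1)
def get_model():
    """The first call loads the model; later calls return the cached object."""
    return load_model_from_disk()


def handle_request(features):
    # The handler asks for the cached model instead of loading it itself.
    return get_model().predict(features)


get_model()  # warm up once at service startup
results = [handle_request([i, i]) for i in range(1000)]
print(LOAD_COUNT)  # 1
```

Calling load_model_from_disk() directly inside handle_request would push the counter to 1000 here, which is the pitfall described above.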

Practice

Wrap a trained model in a FastAPI service: load it once at startup, expose a predict endpoint that validates input and returns formatted output, and add a health check. Containerize it with the Docker skills from the previous node. Done when the model loads exactly once at startup and repeated requests are served without reloading it.
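
Before picking up the framework, it can help to sketch the pieces of the exercise as plain Python. The sketch below is framework-agnostic and uses only the standard library; ModelService, PredictRequest, DummyModel, and the response field names are hypothetical. In a FastAPI app each method would typically sit behind a route, and validation is usually done with Pydantic models rather than a hand-written dataclass.

```python
from dataclasses import dataclass


class DummyModel:
    """Hypothetical stand-in for a trained model."""

    def predict(self, features):
        return sum(features)


@dataclass
class PredictRequest:
    features: list

    def __post_init__(self):
        # Input validation: reject anything that is not a non-empty list of numbers.
        if not isinstance(self.features, list) or not self.features:
            raise ValueError("features must be a non-empty list")
        for value in self.features:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("features must contain only numbers")


class ModelService:
    def __init__(self, version: str):
        self.version = version
        self.model = None  # not loaded until startup() runs

    def startup(self):
        """Load the model once; a real service would read it from disk here."""
        self.model = DummyModel()

    def health(self):
        """Liveness: the process is up, whether or not the model is loaded."""
        return {"status": "ok"}

    def ready(self):
        """Readiness: the service can answer predictions only after startup()."""
        return {"ready": self.model is not None, "model_version": self.version}

    def predict(self, payload: dict):
        if self.model is None:
            raise RuntimeError("model not loaded")
        request = PredictRequest(features=payload.get("features"))
        return {
            "prediction": self.model.predict(request.features),
            "model_version": self.version,  # lets callers see which model answered
        }


service = ModelService(version="v1")
print(service.ready())    # {'ready': False, 'model_version': 'v1'}
service.startup()
print(service.predict({"features": [1, 2, 3]}))  # {'prediction': 6, 'model_version': 'v1'}
```

Returning the model version with every prediction is one simple way to make the served version visible to callers; other versioning schemes are equally valid.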

Outcomes

  • Serve a model behind a validated API endpoint.
  • Load the model once at startup, not per request.
  • Choose batch or real-time inference for the use case.
  • Add health checks and version the served model.