
Llama Stack and Why it Matters


As part of my work with Llama Stack, I regularly answer questions from co-workers, community members, and industry peers about what Llama Stack is, what value it provides, and why it’s necessary for broad enterprise adoption of generative AI. So, let’s dig into these topics.

What is Llama Stack?

Llama Stack is an open source generative AI platform. It provides the common APIs needed by generative AI applications, such as inference, agents, tool calling, embeddings, vector storage, files, safety, evaluations, and model fine-tuning. The implementations of these APIs are pluggable, allowing Llama Stack to run on a single laptop with lightweight components or as a large-scale distributed system.
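To make that concrete, here is a minimal sketch of calling the inference API from the Python client library described below. It assumes a Llama Stack server already running locally on port 8321 and a model identifier that your distribution actually serves; exact method and field names can differ between releases, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch: calling Llama Stack's inference API from Python.
# Assumes a server at http://localhost:8321 and the llama-stack-client
# package installed; adjust to match your distribution and release.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models the server knows about, backed by whichever inference
# provider(s) this distribution was configured with.
for model in client.models.list():
    print(model.identifier)

# Run a chat completion against one of those models.
# "llama3.2:3b" is a hypothetical identifier; use one printed above.
response = client.inference.chat_completion(
    model_id="llama3.2:3b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.completion_message.content)
```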

Llama Stack also provides a set of client libraries in multiple programming languages to make it easy to get started creating applications. Most of these APIs match the OpenAI API specification, which means the OpenAI SDKs and many other popular client libraries also work with Llama Stack.
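As a rough sketch of that compatibility, the snippet below points the official OpenAI Python SDK at a local Llama Stack server. The base URL path for the OpenAI-compatible endpoints and the model identifier are assumptions that depend on your distribution, so check its documentation for the exact values.

```python
# Sketch: using the OpenAI Python SDK against a Llama Stack server.
# The base_url assumes OpenAI-compatible endpoints exposed under
# /v1/openai/v1 on a local server; your deployment may differ.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",  # assumed Llama Stack endpoint
    api_key="not-needed-locally",  # the SDK requires a value even if the server ignores it
)

completion = client.chat.completions.create(
    model="llama3.2:3b",  # hypothetical model id served by your distribution
    messages=[{"role": "user", "content": "What does Llama Stack provide?"}],
)
print(completion.choices[0].message.content)
```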

And, despite the “Llama” in the name, Llama Stack works with any model that can run in vLLM, Ollama, or a number of other inference servers and services.

What value does Llama Stack provide?

Standardization on the completions and chat completions inference APIs has created a broad ecosystem of libraries, applications, and inference servers. However, generative AI applications frequently need to go beyond simple inference to deal with things like retrieval-augmented generation (RAG), tool calling, Model Context Protocol (MCP), agents, safety guardrails, and so forth. Before Llama Stack, applications that needed any of these capabilities had to use vendor-specific APIs.

With Llama Stack, these generative AI applications can use a common set of APIs that runs anywhere from a laptop to an enterprise data center to a fully-managed cloud service. And, by adopting a client-server architecture for generative AI applications, clients can run on any type of hardware with minimal resource requirements, while the servers providing these APIs can serve many clients at once for efficient utilization of costly AI accelerators.

Why is Llama Stack necessary for enterprises?

Enterprises need vendor choice and customizability to fulfill their regulatory, privacy, and budgetary needs. Anyone can write new providers for Llama Stack APIs, optionally contributing them back upstream to the Llama Stack community. This allows vendors to assemble, for each API, the set of implementations that works best with their hardware, software, and partners. Administrators can integrate these Llama Stack distributions into their authentication, authorization, management, and monitoring stacks for the control they need while still providing the APIs their developers expect.

Kubernetes is a great example of how the network effects of a common set of APIs for containerized applications led to an entirely new ecosystem for application developers, enterprise software vendors, and cloud providers.

Llama Stack is repeating this pattern for generative AI applications. Application developers can easily get started with cloud services or on their laptop and directly transfer those applications and expertise to enterprise environments. Independent software vendors (ISVs) can create generative AI applications with confidence that they will run anywhere the customer needs. Client libraries and frameworks now have a broader set of common APIs to build on top of, letting them focus on their key innovations instead of low-level concerns like supporting multiple vector stores or discovering and calling tools.

Llama Stack allows enterprises to run their own generative AI platform without having to write that platform themselves. They can get those expensive AI accelerators fully utilized by choosing their preferred Llama Stack distribution (from companies like Red Hat, Dell, and NVIDIA, or creating their own from upstream) and immediately start moving cloud AI applications into their on-premises data centers.

Want to talk about Llama Stack?

Feel free to reach out to me directly as bbrowning in the Llama Stack Discord, or via any of the other contact details listed on this site.

Want to contribute?

Have a feature request? Hit a bug? Want to help shape this transformation of enterprise AI? Join us in the Llama Stack GitHub repository.

Benjamin Browning
I’m an open source software engineer at Red Hat.