Everything about the AI Engineering Stack

1. The seven-layer model

The LLM dossier introduced a four-layer model (Model, Inference, Context/Identity, Platform/Tool) for understanding what happens when you type a message into a chat window. That model works well for its purpose.

This document expands to seven layers because we're mapping providers across the full stack, and the four-layer model bundles things together that providers treat as separate businesses. Connectivity (MCP, function calling) is a different product team from Memory (persistent context) at every major company. Agents have become their own category with dedicated frameworks, SDKs, and products. And infrastructure (chips, data centers, compute) determines who can train frontier models and who depends on someone else's hardware.

The seven layers used in this document. Provides a vocabulary for comparing providers.

The layers are not strictly sequential: an agent (Layer 6) uses tools (Layer 4) and memory (Layer 5) while running on a model (Layer 2) served by an inference engine (Layer 3). But the layering helps clarify what each provider actually builds and where they depend on someone else.

2. The providers

2.1 Google / Alphabet

Google has something at every single layer. No other company comes close to this breadth.

L1 Infrastructure: Designs its own TPUs (Tensor Processing Units), currently on TPU v6e (Trillium). Operates massive data centers. Google Cloud Platform (GCP) is one of the three major cloud providers. Does not depend on Nvidia for training; one of only two companies (alongside Apple for on-device) that designs its own AI silicon at scale.

L2 Models: Both closed and open-weight.

Closed: Gemini 3.x family (Gemini 3 Pro, Gemini 3 Flash). Powers the Gemini app and Google Workspace integration. Also now powering Apple's Siri overhaul under a multi-year deal reportedly worth ~$1B/year.
Open-weight: Gemma 4 family. Sizes from 2B to 27B+. Gemma 4 includes edge-optimized variants (E2B, E4B) that run on IoT devices and Raspberry Pi.
Specialized: PaLM family (legacy), Gemini Nano (on-device), various research models (DINOv3, SAM-related).

L3 Inference: Vertex AI (cloud platform for serving models), Google AI Studio (developer playground), LiteRT-LM for on-device inference on edge hardware. Gemma models run on all major local runtimes (Ollama, vLLM, LM Studio, llama.cpp). Google's TurboQuant algorithm (March 2026) enables 8x speedup in AI memory operations, cutting serving costs significantly.

L4 Connectivity: Supports MCP (joined the Agentic AI Foundation). Also created A2A (Agent-to-Agent protocol) for inter-agent communication. Gemini has native function calling/tool use. Google Workspace integration gives Gemini access to Gmail, Calendar, Drive, Docs natively.

L5 Memory: Gemini has built-in conversation memory with the ability to import memories from other AI services. On the developer side, Vertex AI Agent Engine includes Memory Bank, a managed service with sessions (short-term) and persistent memory bank (long-term). This is the most complete managed memory infrastructure any cloud provider offers.

L6 Agents: Agent Development Kit (ADK) is Google's agent framework. Vertex AI Agent Engine provides managed agent runtime. Google also created A2A for multi-agent communication. Gemini has agentic capabilities within Google Workspace (summarize email threads, schedule across calendars, query Drive).

L7 Applications: Gemini app (standalone), Gemini in Google Workspace, Google AI Studio (developer), and the Gemini Enterprise Agent Platform (announced from the Las Vegas stage at Google Cloud Next '26 on April 22, 2026, GA same day). The platform is the direct evolution of Vertex AI; all Vertex AI services and roadmap evolutions now ship exclusively through the Agent Platform. Combines model selection (200+ models in Model Garden including Gemini 3.1 Pro / Flash Image, Lyria 3, Gemma 4 open-weight, Claude Opus 4.7 / Sonnet / Haiku, plus open-source models), agent building, DevOps, orchestration, governance, optimization, and security. Google announced a $750M innovation fund for partners building agents on the platform. Plus Gemini powering Siri for Apple's 2B+ device install base.

The picture: Google is the only provider that designs its own chips, trains its own closed and open-weight models, runs its own inference infrastructure, has its own connectivity protocols, ships managed memory, has its own agent framework, and serves billions of end users across its own and partner (Apple) surfaces. The weakness: developer mindshare. Despite having everything, developers often default to OpenAI or Anthropic APIs and use Google's open-weight models (Gemma) on other platforms.

2.2 Microsoft

Microsoft doesn't make the frontier model. It owns the distribution and enterprise surface.

L1 Infrastructure: Azure is one of the three major cloud providers. Does not design its own AI chips (depends on Nvidia GPUs). Capex guidance for calendar 2026 is roughly $190B (including ~$25B impact from higher component pricing), a substantial step up from the earlier projections. Fairwater (Mount Pleasant, Wisconsin) is the flagship AI-first datacenter: $7.3B investment, contiguous floorplates designed to behave as a single supercomputer, very high GPU density, liquid cooling at scale; came online ahead of schedule and recognized revenue earlier than planned. Total AI capacity grew 80% in fiscal 2026, with plans to double the data center footprint within two years. Provides the compute infrastructure for OpenAI's training runs.

L2 Models: Open-weight only for own models.

Open-weight: Phi family (Phi-4, Phi-4-mini, Phi Silica for on-device). Small language models designed for efficiency, not frontier capability.
Closed models via partnerships: Hosts OpenAI models (GPT-5 family) and Anthropic models (Claude) through Azure. Microsoft's Copilot products use a mix of OpenAI models under the hood, with Anthropic models available in M365 Copilot as of March 2026.

L3 Inference: Microsoft Foundry (cloud inference platform), Foundry Local (on-device inference for Copilot+ PCs), Azure OpenAI Service, Azure AI Model Catalog. Windows AI APIs provide developer access to on-device inference via NPU acceleration on AMD/Qualcomm Copilot+ hardware.

L4 Connectivity: Adopted MCP directly in product UIs: Copilot Studio, Security Copilot, Foundry, VS Code, M365 agents. Azure API Center serves as an enterprise MCP registry. Microsoft is a supporting member of the Agentic AI Foundation (AAIF). Also has its own declarative agent framework for M365 with the Agents Toolkit in VS Code.

L5 Memory: Two distinct memory systems.

M365 Copilot Memory (GA July 2025): intent-based storage of user preferences across Word, Excel, PowerPoint, Outlook, Teams. Does not yet work with custom agents.
GitHub Copilot Memory (public preview, March 2026, default-on for Pro/Pro+): agentic, repository-scoped memory where Copilot agents automatically discover and store insights about codebases. Cross-agent (code review learns something, cloud agent uses it later). 28-day expiry with renewal on validation. User-level preferences added May 15, 2026 (early access for Pro and Pro+). The most interesting memory implementation from any major provider.
Also published memory poisoning research (Feb 2026) identifying the attack surface persistent memory creates (MITRE ATLAS AML.T0080).

L6 Agents: Copilot Studio (no-code/low-code agent builder), Agent 365 (GA May 1, 2026; manages agent lifecycle in enterprise; $15/user/month standalone), Copilot Cowork (GA May 1, 2026; runs on Anthropic's agentic harness; bundled with the M365 E7 SKU at $99/user/month). Microsoft Agent Framework 1.0 (GA April 3, 2026; merger of AutoGen + Semantic Kernel; .NET and Python; MCP-native).

L7 Applications: Microsoft 365 Copilot (enterprise), GitHub Copilot (developers), Bing Chat / Copilot (consumer), Windows Copilot, Edge Copilot. The widest enterprise distribution surface of any AI company. 400M+ Microsoft 365 users.

The picture: Microsoft's strategy is distribution, not model creation. They let OpenAI and Anthropic build the brains, then wrap them in enterprise surfaces (M365, GitHub, Azure) that reach more users than any competitor. The risk: dependency on other companies' models. The strength: nobody else has the enterprise distribution to make AI the default tool for 400M knowledge workers.

2.3 Meta

Meta is the open-weight champion that recently pivoted to closed-weight for its frontier model.

L1 Infrastructure: Building the Hyperion data center. Spending $115-135B in AI capex for 2026, nearly double 2025. Uses Nvidia GPUs (purchased 550,000+ GB200/GB300 units for training). Created Meta Compute as a top-level organization reporting directly to Zuckerberg.

L2 Models: Both closed and open-weight (new split in 2026).

Closed: Muse Spark (announced April 8, 2026), the first model from Meta Superintelligence Labs (led by Alexandr Wang). Originally code-named "Avocado" internally. Natively multimodal with tool use, visual chain of thought, and multi-agent orchestration. Deliberately closed-source, a pivot from the Llama open strategy. Powers the Meta AI app and meta.ai web today, with rollout to Facebook, Instagram, and WhatsApp planned. May 12, 2026 update added voice conversation to the Meta AI app (interrupt, switch topics, swap languages mid-stream, with inline image generation and Reels / Maps surfacing). The companion model originally codenamed "Mango" (next-generation image and video generation) is still upcoming in 2026.
Open-weight: Llama family (Llama 4 series). Remains the most widely deployed open-weight model family globally. Apache 2.0 license.
Also: behavioral foundation models for embodied humanoid virtual agents (research).

L3 Inference: Models hosted on Meta's own infrastructure for Meta AI. Llama runs everywhere (Ollama, vLLM, HuggingFace, AWS Bedrock, Azure). No standalone inference product for developers comparable to Vertex AI or Azure OpenAI.

L4 Connectivity: Supporting member of the Agentic AI Foundation (AAIF) for MCP. Meta AI has native tools (search, image generation). No standalone connectivity protocol or developer platform for tool integration comparable to what Google, Microsoft, or Anthropic offer.

L5 Memory: Acquired Limitless (the memory pendant/wearable company) in December 2025. Limitless's non-pendant features (screen, audio capture) were discontinued post-acquisition. Meta's built-in context includes 15+ years of Facebook behavioral data. No standalone memory framework or developer API for persistent memory.

L6 Agents: Muse Spark supports multi-agent orchestration (launching subagents in parallel). No standalone agent framework or SDK for external developers comparable to OpenAI Agents SDK or Google ADK. Agent capabilities are embedded in the Meta AI product, not exposed as a platform.

L7 Applications: Meta AI app and website, Meta AI in WhatsApp, Instagram, Facebook, Messenger. Ray-Ban Meta AI glasses (voice + vision). 3B+ users across Meta platforms. Targeting "personal superintelligence" as the product vision.

The picture: Meta has the most users (3B+), the most widely used open-weight models (Llama), and a clear vision ("personal superintelligence"). But the developer platform story is thin compared to Google, Microsoft, or Anthropic. Meta builds for its own surfaces first, not for external developers. The Limitless acquisition signals serious intent on the memory/wearable front, but nothing has shipped yet from the integration.

2.4 OpenAI

The company that started the current AI era. Strong at model and application layers, thinner in the middle.

L1 Infrastructure: Depends on Microsoft Azure for primary compute. Building the Stargate project (joint with SoftBank, targeting 500,000 GPUs). No own chip design.

L2 Models: Primarily closed, with a recent open-weight entry.

Closed: GPT-5 family (5, 5.1, 5.2, 5.4, 5.5). GPT-5.5 (released April 23, 2026) and GPT-5.5 Instant (default ChatGPT model since May 5, 2026) are the current frontier. The most widely recognized AI brand.
Open-weight: gpt-oss family (launched 2025). OpenAI's entry into open-weight, positioned as competitive but not the primary focus.
GPT-6 unannounced as of mid-May 2026. OpenAI is leaning into frequent point releases rather than another big-number leap.

L3 Inference: OpenAI API (the original LLM API that established per-token pricing as the standard). Also available via Azure OpenAI Service. No local/on-device inference story of their own.

L4 Connectivity: Has its own function calling / tool use protocol. Was slow to adopt MCP but is now a co-founder of the Agentic AI Foundation (AAIF). The Responses API supports tool calling. OpenAI's approach is tools-as-first-party-features (web search, code execution, file search built into the API) rather than a protocol for arbitrary third-party tools.

L5 Memory: ChatGPT memory: "Saved Memories" (explicit "remember this") plus "Chat History" (implicit references, Plus/Pro tiers). GPT-5.5 added enhanced personalization from past chats, files, and connected Gmail; rolling out to Plus and Pro on web, expanding to Free/Go/Business/Enterprise.

L6 Agents: OpenAI Agents SDK (Python, open-source). Codex (autonomous coding agent, 3M+ weekly developers; Background Computer Use on macOS; ChatGPT mobile preview May 2026). ChatGPT Agent Mode (merged Operator + Deep Research, browser and research). Multi-agent parallel execution.

L7 Applications: ChatGPT (the most recognized AI consumer product globally), ChatGPT Enterprise/Team, Codex, Sora (video generation). ChatGPT Go tier (lower cost) launched in 2026.

The picture: OpenAI's strength is brand and API distribution. ChatGPT is synonymous with "AI" for most consumers. The weakness: infrastructure dependence (Microsoft compute), no own connectivity protocol (adopted MCP late), and a thinner enterprise governance story than Microsoft, Google, or AWS.

2.5 Anthropic

Created the connectivity standard (MCP). Strong model, thin stack.

L1 Infrastructure: No own compute. Uses AWS (primary) and Google Cloud. No chip design.

L2 Models: Closed-weight only. Claude family: Claude Opus 4.7 (flagship, strongest on tool-augmented reasoning), Claude Sonnet 4.6, Claude Haiku 4.5. Claude Mythos Preview (limited release, 93.9% SWE-Bench Verified). Claude Sonnet 5 (92.4% SWE-Bench Verified). No open-weight models.

L3 Inference: Anthropic API (direct). Available via AWS Bedrock, Google Vertex AI, and Microsoft Foundry. No local/on-device inference.

L4 Connectivity: Created MCP (Model Context Protocol), the industry standard. Donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation in December 2025, co-founded with Block and OpenAI. Supporting members include Google, Microsoft, AWS. MCP has ~97M monthly SDK downloads and 9,400+ servers in the official registry (11,840+ on PulseMCP, 21,000+ on Glama) as of May 2026. Anthropic's single most impactful contribution to the ecosystem beyond Claude itself.

L5 Memory: Claude's built-in memory in claude.ai (user memory edits, conversation-derived memory, project instructions). Claude Code's Auto Dream sleep-consolidation system (4 phases, 24-hour-and-5-session trigger, manual /dream command). The ZenBrain paper (April 2026) cites Auto Dream as independent validation of neuroscience-inspired memory consolidation in production.

L6 Agents: Claude Code (terminal-based coding agent, ~4% of public GitHub commits as of early 2026). Claude Agent SDK. Cowork (agentic harness; ships inside Microsoft 365 Copilot as Copilot Cowork, GA May 1, 2026). Claude Managed Agents (public beta April 8, 2026). Claude Code Agent View (multi-session CLI management, May 2026). Anthropic's agent strategy is "through partners' distribution": their agent technology runs inside Microsoft's products.

L7 Applications: claude.ai (web/mobile/desktop), Claude for Enterprise/Team, Claude Code, Claude for Small Business (May 13, 2026). Powers features in many third-party products (Cursor, Windsurf).

The picture: Anthropic punches above its weight through protocol leadership (MCP) and model quality (Claude's tool-augmented reasoning). The strategic weakness is the same as OpenAI's: no own infrastructure, no open-weight models, dependent on AWS/GCP. The strength: MCP gives Anthropic influence over the connectivity layer that no single model improvement could match.

2.6 Amazon / AWS

Infrastructure and hosting. Not a model company.

L1 Infrastructure: One of the three major cloud providers. Designs own AI chips: Trainium for training, Inferentia for inference, Graviton for general-purpose Arm CPUs. Trainium3 (first 3nm AI chip from AWS, GA early 2026) is nearly fully subscribed and targets token-heavy reasoning agents and video generation. Graviton5 (launched December 2025) has 192 cores and increased cache; Meta signed a multibillion-dollar deal for Graviton5 to back its $135B 2026 capex. Anthropic trains Claude on Trainium. Massive global data center footprint.

L2 Models: No frontier LLM of their own. Amazon Nova family exists but is not competitive at the frontier. AWS Bedrock hosts other companies' models (Claude, Llama, Mistral, Cohere) as a model marketplace. Granular cost attribution for Bedrock launched April 2026.

L3 Inference: AWS Bedrock (managed inference for hosted models), SageMaker (ML platform), Lambda (serverless), EC2 instances with GPU/Trainium. The infrastructure play: you bring any model, AWS runs it.

L4 Connectivity: Supporting member of the Agentic AI Foundation. Clare Liguori (AWS Senior Principal Engineer) is a Core Maintainer of MCP. AWS Kiro IDE supports MCP. AgentCore Gateway wraps Lambda functions and APIs as MCP tools, one of the cleanest "API to MCP" patterns in any cloud.

L5 Memory: AgentCore Memory provides managed memory primitives: semantic, user preference, summary strategies. Short-term (within session) and long-term (across sessions). Memory branching for multi-agent systems. Bedrock Knowledge Bases handles managed RAG. OpenSearch Serverless provides vector search for custom RAG. Neptune graph database supports graph-based memory (Mem0 compatible).

L6 Agents: AgentCore is the agent platform with nine modular services: Runtime (microVM isolation, MCP/A2A/AG-UI support, S3/EFS mounts), Gateway (API/Lambda to MCP), Identity (Okta/Entra/Cognito, OBO token exchange GA April 30 2026), Memory, Observability (OpenTelemetry to CloudWatch), Evaluations (13 evaluators, A/B testing preview April 30 2026), Policy, Browser (OS-level interaction added April 8 2026), Code Interpreter, and Payments (preview May 7 2026, x402 protocol, Coinbase + Stripe). Managed harness (preview April 22 2026). AWS GovCloud (US-West) availability May 2026. Strands Agents is the open-source framework (Python + TypeScript, used internally by Amazon Q, AWS Glue, Kiro). Kiro Powers bundle MCP servers with steering files for on-demand loading. AWS Agent Registry (preview April 9 2026).

L7 Applications: Amazon Q (enterprise assistant for AWS customers). Alexa+ ($19.99/month for AI-powered Alexa). Neither is a major consumer AI surface compared to ChatGPT or Gemini.

The picture: AWS is the landlord, not the tenant. They don't build the brains; they rent the space where brains run. Their agent infrastructure (AgentCore, Strands) is the most modular of any provider: ten independent services vs Google's more opinionated stack or Microsoft's deeply integrated M365 approach. The April-May 2026 release velocity has been remarkable.

2.7 xAI

Fast-moving infrastructure-first play built around the X/Twitter data advantage.

L1 Infrastructure: Colossus supercluster in Memphis, Tennessee, described as the world's largest single-site AI training installation. Expanding to 2GW (Colossus 2). Purchased 550,000 GB200/GB300 GPUs from Nvidia. SpaceX recently acquired xAI, tying the infrastructure story to a broader Musk ecosystem.

L2 Models: Closed-weight only.

Grok 4.3 (released May 6, 2026 via API): xAI's fastest and most intelligent model to date. 1M-token context, native video input, reasoning, function calling, structured outputs, prompt caching. Pricing $1.25/M input, $2.50/M output (a ~40% cut from prior frontier models). Tops Artificial Analysis leaderboards in agentic tool calling and instruction following. Eight legacy Grok models retire May 15, 2026 and redirect to grok-4.3.
Grok 4.20 introduced multi-agent architecture: 4-agent system (Grok, Harper, Benjamin, Lucas) and 16-agent Heavy variant.
grok-code-fast-1: purpose-built coding model.
Grok Voice: voice agent with tool calling, real-time data access, deployed in Tesla vehicles.
Aggressively priced: Grok 4.1 Fast at $0.20/M input tokens undercuts every frontier competitor.

L3 Inference: xAI API (OpenAI-compatible format, making it a drop-in replacement in many toolchains). grok.com web interface. Available in Cursor, Cline, and other IDE tools.

L4 Connectivity: Supports Remote MCP tools via API. Server-side tools include Web Search, X Search (unique: real-time access to X/Twitter data), Code Execution, File Search. The X Search capability is structural: no other model has native access to a live social media firehose.

L5 Memory: Grok on grok.com has conversational memory within sessions. Built-in RAG (Collections API) for uploaded document stores. No persistent cross-session memory as a shipped product. Grok 5 is expected to include persistent memory across sessions.

L6 Agents: Grok 4.20 is the first production multi-agent system from a frontier model provider. 4 or 16 specialized agents that coordinate on complex tasks. grok-code-fast-1 for agentic coding. Voice Agent API for voice-driven agentic workflows. No standalone agent framework/SDK for developers to build their own agents.

L7 Applications: Grok on X (embedded in the X/Twitter platform, ~600M monthly users), standalone Grok app (mobile), grok.com web, Grok in Tesla vehicles (voice). SuperGrok ($30/month), Grok Business ($30/seat/month), SuperGrok Heavy tier. Grok for Government announced.

The picture: xAI moves faster than anyone. Grok went from nonexistent to frontier-competitive in under two years. The X data firehose is a structural moat no competitor can replicate. Multi-agent architecture shipped in production before anyone else. The weakness: developer platform depth. No agent SDK, no memory framework, thin tooling ecosystem.

2.8 Alibaba / Qwen

The most complete open-weight stack in the world.

L1 Infrastructure: Alibaba Cloud, the largest cloud provider in Asia. Own data centers. Uses Nvidia GPUs.

L2 Models: Open-weight only (Apache 2.0), with hosted closed variants.

Qwen 3.6-27B (released April 22, 2026 on HuggingFace and ModelScope under Apache 2.0): 27B dense model with native multimodal support (text, image, video) and a 262,144-token context window. Scores 77.2 on SWE-Bench Verified, outperforming the 397B Qwen 3.5 model on every major coding benchmark despite being 1/15 the parameter count. Hybrid thinking/non-thinking modes in a single unified checkpoint. Novel "Thinking Preservation" mechanism retains reasoning traces across turns.
Qwen 3.5 (February 2026): 397B-A17B MoE flagship. Native multimodal (text, image, video, audio). 1M token context on hosted Plus version. Supports 201 languages.
Qwen 3.6-35B-A3B: MoE model with only 3B active parameters, runs efficiently on consumer hardware.
Full size range from 0.6B to 397B, covering edge devices to data center.
Qwen-VL (vision-language), Qwen-Audio, Qwen3-Omni (all modalities).

L3 Inference: Alibaba Cloud Model Studio (managed hosting), DashScope API, together.ai. Full support for Ollama, vLLM, LM Studio, llama.cpp, SGLang, MLX. Qwen models are first-class citizens on HuggingFace.

L4 Connectivity: Native MCP support in Qwen 3 and later. Structured function calling optimized for multi-tool orchestration. Qwen-Agent framework handles MCP integration.

L5 Memory: Qwen-Agent framework includes memory capabilities. The "Thinking Preservation" feature in Qwen 3.6 retains reasoning context across conversation history, a form of working memory within sessions.

L6 Agents: Qwen-Agent (open-source agent framework with tool use, planning, and memory). Qwen Code (open-source terminal coding agent, like Claude Code but for Qwen models). Agentic coding is the primary focus of Qwen 3.6. 59.3 on Terminal-Bench 2.0, matching Claude 4.5 Opus.

L7 Applications: Qwen Chat (qwen.ai), Alibaba Cloud Model Studio. QwenPaw (open-source personal assistant with multi-channel support: DingTalk, WeChat, Discord, Telegram). BMW China integration. Primary market is Asia and global developer community.

The picture: Alibaba is building the open-weight version of what Google does across the full stack. The Qwen model family is arguably the most complete open-weight ecosystem: covering edge to data center, text to omni-modal, chat to agentic coding. For anyone building locally on their own hardware, Qwen is the primary alternative to Google's Gemma. The weakness: limited brand recognition outside Asia and the developer community.

2.9 Apple

The on-device play. Licensing the brain, owning the body.

Apple's three-tier inference architecture. The Gemini license is a bridge; the data path stays inside Apple.

L1 Infrastructure: Designs its own silicon (A-series, M-series). Apple Silicon (M5 Pro, M5 Max, M5 Ultra) is the most capable consumer hardware for local AI inference. Private Cloud Compute (PCC): dedicated Apple data centers running custom M5-series silicon in a stateless, encrypted environment. PCC is unique: even third-party model weights (Google's Gemini) run on Apple's own hardware, so user data never reaches Google.

L2 Models: Own small models (Apple Foundation Models, ~3B parameters, on-device). Licensing Google's Gemini for heavy reasoning via a multi-year deal worth ~$1B/year. Also developing Ferret-3 (own larger models targeting 2027 deployment).

L3 Inference: Apple CoreML and Apple Neural Engine (ANE) for on-device inference. Private Cloud Compute for cloud inference. No developer-facing inference API comparable to OpenAI API or Anthropic API.

L4 Connectivity: Siri integration with apps via App Intents framework. No MCP adoption announced. Apple's connectivity model is proprietary and tightly controlled within the iOS/macOS ecosystem.

L5 Memory: The upgraded Siri (2026) includes conversational memory: remembering past interactions and connecting information over time. "Personal Context" index that understands preferences, relationships, and history. On-device processing for privacy. This creates significant vendor lock-in: your Personal Context index becomes the reason you can't switch to Android.

L6 Agents: The 2026 Siri overhaul moves from voice assistant to "system orchestrator": on-screen awareness, multi-step task execution across apps, agentic capabilities within the OS. Ferret-UI Lite is Apple's on-device GUI agent model, designed to let Siri see and control iPhone apps directly (works across mobile, web, and desktop). As of mid-May 2026, the upgraded Siri with personal context and on-screen understanding has not shipped yet; the most likely launch venue is WWDC 2026 (June), with iOS 27 as the carrier. Not a developer-facing agent framework. Apple's agents are consumer product features, not developer tools.

L7 Applications: Siri (2B+ devices), Apple Intelligence features across iOS/iPadOS/macOS, Writing Tools, Visual Intelligence. The largest device install base of any AI company. "Apple Intelligence Pro" subscription ($15/month) for advanced agentic Siri capabilities.

The picture: Apple's play is the opposite of everyone else's. They don't compete on model quality or developer tools. They compete on device distribution and privacy. 2B+ devices, on-device processing, data never leaves the user's hardware (or Apple's hardware). The Gemini licensing deal is a "bridge strategy": buying time while Apple develops its own larger models.

2.10 NVIDIA

Every other provider on this page runs on NVIDIA hardware. For most of this document's history, that made NVIDIA a component supplier, not a platform company. That changed at GTC Taipei on June 1, 2026, when Jensen Huang revealed a full-stack play spanning silicon to agents to consumer PCs. NVIDIA now competes at almost every layer.

L1 Infrastructure: The dominant position. NVIDIA designs the GPUs that train and run nearly every frontier model in the world. Current generation: Blackwell (B200, GB200, GB300). Next generation: Vera Rubin NVL72, a rack-scale AI supercomputer connecting 36 Vera CPUs and 72 Rubin GPUs, delivering 10x inference performance per watt over Blackwell. For personal computing: RTX Spark, NVIDIA's first laptop/desktop chip (ARM-based, co-designed with MediaTek, up to 1 petaFLOP AI performance, 128GB unified memory). DGX Spark and DGX Station bring data-center-class AI to the desk. Three-generation roadmap committed: Blackwell, then Rubin, then Rosa Feynman. TSMC manufactures the silicon; 150 supply chain partners in Taiwan build the systems.

L2 Models: All open-weight. Nemotron 3 Ultra (550B MoE, 55B active parameters) for long-running autonomous agents. Cosmos 3, an "omnimodel" for Physical AI with native vision reasoning, world generation, and action generation, in Super (32B) and Nano (8B) sizes. Canary Qwen 2.5B (top of HuggingFace ASR leaderboard). Parakeet (streaming speech-to-text). These are not frontier chat models competing with GPT or Claude; they are specialized models for agents, robotics, and perception.

L3 Inference: The dominant optimization layer. CUDA is the programming model everything runs on. TensorRT and TensorRT-LLM optimize model serving. NIM (NVIDIA Inference Microservices) packages models as deployable containers. NeMo provides the training and fine-tuning framework. Most third-party inference solutions (vLLM, llama.cpp, Ollama) ultimately depend on CUDA.

L4 Connectivity: Agent Toolkit provides the orchestration framework. OpenShell is a secure runtime for autonomous agents with governance built in. Supports MCP. Not a protocol creator (unlike Anthropic with MCP or Google with A2A), but provides the infrastructure that protocols run on.

L5 Memory: No memory product. The one layer where NVIDIA has no offering.

L6 Agents: NemoClaw is the agent governance framework, providing guardrails for autonomous agent behavior. OpenClaw is the open-source agent runtime. The Agent Toolkit bundles models, tools, and governance into a deployable agent stack. All announced or expanded at GTC Taipei 2026. NVIDIA positions these as the safety layer that makes autonomous agents enterprise-ready.

L7 Applications: Not consumer chat apps. Instead: RTX Spark laptops and desktops (fall 2026, from ASUS, Dell, HP, Lenovo, Microsoft Surface, MSI), DGX Spark/Station (developer AI workstations), Jetson Thor (edge AI and robotics), Isaac (robotics simulation and training platform), Alpamayo 2 (Level 4 robotaxi platform). The RTX Spark announcement represents NVIDIA entering the consumer PC chip market for the first time, directly competing with Intel, AMD, Qualcomm, and Apple.

The unique position: NVIDIA is the only company that designs the compute hardware everyone else depends on AND is moving up the stack into models, agents, and applications. The risk for every other provider: NVIDIA could vertically integrate at any point. The opportunity: NVIDIA's open-weight approach to models and agent frameworks means the ecosystem builds on top rather than competing with a closed platform.

2.11 Other notable players

Mistral (Paris, France): Open-weight models (Mistral Large, Mistral Small, Codestral). Strong in Europe. Available via API (La Plateforme) and locally. Supports MCP. No agent framework, no memory product. The European alternative for organizations that need EU-headquartered AI.

DeepSeek (Hangzhou, China): Open-weight models (DeepSeek V3, V3.2, R1 for reasoning, V4 released April 24 2026 with 80.6 SWE-Bench Verified and 1M context). Known for extremely efficient training, achieved frontier-competitive performance at a fraction of the cost of Western competitors. Pure model play.

Moonshot AI (China): Kimi K2.6 (April 2026), 42B active / 1T total MoE, MIT license. First non-Western model to reach Tier A in coding benchmarks.

Samsung: On-device AI on Galaxy devices. Gauss models (own). Not a developer platform player.

3. The connective tissue: infrastructure-layer players

These companies don't make models. They make the middle layers work. Without them, the gap between "a model exists" and "I can use it" would be enormous.

The shared layer that lets open-weight models from any provider run anywhere.

3.1 HuggingFace

Spans Layers 2, 3, and parts of 6.

L2 (Models): The GitHub of AI models. HuggingFace Hub hosts 1M+ models. Every open-weight model from every provider (Llama, Gemma, Qwen, Mistral, DeepSeek, Phi) is distributed through HuggingFace. The Transformers library is the standard for loading and using models in Python.

L3 (Inference): HuggingFace Inference Endpoints (managed hosting for any model on the Hub). Inference API for quick testing. Spaces (hosted demos/apps). Text Generation Inference (TGI) for production serving.

Broader role: Community hub, paper discussion, dataset hosting, model evaluation leaderboards. HuggingFace is not a competitor to any model provider; it's the distribution infrastructure that makes the open-weight ecosystem function.

3.2 Ollama

Layer 3 only, but indispensable.

The simplest way to run open-weight models locally. ollama pull gemma4:27b and you have a model running on your machine. Handles model downloads, quantization, and serving via a local API (OpenAI-compatible format). Runs on Mac, Linux, Windows. Library is now at 4,500+ models as of May 2026.

3.3 LM Studio

Layer 3 + Layer 7. Desktop GUI for discovering, downloading, and running local models. More visual than Ollama. Includes a chat interface. Cross-platform. Good for people who want a ChatGPT-like experience over local models without command-line work.

3.4 Msty

Layer 3 + Layer 5 + Layer 7. Desktop app with chat UI, multi-model comparison, Knowledge Stacks (built-in RAG, a basic form of Layer 5 memory), MCP Toolboxes (Layer 4 connectivity), and Agent Mode. The most feature-complete local AI desktop app. Free desktop tier with Pro features.

3.5 vLLM

Layer 3 only. High-performance inference engine for production deployments. Significantly faster than Ollama for high-throughput serving. Used by companies running local models at scale. Not end-user-facing.

4. Summary: who has what

Provider coverage across the seven layers. Google and Microsoft are full across all seven. NVIDIA covers six of seven, missing only memory, but dominates the compute and inference layers everyone else depends on.

Provider	L1 Compute	L2 Models	L3 Inference	L4 Connectivity	L5 Memory	L6 Agents	L7 Applications
Google	Own (TPU)	Closed + Open	Cloud + Edge	MCP + A2A	Gemini Memory + Memory Bank	ADK + Vertex Agent Engine	Gemini app, Workspace, Siri
Microsoft	Azure (Nvidia)	Open (Phi) + Partners	Foundry, Foundry Local	MCP adopted	M365 Memory + GitHub Memory	Copilot Studio, Agent 365, Cowork	M365 Copilot, GitHub Copilot
Meta	Own DC (Nvidia)	Closed (Muse) + Open (Llama)	Own infra	MCP (AAIF member)	Acquired Limitless	Multi-agent in Muse Spark	Meta AI, WhatsApp, IG, glasses
OpenAI	Microsoft Azure	Closed (GPT) + Open (gpt-oss)	OpenAI API	Function calling + AAIF	ChatGPT Memory (+GPT-5.5 personalization)	Agents SDK, Codex, ChatGPT Agent	ChatGPT
Anthropic	AWS + GCP	Closed only (Claude)	Anthropic API	Created MCP	Claude memory + Auto Dream	Claude Code, Agent SDK, Cowork	claude.ai
AWS	Own (Trainium)	No frontier model	Bedrock, SageMaker	MCP (core maintainer)	AgentCore Memory + Bedrock KB	AgentCore, Strands, Kiro	Amazon Q, Alexa+
xAI	Own (Colossus)	Closed (Grok)	xAI API	MCP + X Search	Basic / in development	Multi-agent (4/16 agents)	Grok on X, app, Tesla
Alibaba	Alibaba Cloud	Open only (Qwen)	Ali Cloud + everywhere	Native MCP	Qwen-Agent memory	Qwen-Agent, Qwen Code	Qwen Chat, QwenPaw
Apple	Own (Apple Silicon)	Small own + licensed Gemini	CoreML, PCC	Proprietary (App Intents)	Siri Personal Context	Siri system orchestrator	Siri, Apple Intelligence
NVIDIA	Own (GPUs, RTX Spark)	Open (Nemotron, Cosmos)	TensorRT, NIM, CUDA	Agent Toolkit, OpenShell	None	NemoClaw, OpenClaw	RTX Spark PCs, DGX, Jetson

5. What this map reveals

Nobody owns the full stack alone. Google comes closest. Everyone else has gaps. Microsoft has no frontier model. OpenAI has no infrastructure. Anthropic has no open-weight model. Meta has no developer platform. Apple has no developer-facing AI tools. NVIDIA covers six of seven layers but has no memory product. Every company depends on at least one other company for a critical layer.

NVIDIA is the foundation everyone else builds on. This was always true for compute (L1) and inference optimization (L3). What changed at GTC Taipei 2026 is that NVIDIA moved aggressively into models (L2), agents (L6), and consumer hardware (L7). RTX Spark puts NVIDIA inside laptops for the first time, competing with Intel, AMD, Qualcomm, and Apple. The open-weight approach to Nemotron and Cosmos means the ecosystem builds on top rather than being locked out, but the vertical integration potential is clear. When your GPU supplier also ships models, agent frameworks, and complete PCs, the competitive dynamics of the entire stack change.

The open-weight ecosystem is the real connective tissue. Llama (Meta), Gemma (Google), Qwen (Alibaba), Phi (Microsoft), and Mistral exist in a shared infrastructure layer (HuggingFace + Ollama + vLLM) that belongs to no single company. This ecosystem is model-provider-independent by design. A memory framework built on Mem0 + Ollama + Qwen runs entirely on your hardware, depends on no cloud provider, and costs nothing per query. That's new. Two years ago, using AI meant paying a cloud provider per token.

Memory is the least mature layer across all providers. Every major provider has something (ChatGPT memory, Copilot Memory, Claude memory, Gemini memory) but none are architecturally sophisticated. GitHub Copilot Memory (agentic, cross-agent, validated against codebase) is the most interesting, but it's scoped to repositories, not people. The standalone memory frameworks (Mem0, Zep, Letta) are more advanced than what any major provider ships natively. (For depth on this, see the AI Memory dossier.)

MCP is winning the connectivity layer. Co-founded by Anthropic, OpenAI, and Block. Supported by Google, Microsoft, AWS, NVIDIA (via Agent Toolkit and OpenShell). Natively supported by Qwen. ~97M monthly SDK downloads, 9,400+ servers in the official registry as of May 2026. The only holdout of note is Apple (proprietary App Intents framework). MCP becoming the standard means that tools, memory systems, and agent frameworks can be built once and work across any model from any provider. (For depth, see the MCP dossier.)

The agent layer is the current battlefield. Every provider shipped agent products or frameworks in the last six months. Google's ADK, OpenAI's Agents SDK, Anthropic's Agent SDK, AWS's Strands, xAI's multi-agent Grok, Alibaba's Qwen-Agent, Microsoft's Copilot Studio + Agent 365 + the new Microsoft Agent Framework 1.0 (GA April 3, 2026), NVIDIA's NemoClaw + Agent Toolkit + OpenShell (GTC Taipei, June 1, 2026). The frameworks are converging on similar patterns (tool calling, planning loops, human-in-the-loop). NVIDIA's angle is distinctive: they are not building the agents themselves but the governance and runtime layer that makes autonomous agents safe for enterprise use. Differentiation is shifting from "can your agent use tools?" (everyone can) to "can your agent remember, learn, and improve?" which brings it back to the memory layer. (See the Agents dossier.)

6. What's next

This document maps where things stand today. Possible next steps:

Deep-dive on any single provider's stack end-to-end
Compare Google vs Microsoft vs Anthropic for a specific use case (e.g., building a local coding agent)
Explore the open-weight ecosystem path: what can you build on a Mac with Qwen + Ollama + Mem0 + MCP without any cloud dependency?
Track how the memory layer evolves as GPT-6, Grok 5, and next-gen Gemini ship
Map the embodied AI players (Tesla Optimus, Figure, 1X) against this same stack