Physical AI · Layer 3

Everything about Physical AI Software

The software stack of a Physical AI system has three sub-layers: the runtime, the AI models, and the tools that connect them. The runtime layer is JetPack 6.2, with CUDA 12.6 and TensorRT 10.3. Perception models include YOLO26 for object detection, Whisper Turbo for speech, DeepStream for multi-camera, and SAM3 for segmentation. Decision and action models include GR00T N1.7, imitation versus reinforcement learning, Nav2 for navigation, and MoveIt 2 for arms. The world-model debate is NVIDIA's Cosmos 3.0 versus Yann LeCun's V-JEPA 2.1. The connecting tools are ROS2, the Isaac platform, Newton physics, Hugging Face LeRobot, AgenticROS, and NemoClaw.

Reference   Last updated May 17, 2026

1. The software stack

The software layer has three sub-layers stacked on top of each other, each depending on the one below: runtime and OS (foundation), frameworks and platforms (tools), AI models (intelligence).

Figure: The software stack inside a Physical AI robot. Top: AI models (the intelligence): YOLO26, Whisper, GR00T N1.7, Cosmos 3.0, V-JEPA 2.1, local LLMs. These are trained and deployed using the middle layer, frameworks and platforms (the tools): ROS2, Isaac Sim / Lab / ROS, Newton, LeRobot, AgenticROS, NemoClaw. That layer runs on top of the bottom layer, runtime and OS (the foundation): JetPack 6.2 (Ubuntu 22.04 + CUDA 12.6 + cuDNN 9.3 + TensorRT 10.3).
Three sub-layers of robot software. Each depends on the one below.

The dataflow: sensor data arrives, the runtime processes it (CUDA), perception models interpret it (YOLO, Whisper), decision models choose actions (GR00T), commands go to actuators. The frameworks (ROS2, Isaac, LeRobot) handle communication, training, and simulation between everything.
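
The dataflow above can be sketched as a sense, perceive, decide, act loop. This is a minimal illustration only: the function names, the brightness rule standing in for a perception model, and the command string format are all hypothetical placeholders, not any framework's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str
    confidence: float


def perceive(frame: List[List[int]]) -> List[Detection]:
    """Stand-in for a perception model: reports a 'cup' if any pixel is bright."""
    bright = any(pixel > 200 for row in frame for pixel in row)
    return [Detection("cup", 0.9)] if bright else []


def decide(detections: List[Detection]) -> str:
    """Stand-in for a decision model: pick a high-level action."""
    for det in detections:
        if det.label == "cup" and det.confidence > 0.5:
            return "reach"
    return "wait"


def act(action: str) -> str:
    """Stand-in for the actuator interface: format the command that would be sent."""
    return f"actuator_command:{action}"


def step(frame: List[List[int]]) -> str:
    """One pass through the loop: sensor data in, actuator command out."""
    return act(decide(perceive(frame)))


dark_frame = [[0, 10], [20, 30]]
bright_frame = [[0, 10], [255, 30]]
print(step(dark_frame))
print(step(bright_frame))
```

In a real system each stand-in would be a separate process, with the framework layer carrying messages between them.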

2. Runtime and OS

The runtime layer sits directly on the hardware. You install it once when you set up your Jetson, and everything else runs on top of it.

JetPack SDK (the bundle)

JetPack is not a single piece of software but a bundle that installs everything a Jetson needs to run AI. JetPack 6.2 is the current production release (verified May 2026).

Component | Version | What it does
Jetson Linux (BSP) | 36.4.3 | Ubuntu 22.04-based OS with NVIDIA drivers, bootloader, Linux Kernel 5.15
CUDA Toolkit | 12.6 | Parallel computing platform. Lets AI code run on the GPU instead of the CPU (100-1000x faster)
cuDNN | 9.3 | CUDA Deep Neural Network library. Optimized building blocks for neural networks
TensorRT | 10.3 | Inference optimizer. Makes trained models run as fast as possible on NVIDIA hardware
VPI | 3.2 | Vision Programming Interface. CV algorithms accelerated on GPU and dedicated vision hardware
OpenCV | 4.8+ | Open Computer Vision library. Image processing, camera capture, basic CV

JetPack 7 (for Jetson Thor) is based on SBSA (Server Base System Architecture) and ships CUDA 13.0. It aligns Jetson Thor with industry-standard ARM server design.

CUDA: why NVIDIA wins

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. A GPU has thousands of small cores that process data in parallel; a CPU has 4-16 large cores. For AI workloads (matrix multiplication, image processing), GPUs are 100-1000x faster.

Nearly every major AI framework and model family (PyTorch, TensorFlow, YOLO, Whisper, Stable Diffusion) runs on CUDA. A model that runs on an NVIDIA data center GPU can generally also run on a Jetson, within the Jetson's memory and compute limits, because both use CUDA. Skills, code, and models transfer across NVIDIA devices. No other chip maker has an ecosystem of comparable reach.

TensorRT: making models fast

You train a model in PyTorch on a powerful cloud GPU. Deployed unoptimized on a Jetson, it might run at around 5 fps. TensorRT analyzes the model, fuses layers, reduces precision (from 32-bit to 16-bit or 8-bit), and optimizes memory layout. The same model can then run at 30+ fps. TensorRT.
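
To make "reduces precision" concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It shows only the arithmetic idea (floats mapped to integers plus one scale factor). It is not TensorRT's API or its calibration procedure, and the weight values are invented for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [v * scale for v in quantized]


weights = [0.82, -1.27, 0.003, 0.45, -0.9]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each weight now fits in one byte instead of four, at the cost of a round-trip error of at most half the scale factor. Smaller values mean less memory traffic, which is one reason reduced precision speeds up inference.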

Install resources: JetPack overview | JetPack 6.2 details | Orin Nano getting-started guide

3. Perception models

Perception models answer: "What is the robot seeing and hearing right now?" They take raw sensor data and extract meaningful information (objects, speech, depth).

YOLO26 (object detection)

Object detection takes a camera image and answers: "What objects are in this image, and where are they?" The output is a set of bounding boxes with labels and confidence scores.

YOLO (You Only Look Once) is the dominant model family. It processes the entire image in a single pass, fast enough for real-time use on edge hardware. YOLO26 (January 2026) is the latest: end-to-end NMS-free inference (removes the post-processing step previous versions needed), 43% faster on CPU, edge-optimized. MuSGD optimizer brings LLM training techniques into CV. Supports detection, segmentation, classification, pose estimation, and oriented bounding boxes in one model.
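
The post-processing step that NMS-free inference removes is non-maximum suppression (NMS), which discards duplicate boxes for the same object. A minimal plain-Python sketch of what earlier detectors had to run after the model follows. The boxes, scores, and the 0.5 threshold are illustrative values, and this is not Ultralytics code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS. Each detection is (box, score, label).

    Visit boxes from highest to lowest score and keep a box only if it does
    not overlap an already-kept box of the same label too strongly.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(det[2] != k[2] or iou(det[0], k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    ((10, 10, 110, 110), 0.92, "cup"),
    ((12, 14, 108, 112), 0.85, "cup"),      # near-duplicate of the first box
    ((200, 50, 260, 120), 0.78, "bottle"),
]
kept = nms(detections)
for box, score, label in kept:
    print(label, score, box)
```

An end-to-end model emits one box per object directly, so this loop (and its threshold tuning) drops out of the deployment pipeline.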

On a Jetson Orin Nano, expect 30-60+ fps depending on model size and resolution. Download a YOLO26 model (e.g., yolo26n.pt for the smallest "nano" size), point it at a camera feed, and every object is labeled in real time. YOLO26 docs | Ultralytics GitHub.

Whisper (speech-to-text)

OpenAI's open-source speech-to-text model, trained on 680,000 hours of multilingual audio. It handles accents, background noise, and 99 languages, and runs locally on a Jetson with no cloud connection needed.

Size | Parameters | Best for
Tiny | 39M | Quick demos, low-power devices (32x real-time on GPU)
Base / Small | 74M / 244M | Real-time on Jetson Orin Nano. Good accuracy/speed balance.
Medium / Large-v3 | 769M / 1.5B | Highest accuracy. Large-v3 has 128 Mel bins.
Large-v3 Turbo | 809M | Recommended. 5.4x faster than Large with nearly identical accuracy.

Word Error Rate: 2.7% on clean English audio (benchmark), 5-6% in real-world conditions. Matches professional human transcriptionists on clean audio. Whisper GitHub.
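
Word Error Rate is the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the model's output, divided by the number of reference words. A minimal sketch of the calculation, with invented example sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "pick up the red cup on the table"
hypothesis = "pick up the red cap on table"
print(word_error_rate(reference, hypothesis))  # 0.25: one substitution, one deletion, 8 words
```

A 2.7% WER therefore means roughly 27 word-level errors per 1,000 reference words.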

DeepStream (multi-camera pipelines)

NVIDIA's SDK for building video analytics pipelines. Handles multiple camera streams: decoding, AI inference, tracking, alerts. Hardware-accelerated on Jetson. Used for smart security (4 cameras detecting people / packages / animals), retail analytics, industrial inspection, traffic monitoring. YOLO detects objects in a single image; DeepStream manages the whole pipeline. DeepStream.

Other perception models

Task | What it does | Common models
Depth estimation | Estimate distance from a single 2D image (no depth sensor needed) | MiDaS, ZoeDepth, Depth Anything v2
Semantic segmentation | Label every pixel ("road," "tree") | SAM3 (Meta), DeepLab
Pose estimation | Detect human body keypoints from images / video | YOLO26 Pose, MediaPipe, OpenPose
Face detection / recognition | Find faces, optionally identify who | RetinaFace, ArcFace, InsightFace
OCR | Read text from images. License plates, signs, documents. | PaddleOCR, EasyOCR, Tesseract

4. Decision and action models

Decision and action models answer: "Given what I'm seeing, what should the robot do?" They take perception output and produce physical actions.

GR00T (NVIDIA's robot brain)

GR00T (Generalist Robot 00 Technology) is NVIDIA's family of foundation models for humanoid robots. Like GPT for text, but for robot control. Takes three inputs: camera images, language instructions, body state (joint positions). Outputs motor commands.

Version | Date | Key capability
GR00T N1 | GTC 2025 | First open foundation model for humanoid reasoning. Cross-embodiment (works across different robot bodies).
GR00T N1.5 | 2025 | Generated using GR00T-Dreams synthetic data. 40% improvement combining synthetic + real data.
GR00T N1.6 | CES 2026 | Open reasoning VLA model. Full body control. Uses Cosmos Reason for context.
GR00T N1.7 | May 2026 | Whole-body humanoid control via Unitree G1. Predicts latent action tokens for coordinated locomotion + manipulation.

Cross-embodiment: a GR00T model trained on one robot body can be adapted to another with fine-tuning. This is what makes it a "foundation" model.

GR00T-Dreams: NVIDIA's synthetic data generation blueprint. Generated 780,000 synthetic trajectories (equivalent to 9 months of human demonstrations) in 11 hours using Isaac Sim. GR00T GitHub | Developer page.

Imitation learning vs reinforcement learning

Aspect | Imitation learning | Reinforcement learning (RL)
How it works | You demonstrate the task; robot reproduces it. | Robot tries things; gets rewards / penalties; improves over thousands of attempts.
Speed to set up | Fast. | Slow. Needs simulation (Isaac Lab) for safety and scale.
Generalization | Limited to demonstrated tasks. | More robust. Can discover novel strategies.
Compute | Less for training. Runs on a laptop. | Massive (GPU clusters in simulation).
Used by | LeRobot, SO-ARM101. | Robot dogs learning to walk. Isaac Lab.

Modern approaches combine both: start with imitation learning (fast, gets close to correct behavior), refine with RL (makes it robust). GR00T N1.7 uses this combined approach.
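
The "imitate first, then refine" pattern can be sketched on a toy one-parameter controller. Everything here is an assumption for illustration: the task (action = gain × state), the imperfect demonstrator with gain -1.6, the "true" optimum of -2.0, and the use of simple random-search hill climbing as a stand-in for a real RL algorithm. It is not how GR00T or LeRobot are trained.

```python
import random

rng = random.Random(0)
OPTIMAL_GAIN = -2.0  # hypothetical best controller: action = -2.0 * state
EVAL_STATES = [i / 10 for i in range(-10, 11)]


def reward(gain):
    """Higher is better: negative mean squared gap to the optimal action."""
    gaps = [(gain * s - OPTIMAL_GAIN * s) ** 2 for s in EVAL_STATES]
    return -sum(gaps) / len(gaps)


# 1. Imitation learning: fit the gain to an imperfect human demonstrator
#    (least-squares behavior cloning on recorded state/action pairs).
demo_states = [rng.uniform(-1, 1) for _ in range(50)]
demo_actions = [-1.6 * s + rng.gauss(0, 0.05) for s in demo_states]
bc_gain = sum(s * a for s, a in zip(demo_states, demo_actions)) / sum(
    s * s for s in demo_states
)

# 2. Reward-driven refinement: try random perturbations and keep the ones
#    that earn more reward (hill climbing, standing in for RL).
rl_gain = bc_gain
for _ in range(200):
    candidate = rl_gain + rng.gauss(0, 0.1)
    if reward(candidate) > reward(rl_gain):
        rl_gain = candidate

print(f"imitation gain: {bc_gain:.2f}, reward {reward(bc_gain):.4f}")
print(f"refined gain:   {rl_gain:.2f}, reward {reward(rl_gain):.4f}")
```

Imitation gets the policy close to the demonstrator quickly but inherits the demonstrator's flaws. The reward-driven stage starts from that policy and pushes past it.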

Navigation and path planning

Task | What it does | Key tool
SLAM | Build a map while tracking position within it. Foundational for any mobile robot. | Isaac ROS Visual SLAM, ORB-SLAM, rtabmap
Path planning | Calculate best route. Update in real time as obstacles move. | Nav2 (ROS2 navigation stack)
Obstacle avoidance | React to unexpected obstacles in real time. | Costmap layers in Nav2
Arm motion planning | Plan collision-free arm movements. Grasp planning. | MoveIt 2
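
To illustrate path planning with replanning, here is a minimal A* search over a small occupancy grid. It is a teaching sketch under simplifying assumptions (4-connected grid, unit step cost, binary obstacles), not Nav2's planner or its costmap format.

```python
import heapq


def plan_path(grid, start, goal):
    """A* over a 4-connected occupancy grid (0 = free, 1 = obstacle).

    Returns the list of (row, col) cells from start to goal, or None if
    the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost_so_far = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > cost_so_far[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < cost_so_far.get(nxt, float("inf")):
                    cost_so_far[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = plan_path(grid, (0, 0), (2, 0))
print(path)

# An obstacle appears in the only gap: replanning reports that no route exists.
grid[1][2] = 1
replanned = plan_path(grid, (0, 0), (2, 0))
print(replanned)
```

"Update in real time" amounts to rerunning the planner (or repairing the previous plan) whenever the map changes.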

5. World models

World models answer: "If I do this action, what will happen next?" They let a robot "imagine" consequences before executing. This is the most cutting-edge area, with two fundamentally different approaches.

NVIDIA Cosmos (simulation-heavy)

Family of "world foundation models" that generate synthetic training data and predict physical outcomes.

Model | What it does
Cosmos Transfer 2.5 | Real video to synthetic training data. Augments small real-world datasets.
Cosmos Predict 2.5 | Predicts future states of physical environments.
Cosmos Reason 2 | Open reasoning VLM. Perception layer for GR00T N1.6+.
Cosmos 3.0 (GTC 2026) | First world model unifying synthetic world generation, vision reasoning, and action simulation.

All Cosmos models are open on Hugging Face. The NVIDIA approach: build detailed simulations, generate vast synthetic data, train on it, deploy. This requires massive compute (DGX clusters, Isaac Sim). Cosmos overview.

LeCun's JEPA (observation-heavy)

Yann LeCun (Turing Award winner) left Meta in late 2025 and founded AMI Labs in Paris. AMI raised $1.03B at a $3.5B valuation in March 2026. LeCun's argument: LLMs are a "dead end" for genuine intelligence because they predict text tokens without understanding the physical world.

JEPA (Joint Embedding Predictive Architecture): instead of predicting every pixel of what happens next, predicts in "latent space" (abstract representations). "The cup will fall" rather than generating a video of the cup falling. More efficient because it ignores unpredictable details and focuses on physics.
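
A toy comparison shows why scoring a prediction in an abstract representation is less sensitive to unpredictable detail than scoring it pixel by pixel. This is only an illustration of the idea. The hand-written averaging "encoder" below is an assumption for the sketch, whereas JEPA learns its encoder and predictor from data.

```python
import random

rng = random.Random(0)


def encode(frame):
    """Toy fixed encoder: summarize a 1-D 'frame' by the mean of each half."""
    half = len(frame) // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (len(frame) - half)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# The predictable structure: a bright object ends up in the right half.
predicted_frame = [0.0] * 32 + [1.0] * 32
# What actually happens: the same structure plus unpredictable per-pixel detail.
actual_frame = [p + rng.gauss(0, 0.3) for p in predicted_frame]

pixel_loss = mse(predicted_frame, actual_frame)
latent_loss = mse(encode(predicted_frame), encode(actual_frame))

print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

The prediction got the structure right, yet the pixel-space loss penalizes it for noise it could never have predicted. The latent-space loss mostly averages that noise away.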

Model | Key result
V-JEPA 2 (June 2025) | 1.2B parameters. Trained on 1M+ hours of video + 62 hours of robot data. 65-80% zero-shot robot control. 30x faster than Cosmos on video benchmarks.
V-JEPA 2.1 (March 16, 2026) | SOTA robot navigation. 10x faster planning. Improved training: all tokens (visible and masked) contribute to the self-supervised loss. Closes gap with DINOv2 on dense features.

The 62-hour number: V-JEPA 2 achieves competitive robot control from 62 hours of real robot data (on top of video pre-training). GR00T needed 780,000 synthetic trajectories. This efficiency difference is LeCun's core argument: learn from observation like a baby, do not simulate everything.

Honest assessment: V-JEPA results are real and published. $1B gives AMI Labs 4-5 years of runway. But AMI is months old and has shipped nothing commercially; NVIDIA has 110+ deployed partners. Whichever approach wins, perception, deployment, and governance skills stay relevant. V-JEPA 2 blog | paper | AMI Labs.

The governance gap
Neither Cosmos nor JEPA addresses governance. Who decides what the robot learns? What boundaries constrain its autonomous behavior? How do you audit a world model's internal representations? NVIDIA's NemoClaw is a first step (monitoring AI agent intent), but it is a security tool, not a governance framework. The governance layer guide covers this in depth.

6. Frameworks and platforms

The tools that connect runtime, models, and hardware into working robot applications. They handle camera input, model execution, motor output, communication between processes, simulation, training, deployment.

ROS2 (Robot Operating System version 2)

ROS2 is not an actual operating system. It is a set of libraries and tools for robot software: sensor processing, motor control, navigation, perception. It is the de facto industry standard, used widely across robotics companies. It is built around nodes (independent processes that do one thing), topics (named channels nodes publish / subscribe to), services (request / response between nodes), and actions (long-running goals like "navigate to point B").

Current release: ROS2 Jazzy Jalisco (LTS, 2024-2029). ROS2 docs.
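
The node-and-topic pattern can be shown with a tiny in-process message bus. This is a plain-Python stand-in and deliberately not the rclpy API. The TopicBus class and the topic names /camera/detections and /arm/command are hypothetical.

```python
from collections import defaultdict


class TopicBus:
    """In-process stand-in for topic-based publish / subscribe messaging."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to run for every message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of a topic."""
        for callback in self._subscribers[topic]:
            callback(message)


bus = TopicBus()
commands = []  # plays the role of a motor-driver node's inbox


def planner_node(msg):
    """A 'node' that does one thing: turn cup detections into arm commands."""
    if msg["label"] == "cup":
        bus.publish("/arm/command", {"action": "grasp", "target": msg["label"]})


bus.subscribe("/camera/detections", planner_node)
bus.subscribe("/arm/command", commands.append)

# The perception node publishes a detection; the planner reacts to it.
bus.publish("/camera/detections", {"label": "cup", "confidence": 0.91})
print(commands)
```

Because nodes only share topic names, any one of them can be swapped out (a different detector, a simulated arm) without touching the others. In ROS2 the same decoupling holds across processes and machines.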

The Isaac platform

NVIDIA's robotics platform family. Named after Isaac Asimov.

Component | What it does
Isaac Sim | Photorealistic 3D robotics simulator. Digital twins. Synthetic data generation. Built on NVIDIA Omniverse.
Isaac Lab | GPU-accelerated reinforcement learning inside Isaac Sim. Train thousands of robots simultaneously. Lab 3.0 early access at GTC 2026.
Isaac ROS | CUDA-accelerated ROS2 stack. GPU-powered perception, navigation, motor control on the robot. 10-100x faster than CPU-only ROS2.
Isaac GR00T | Foundation models for humanoid robots. The robot brain.

Newton (physics engine)

Open-source GPU physics engine built in collaboration with Google DeepMind and Disney Research. Newton 1.0 reached GA in April 2026. Makes simulation physics match real-world physics closely enough that sim-trained skills transfer to real robots. Disney Research uses Newton for its robotic character platform (Olaf BDX droids). Newton.

Hugging Face LeRobot

The accessible open-source robotics framework. PyTorch-based. Covers data collection, training, simulation, and deployment. Used with SO-ARM100 / 101 and Reachy Mini. Imitation learning is the primary approach.

NVIDIA Isaac models integrate into LeRobot as of CES 2026. The workflow: collect data with LeRobot, train with Isaac Lab, simulate in Isaac Sim, deploy on Jetson.

Reachy Mini App Store (May 6, 2026): Hugging Face launched an open-source app store for Reachy Mini with 200+ apps from 150 creators. ~10,000 Reachy Mini devices already shipped or in transit. Apps require no coding knowledge to build or use. Hugging Face's "ML Intern" agent generates and refines apps. The iOS App Store model, but for robot intelligence. Launch post | LeRobot.

AgenticROS (Claude-to-robot)

Open-source bridge connecting AI agents (Claude Code, Claude Desktop, Gemini, MCP) to ROS2 robots via natural language. Type commands in Claude, the robot executes. Created by Chris Matthieu. The lowest-friction path from software-AI skills (Claude, MCP) to physical robot control. agenticros.com.

NemoClaw (governance)

NVIDIA's security layer for AI agents (announced GTC 2026). It monitors AI reasoning processes in real time and enforces safety guardrails: it "inspects the intent of the AI's logic." As noted under world models, it is a first step toward governance rather than a full governance framework. Works alongside AgenticROS. Covered in depth in the Governance guide.

Other tools

Tool | What it does
Roboflow | Computer vision model training and deployment. Annotate images, train custom YOLO models, deploy to edge.
Weights & Biases | ML experiment tracking, model versioning, collaboration.
Ultralytics | Makers of YOLO. YOLO26 is the latest.
MoveIt 2 | Motion planning framework for robotic arms. Built on ROS2. Standard for manipulation.
Open Robotics | Maintainers of ROS2 and Gazebo.

7. How a typical Physical AI app composes

A simple example: a robotic arm picking up a cup.

  1. Hardware (Layer 2): SO-ARM101 arm on a Jetson Orin Nano with a USB camera and a RealSense D435 depth sensor.
  2. Runtime: JetPack 6.2 boots, CUDA 12.6 is ready, TensorRT 10.3 has optimized your perception model.
  3. Perception: the USB camera streams images. YOLO26 detects "cup" with bounding box. The D435 reports depth: cup is 32 cm away.
  4. Decision: LeRobot policy (trained earlier via imitation learning) takes the cup's position and outputs joint angles for each of the 6 servos.
  5. Actuators: joint commands flow through ROS2 to the Feetech STS3215 servos. The arm moves, gripper closes.
  6. Optional natural-language layer: AgenticROS lets you type "pick up the red cup" in Claude. NemoClaw watches the reasoning and stops execution if a guardrail trips.
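
Between steps 3 and 4, the detection and the depth reading have to become a 3D position the policy can use. A minimal sketch with the standard pinhole camera model follows. The intrinsics (FX, FY, CX, CY) and the bounding box are made-up values, and the sketch assumes the detection pixel is already aligned with the depth image, which in practice requires calibrating the two sensors against each other.

```python
def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) at a known depth -> (x, y, z) in meters.

    x points right, y points down, z points forward, all in the camera frame.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image; real values come from calibration.
FX, FY = 600.0, 600.0
CX, CY = 320.0, 240.0

# Hypothetical detector output for "cup": (x1, y1, x2, y2) in pixels.
bbox = (380, 200, 460, 300)
u = (bbox[0] + bbox[2]) / 2  # bounding-box center, horizontal
v = (bbox[1] + bbox[3]) / 2  # bounding-box center, vertical

depth_m = 0.32  # the 32 cm depth reading from the example

cup_xyz = pixel_to_camera_xyz(u, v, depth_m, FX, FY, CX, CY)
print(tuple(round(value, 4) for value in cup_xyz))
```

The resulting point is in the camera's frame. A further fixed transform (camera to arm base) is needed before a policy or motion planner can turn it into joint angles.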

8. Resource directory