Everything about Physical AI Software

1. The software stack

The software layer has three sub-layers stacked on top of each other, each depending on the one below: runtime and OS (foundation), frameworks and platforms (tools), AI models (intelligence).

Three sub-layers of robot software. Each depends on the one below.

The dataflow: sensor data arrives, the runtime processes it (CUDA), perception models interpret it (YOLO, Whisper), decision models choose actions (GR00T), and commands go to the actuators. The frameworks (ROS2, Isaac, LeRobot) connect these stages and handle communication, training, and simulation.
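The dataflow above can be sketched as a chain of four functions. This is a minimal illustration in plain Python, not real driver or model code: every function name here (read_sensor, perceive, decide, actuate) is a hypothetical stand-in, and the "image" is a 2x2 list of numbers.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float


def read_sensor() -> list[list[int]]:
    """Stand-in for a camera driver: returns a tiny 2x2 grayscale 'image'."""
    return [[0, 255], [255, 0]]


def perceive(image: list[list[int]]) -> list[Detection]:
    """Stand-in for a perception model such as an object detector."""
    bright_pixels = sum(value > 128 for row in image for value in row)
    return [Detection("cup", 0.9)] if bright_pixels >= 2 else []


def decide(detections: list[Detection]) -> str:
    """Stand-in for a decision model: map what was seen to an action."""
    confident = [d for d in detections if d.confidence >= 0.5]
    return "grasp" if any(d.label == "cup" for d in confident) else "wait"


def actuate(action: str) -> str:
    """Stand-in for the actuator interface: format a motor command."""
    return f"command:{action}"


def run_once() -> str:
    """One pass through the loop: sensor -> perception -> decision -> actuator."""
    return actuate(decide(perceive(read_sensor())))


print(run_once())
```

On a real robot each stage is a separate component (a camera driver, a TensorRT-optimized model, a policy, a motor controller), and the loop runs many times per second.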

2. Runtime and OS

The runtime layer sits directly on the hardware. You install it once when you set up your Jetson, and everything else runs on top of it.

JetPack SDK (the bundle)

JetPack is not a single piece of software but a bundle that installs everything a Jetson needs to run AI. JetPack 6.2 is the current production release (verified May 2026).

Component	Version	What it does
Jetson Linux (BSP)	36.4.3	Ubuntu 22.04-based OS with NVIDIA drivers, bootloader, Linux Kernel 5.15
CUDA Toolkit	12.6	Parallel computing platform. Lets AI code run on the GPU instead of the CPU (up to 100-1000x faster on highly parallel workloads)
cuDNN	9.3	CUDA Deep Neural Network library. Optimized building blocks for neural networks
TensorRT	10.3	Inference optimizer. Makes trained models run as fast as possible on NVIDIA hardware
VPI	3.2	Vision Programming Interface. CV algorithms accelerated on GPU and dedicated vision hardware
OpenCV	4.8+	Open Source Computer Vision Library. Image processing, camera capture, basic CV

JetPack 7 (for Jetson Thor) is based on SBSA (Server Base System Architecture) and ships CUDA 13.0. SBSA aligns Jetson Thor with industry-standard ARM server design.

CUDA: why NVIDIA wins

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. A GPU has thousands of small cores that process data in parallel; a typical CPU has 4-16 large cores. For highly parallel AI workloads (matrix multiplication, image processing), a GPU can be 100-1000x faster than a CPU, depending on the model and the hardware.

Nearly every AI framework (PyTorch, TensorFlow) and the popular models built on them (YOLO, Whisper, Stable Diffusion) run on CUDA. A model that runs on an NVIDIA data center GPU can also run on a Jetson, memory permitting, because both use CUDA. Skills, code, and models transfer between NVIDIA devices. No other chip maker has an ecosystem this broad.

TensorRT: making models fast

You train a model in PyTorch on a powerful cloud GPU. Deployed unoptimized on a Jetson, it might run at 5 fps. TensorRT analyzes the model, fuses layers, reduces precision (from 32-bit to 16-bit or 8-bit), and optimizes memory layout. The same model can then run at 30+ fps. TensorRT.
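To make "reduces precision" concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It illustrates the idea only. It is not TensorRT's calibration algorithm or API, and the function names (quantize_int8, dequantize) and the example weights are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.03, 0.5, -0.66]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

Each weight now needs 1 byte instead of 4, and the round-trip error is bounded by half the scale. That trade (a small, bounded accuracy loss for less memory traffic and faster integer math) is where much of the speedup comes from.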

Install resources: JetPack overview | JetPack 6.2 details | Orin Nano getting-started guide

3. Perception models

Perception models answer: "What is the robot seeing and hearing right now?" They take raw sensor data and extract meaningful information (objects, speech, depth).

YOLO26 (object detection)

Object detection takes a camera image and answers: "What objects are in this image, where are they, and what are they?" Output is bounding boxes with labels and confidence scores.

YOLO (You Only Look Once) is the dominant model family. It processes the entire image in a single pass, fast enough for real-time use on edge hardware. YOLO26 (January 2026) is the latest version. It offers end-to-end NMS-free inference (removing the non-maximum suppression post-processing step previous versions needed), up to 43% faster CPU inference, and an edge-optimized design. Its MuSGD optimizer brings LLM training techniques into CV. It supports detection, segmentation, classification, pose estimation, and oriented bounding boxes in one model family.
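For context, this is the kind of post-processing step that NMS-free inference removes. The sketch below is a plain-Python greedy non-maximum suppression (NMS) over made-up detections. It is class-agnostic for brevity (production pipelines usually run NMS per class) and is not Ultralytics code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best["box"], d["box"]) < iou_threshold]
    return kept


detections = [
    {"label": "cup", "score": 0.91, "box": (100, 100, 200, 200)},
    {"label": "cup", "score": 0.78, "box": (105, 105, 205, 205)},  # near-duplicate
    {"label": "cup", "score": 0.66, "box": (400, 120, 480, 210)},  # a second cup
]
for det in nms(detections):
    print(det["label"], det["score"], det["box"])
```

The near-duplicate box (IoU of about 0.82 with the best box) is discarded, and the two distinct cups survive. An end-to-end model emits one box per object directly, so this loop and its threshold tuning disappear from the deployment path.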

On a Jetson Orin Nano, expect 30-60+ fps depending on model size and resolution. Download a YOLO26 model (e.g., yolo26n.pt for the smallest "nano" size), point it at a camera feed, and every object is labeled in real time. YOLO26 docs | Ultralytics GitHub.

Whisper (speech-to-text)

Whisper is OpenAI's open-source speech-to-text model, trained on 680,000 hours of multilingual audio. It handles accents, background noise, and 99 languages, and runs locally on a Jetson with no cloud needed.

Size	Parameters	Best for
Tiny	39M	Quick demos, low-power devices (32x real-time on GPU)
Base / Small	74M / 244M	Real-time on Jetson Orin Nano. Good accuracy/speed balance.
Medium / Large-v3	769M / 1.5B	Highest accuracy. Large-v3 has 128 Mel bins.
Large-v3 Turbo	809M	Recommended. 5.4x faster than Large with nearly identical accuracy.
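The parameter counts in the table translate directly into a lower bound on memory. The sketch below is back-of-the-envelope arithmetic under one assumption: weights are stored at 2 bytes per parameter (FP16) or 1 byte (INT8). It counts weights only. Activations, audio buffers, and runtime overhead come on top, so treat the numbers as a floor, not a measurement.

```python
# Parameter counts in millions, taken from the table above.
WHISPER_PARAMS_M = {
    "Tiny": 39,
    "Base": 74,
    "Small": 244,
    "Medium": 769,
    "Large-v3": 1500,
    "Large-v3 Turbo": 809,
}


def weight_memory_mb(params_millions, bytes_per_param=2):
    """Approximate weight storage in MB (millions of params x bytes per param)."""
    return params_millions * bytes_per_param


for name, params in WHISPER_PARAMS_M.items():
    fp16 = weight_memory_mb(params)                     # 16-bit weights
    int8 = weight_memory_mb(params, bytes_per_param=1)  # 8-bit weights
    print(f"{name:<15} FP16 ~{fp16:>5} MB   INT8 ~{int8:>5} MB")
```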

Word Error Rate (WER): 2.7% on clean English audio (benchmark), 5-6% in real-world conditions. On clean audio this is comparable to professional human transcriptionists. Whisper GitHub.
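Word Error Rate is the word-level edit distance between the model's transcript and a reference transcript, divided by the number of reference words. A minimal plain-Python sketch follows. The example sentences are made up, and real evaluations also normalize punctuation and number formatting before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


reference = "pick up the red cup on the table"
hypothesis = "pick up the red cap on table"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Here one substitution ("cup" to "cap") and one deletion ("the") against 8 reference words give a WER of 0.25. A 2.7% WER means roughly 27 such errors per 1,000 reference words.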

DeepStream (multi-camera pipelines)

NVIDIA's SDK for building video analytics pipelines. Handles multiple camera streams: decoding, AI inference, tracking, alerts. Hardware-accelerated on Jetson. Used for smart security (4 cameras detecting people / packages / animals), retail analytics, industrial inspection, traffic monitoring. YOLO detects objects in a single image; DeepStream manages the whole pipeline. DeepStream.

Other perception models

Task	What it does	Common models
Depth estimation	Estimate distance from a single 2D image (no depth sensor needed)	MiDaS, ZoeDepth, Depth Anything v2
Semantic segmentation	Label every pixel ("road," "tree")	SAM3 (Meta), DeepLab
Pose estimation	Detect human body keypoints from images / video	YOLO26 Pose, MediaPipe, OpenPose
Face detection / recognition	Find faces, optionally identify who	RetinaFace, ArcFace, InsightFace
OCR	Read text from images. License plates, signs, documents.	PaddleOCR, EasyOCR, Tesseract

4. Decision and action models

Decision and action models answer: "Given what I'm seeing, what should the robot do?" They take perception output and produce physical actions.

GR00T (NVIDIA's robot brain)

GR00T (Generalist Robot 00 Technology) is NVIDIA's family of foundation models for humanoid robots: like GPT for text, but for robot control. It takes three inputs (camera images, language instructions, and body state such as joint positions) and outputs motor commands.
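The three-inputs-to-motor-commands shape can be written down as a small interface. This is a hypothetical sketch, not the GR00T API: Observation and toy_policy are invented names, and the "policy" is a hand-written rule standing in for a neural network.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    image: list[list[int]]        # camera frame (tiny grayscale stand-in)
    instruction: str              # natural-language command
    joint_positions: list[float]  # current body state, in radians


def toy_policy(obs: Observation, step: float = 0.05) -> list[float]:
    """Hypothetical stand-in for a VLA model: returns target joint positions.

    A real model runs a neural network; this stub only nudges every joint
    towards zero when the instruction contains 'home'.
    """
    if "home" not in obs.instruction.lower():
        return list(obs.joint_positions)  # hold position
    targets = []
    for q in obs.joint_positions:
        delta = max(-step, min(step, -q))  # move at most `step` rad per call
        targets.append(q + delta)
    return targets


obs = Observation(image=[[0, 0], [0, 0]], instruction="Go home",
                  joint_positions=[0.20, -0.03, 0.0])
print(toy_policy(obs))
```

The point is the signature: image, instruction, and joint state go in, and a vector of joint targets comes out on every control step.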

Version	Date	Key capability
GR00T N1	GTC 2025	First open foundation model for humanoid reasoning. Cross-embodiment (works across different robot bodies).
GR00T N1.5	2025	Trained using GR00T-Dreams synthetic data. 40% improvement combining synthetic + real data.
GR00T N1.6	CES 2026	Open reasoning VLA model. Full body control. Uses Cosmos Reason for context.
GR00T N1.7	May 2026	Whole-body humanoid control via Unitree G1. Predicts latent action tokens for coordinated locomotion + manipulation.

Cross-embodiment: a GR00T model trained on one robot body can be adapted to another with fine-tuning. This is what makes it a "foundation" model.

GR00T-Dreams: NVIDIA's synthetic data generation blueprint. Generated 780,000 synthetic trajectories (equivalent to 9 months of human demonstrations) in 11 hours using Isaac Sim. GR00T GitHub | Developer page.

Imitation learning vs reinforcement learning

Aspect	Imitation learning	Reinforcement learning (RL)
How it works	You demonstrate the task; robot reproduces it.	Robot tries things; gets rewards / penalties; improves over thousands of attempts.
Speed to set up	Fast.	Slow. Needs simulation (Isaac Lab) for safety and scale.
Generalization	Limited to demonstrated tasks.	More robust. Can discover novel strategies.
Compute	Less for training. Runs on a laptop.	Massive (GPU clusters in simulation).
Used by	LeRobot, SO-ARM101.	Robot dogs learning to walk. Isaac Lab.

Modern approaches combine both: start with imitation learning (fast, gets close to correct behavior), refine with RL (makes it robust). GR00T N1.7 uses this combined approach.
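The "imitate first, then refine" pattern can be shown on a toy 1-D problem. This is a minimal sketch under loud assumptions: the policy is a single gain (action = gain * error), imitation is a least-squares fit to four made-up demonstrations, and a random-search hill climb stands in for a real RL algorithm. None of this is GR00T, LeRobot, or Isaac Lab code.

```python
import random


def fit_gain(demos):
    """Imitation learning (behavior cloning): least-squares fit of action = gain * state."""
    numerator = sum(state * action for state, action in demos)
    denominator = sum(state * state for state, _ in demos)
    return numerator / denominator


def episode_reward(gain, start=1.0, steps=5):
    """Reward for driving a 1-D position error towards zero (higher is better)."""
    error, total = start, 0.0
    for _ in range(steps):
        error -= gain * error        # apply the policy's action
        total -= abs(error)          # penalize remaining error
    return total


def refine_with_rl(gain, iterations=200, noise=0.05, seed=0):
    """Trial-and-error refinement: keep random perturbations that raise reward."""
    rng = random.Random(seed)
    best_gain, best_reward = gain, episode_reward(gain)
    for _ in range(iterations):
        candidate = best_gain + rng.uniform(-noise, noise)
        reward = episode_reward(candidate)
        if reward > best_reward:
            best_gain, best_reward = candidate, reward
    return best_gain


# Demonstrations from a cautious human: action is roughly 0.5 * state.
demos = [(1.0, 0.5), (0.5, 0.25), (-0.8, -0.4), (0.2, 0.1)]
cloned = fit_gain(demos)
refined = refine_with_rl(cloned)
print(f"cloned gain={cloned:.2f}, reward={episode_reward(cloned):.3f}")
print(f"refined gain={refined:.2f}, reward={episode_reward(refined):.3f}")
```

Cloning reproduces the demonstrator's cautious gain of 0.5 from four examples. The refinement loop then needs hundreds of trial episodes to improve on it, which is why the RL stage usually runs in simulation.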

Navigation and path planning

Task	What it does	Key tool
SLAM	Build a map while tracking position within it. Foundational for any mobile robot.	Isaac ROS Visual SLAM, ORB-SLAM, rtabmap
Path planning	Calculate best route. Update in real time as obstacles move.	Nav2 (ROS2 navigation stack)
Obstacle avoidance	React to unexpected obstacles in real time.	Costmap layers in Nav2
Arm motion planning	Plan collision-free arm movements. Grasp planning.	MoveIt 2
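Path planning with replanning can be illustrated on a small occupancy grid. The sketch below uses breadth-first search on a 4-connected grid, chosen for brevity. It is not Nav2's planner, and the grid, start, and goal are made up.

```python
from collections import deque


def plan_path(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (1 = obstacle), via BFS."""
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:      # walk back from goal to start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable


grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = plan_path(grid, start=(0, 0), goal=(2, 0))
print(path)

grid[1][3] = 1  # a new obstacle blocks the only gap
print(plan_path(grid, start=(0, 0), goal=(2, 0)))
```

The first call routes around the wall through the gap on the right. After the gap closes, the planner reports that no route exists. A real navigation stack reruns planning like this continuously against a costmap that sensors keep updating.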

5. World models

World models answer: "If I do this action, what will happen next?" They let a robot "imagine" consequences before executing. This is the most cutting-edge area, with two fundamentally different approaches.

NVIDIA Cosmos (simulation-heavy)

Family of "world foundation models" that generate synthetic training data and predict physical outcomes.

Model	What it does
Cosmos Transfer 2.5	Real video to synthetic training data. Augments small real-world datasets.
Cosmos Predict 2.5	Predicts future states of physical environments.
Cosmos Reason 2	Open reasoning VLM. Perception layer for GR00T N1.6+.
Cosmos 3 (GTC Taipei, June 2026)	First open "omnimodel" unifying vision reasoning, world generation, and action generation. Available in Super (32B) and Nano (8B). The Nano variant runs on DGX Spark and RTX Spark consumer hardware, enabling local Physical AI development without cloud compute.

All Cosmos models are open on Hugging Face. The NVIDIA approach: build detailed simulations, generate vast synthetic data, train on it, deploy. The Cosmos 3 Nano (8B) variant changes the accessibility story: with RTX Spark laptops shipping fall 2026 at consumer price points, Physical AI model experimentation moves from data-center-only to a laptop. Cosmos overview.

LeCun's JEPA (observation-heavy)

Yann LeCun (Turing Award winner) left Meta in late 2025 and founded AMI Labs in Paris. AMI raised $1.03B at a $3.5B valuation in March 2026. LeCun's argument: LLMs are a "dead end" for genuine intelligence because they predict text tokens without understanding the physical world.

JEPA (Joint Embedding Predictive Architecture): instead of predicting every pixel of what happens next, it predicts in "latent space" (abstract representations). It predicts "the cup will fall" rather than generating a video of the cup falling. This is more efficient because it ignores unpredictable details and focuses on the underlying physics.
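A toy example shows why predicting in latent space sidesteps unpredictable detail. In this sketch a "frame" is 8 numbers: one bright object plus random background noise. The encoder and the dynamics rule are hand-written stand-ins (JEPA learns both from data), so this illustrates the argument, not the architecture.

```python
import random


def render(position, rng, width=8):
    """A 1-D 'frame': one bright object plus unpredictable background noise."""
    frame = [rng.uniform(0.0, 0.3) for _ in range(width)]
    frame[position] = 1.0
    return frame


def encode(frame):
    """Toy encoder: keep only the object's position, discard the noise."""
    return frame.index(max(frame))


def predict_latent(latent):
    """Toy dynamics in latent space: the object moves one cell right."""
    return latent + 1


rng = random.Random(0)
frame_now = render(2, rng)
frame_next = render(3, rng)

# Pixel-space prediction: shift the current frame right by one pixel.
pixel_prediction = [frame_now[-1]] + frame_now[:-1]
pixel_error = sum((p - t) ** 2 for p, t in zip(pixel_prediction, frame_next))

# Latent-space prediction: compare abstract states only.
latent_error = (predict_latent(encode(frame_now)) - encode(frame_next)) ** 2

print(f"pixel-space error:  {pixel_error:.4f}")
print(f"latent-space error: {latent_error}")
```

Both predictors get the object's motion right. The pixel-space predictor is still penalized for background noise it could never have predicted, while the latent-space predictor scores a perfect zero.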

Model	Key result
V-JEPA 2 (June 2025)	1.2B parameters. Trained on 1M+ hours of video + 62 hours of robot data. 65-80% zero-shot robot control. 30x faster than Cosmos on video benchmarks.
V-JEPA 2.1 (March 16, 2026)	SOTA robot navigation. 10x faster planning. Improved training: all tokens (visible and masked) contribute to the self-supervised loss. Closes gap with DINOv2 on dense features.

The 62-hour number: V-JEPA 2 achieves competitive robot control from 62 hours of real robot data (on top of video pre-training). GR00T needed 780,000 synthetic trajectories. This efficiency difference is LeCun's core argument: learn from observation like a baby, do not simulate everything.

Honest assessment: V-JEPA results are real and published. $1B gives AMI Labs 4-5 years of runway. But AMI is months old and has shipped nothing commercially; NVIDIA has 110+ deployed partners. Whichever approach wins, perception, deployment, and governance skills stay relevant. V-JEPA 2 blog | paper | AMI Labs.

The governance gap

Neither Cosmos nor JEPA addresses governance. Who decides what the robot learns? What boundaries constrain its autonomous behavior? How do you audit a world model's internal representations? NVIDIA's NemoClaw is a first step (monitoring AI agent intent), but it is a security tool, not a governance framework. The governance layer guide covers this in depth.

6. Frameworks and platforms

The tools that connect runtime, models, and hardware into working robot applications. They handle camera input, model execution, motor output, communication between processes, simulation, training, deployment.

ROS2 (Robot Operating System version 2)

ROS2 is not an actual OS. It is a set of libraries and tools for robot software: sensor processing, motor control, navigation, perception. It is the de facto industry standard, used by most robotics companies and research labs. It is built around nodes (independent processes that do one thing), topics (named channels nodes publish / subscribe to), services (request / response between nodes), and actions (long-running goals like "navigate to point B").
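The node-and-topic pattern can be imitated in a few lines of plain Python. This is an in-process sketch only, not rclpy: the Bus class and the topic names are hypothetical, delivery is synchronous, and real ROS2 nodes are separate processes exchanging typed messages over middleware. Services and actions are not shown.

```python
from collections import defaultdict


class Bus:
    """In-process stand-in for the ROS2 middleware: routes messages by topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


bus = Bus()
motor_log = []


def detector_node(image):
    """Node 1: consumes camera frames, publishes detections."""
    label = "cup" if sum(image) > 100 else "nothing"
    bus.publish("/detections", label)


def controller_node(label):
    """Node 2: consumes detections, publishes motor commands."""
    bus.publish("/cmd", "close_gripper" if label == "cup" else "idle")


bus.subscribe("/camera/image", detector_node)
bus.subscribe("/detections", controller_node)
bus.subscribe("/cmd", motor_log.append)

bus.publish("/camera/image", [90, 80, 10])   # bright frame
bus.publish("/camera/image", [5, 5, 5])      # dark frame
print(motor_log)
```

Neither node knows the other exists. Each only knows topic names, which is what lets you swap a detector or controller without touching the rest of the system.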

Current release: ROS2 Jazzy Jalisco (LTS, 2024-2029). ROS2 docs.

The Isaac platform

NVIDIA's robotics platform family. Named after Isaac Asimov.

Component	What it does
Isaac Sim	Photorealistic 3D robotics simulator. Digital twins. Synthetic data generation. Built on NVIDIA Omniverse.
Isaac Lab	GPU-accelerated reinforcement learning inside Isaac Sim. Train thousands of robots simultaneously. Lab 3.0 early access at GTC 2026.
Isaac ROS	CUDA-accelerated ROS2 stack. GPU-powered perception, navigation, motor control on the robot. 10-100x faster than CPU-only ROS2.
Isaac GR00T	Foundation models for humanoid robots. The robot brain.

Newton (physics engine)

Open-source GPU physics engine built in collaboration with Google DeepMind and Disney Research. Newton 1.0 reached GA in April 2026. Makes simulation physics match real-world physics closely enough that sim-trained skills transfer to real robots. Disney Research uses Newton for its robotic character platform (Olaf BDX droids). Newton.

Hugging Face LeRobot

The accessible open-source robotics framework. PyTorch-based. Covers data collection, training, simulation, and deployment. Used with SO-ARM100 / 101 and Reachy Mini. Imitation learning is the primary approach.

NVIDIA Isaac models integrate into LeRobot as of CES 2026. The workflow: collect data with LeRobot, train with Isaac Lab, simulate in Isaac Sim, deploy on Jetson.

Reachy Mini App Store (May 6, 2026): Hugging Face launched an open-source app store for Reachy Mini with 200+ apps from 150 creators. ~10,000 Reachy Mini devices already shipped or in transit. Apps require no coding knowledge to build or use. Hugging Face's "ML Intern" agent generates and refines apps. The iOS App Store model, but for robot intelligence. Launch post | LeRobot.

AgenticROS (Claude-to-robot)

Open-source bridge connecting AI agents (Claude Code, Claude Desktop, Gemini, MCP) to ROS2 robots via natural language. Type commands in Claude, the robot executes. Created by Chris Matthieu. The lowest-friction path from software-AI skills (Claude, MCP) to physical robot control. agenticros.com.

NemoClaw (governance)

NVIDIA's security and governance layer for AI agents (announced GTC 2026). Monitors AI reasoning processes in real time and enforces safety guardrails. "Inspects the intent of the AI's logic." Works alongside AgenticROS. Covered in depth in the Governance guide.

Other tools

Tool	What it does
Roboflow	Computer vision model training and deployment. Annotate images, train custom YOLO models, deploy to edge.
Weights & Biases	ML experiment tracking, model versioning, collaboration.
Ultralytics	Makers of YOLO. YOLO26 is the latest.
MoveIt 2	Motion planning framework for robotic arms. Built on ROS2. Standard for manipulation.
Open Robotics	Maintainers of ROS2 and Gazebo.

7. How a typical Physical AI app composes

A simple example: a robotic arm picking up a cup.

Hardware (Layer 2): SO-ARM101 arm on a Jetson Orin Nano with a USB camera and a RealSense D435 depth sensor.
Runtime: JetPack 6.2 boots, CUDA 12.6 is ready, TensorRT 10.3 has optimized your perception model.
Perception: the USB camera streams images. YOLO26 detects "cup" with bounding box. The D435 reports depth: cup is 32 cm away.
Decision: LeRobot policy (trained earlier via imitation learning) takes the cup's position and outputs joint angles for each of the 6 servos.
Actuators: joint commands flow through ROS2 to the Feetech STS3215 servos. The arm moves, gripper closes.
Optional natural-language layer: AgenticROS lets you type "pick up the red cup" in Claude. NemoClaw watches the reasoning and stops execution if a guardrail trips.
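One step the walkthrough glosses over is turning "bounding box plus depth" into a 3-D position the policy can use. A minimal sketch using the standard pinhole camera model follows. The intrinsics (FX, FY, CX, CY) and the bounding box are made-up values, and the sketch assumes the depth reading is already aligned to the color image. Real values come from camera calibration and the detector.

```python
def bbox_center(box):
    """Center pixel (u, v) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole camera model: pixel (u, v) at depth Z -> (X, Y, Z) in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


# Hypothetical intrinsics for a 640x480 image; real values come from calibration.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

cup_box = (350, 260, 410, 340)   # assumed detector output, in pixels
depth_m = 0.32                   # depth reading at the box center (32 cm)

u, v = bbox_center(cup_box)
x, y, z = deproject(u, v, depth_m, FX, FY, CX, CY)
print(f"cup at X={x:.3f} m, Y={y:.3f} m, Z={z:.3f} m in the camera frame")
```

The result is in the camera's coordinate frame. Before the arm can use it, it still has to be transformed into the arm's base frame using the known camera mounting pose.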

8. Resource directory

Jetson AI Lab (hands-on tutorials for GenAI on Jetson)
JetPack SDK
Isaac platform
Newton physics
Cosmos world models
Isaac GR00T (GitHub)
Hugging Face LeRobot
ROS2 Jazzy docs
Ultralytics (YOLO)
Whisper
AgenticROS
V-JEPA (Meta AI)