# Background
At VOIAR (Vision Only Intelligent Autonomous Robot), we were developing a forest-ready robot (codenamed “5Earl”) with a person-following feature. The system split duties between:
- A Raspberry Pi for low-level control and sensor aggregation.
- A Jetson Nano for vision inference and communications.
Initially, the robot could be steered via a mobile web app (replacing the previous laptop + PS4 controller workflow). The next milestone: let it follow a person autonomously. Imagine a three-wheeled helper trailing you through a forest, carrying a load of up to 120 kg.
# The Challenge
YOLO (“You Only Look Once”) was our go-to for real-time object detection. However:
- Jetson Nano's JetPack 4.6 supports only CUDA 10.2, while PyTorch's official JetPack images target newer releases.
- No existing Docker image combined:
  - JetPack 4.x support
  - PyTorch 1.11
  - TensorRT 8.2.0.6
  - an ONNX → TensorRT export workflow
Goal: Build a Docker image that runs YOLO with maximum GPU utilization on Jetson Nano.
# Solution
## 1. Custom Docker Base Image
I started from NVIDIA's `l4t-pytorch:r35.2.1-pth2.0-py3` base image (PyTorch 2.0 + CUDA 11.7), then:
- Downgraded to PyTorch 1.11 (compatible with CUDA 10.2).
- Installed TensorRT 8.2.0.6 and related dependencies.
- Exposed the correct `/usr/lib/aarch64-linux-gnu/` paths for all CUDA, cuDNN, and TensorRT libraries.
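In Dockerfile form, the recipe has roughly this shape. This is a minimal sketch, not the exact build: the PyTorch wheel path and the package list are placeholders that depend on your JetPack setup.

```dockerfile
# Sketch only: the wheel path and package list below are placeholders.
FROM nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3

# Swap the bundled PyTorch 2.0 for a 1.11 build compatible with CUDA 10.2
# (on Jetson this is an NVIDIA-provided aarch64 wheel; the path is a placeholder).
RUN pip3 uninstall -y torch torchvision && \
    pip3 install /tmp/torch-1.11.0-cp38-cp38-linux_aarch64.whl

# ONNX export tooling and the YOLO runtime.
RUN pip3 install onnx ultralytics

# Make the CUDA, cuDNN, and TensorRT 8.2.0.6 shared libraries visible at runtime.
ENV LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu:${LD_LIBRARY_PATH}
```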
## 2. Model Export Workflow
To squeeze out every bit of performance:
- Convert the `.pt` model to ONNX.
- Warm up the ONNX graph with a sample image (on the target Jetson Nano).
- Build a `.engine` file using TensorRT's Python API on-device.
- Load the `.engine` at runtime for inference.
```python
import json  # Using json instead of yaml
import os
import shutil
import time
from datetime import datetime

from ultralytics import YOLO

# Get today's date in the required format
today_date = datetime.now().strftime("[%Y-%m-%dT%H:%M]")
# Define the base directory
base_dir = f"/home/ftpuser/{today_date}"


class ModelTester:
    def __init__(self, model_name, img_size=(640, 480)):
        self.img_size = img_size
        self.model_name = model_name
        self.best_model_path = os.path.join(base_dir, f"{model_name}.pt")
        self.settings_path = os.path.join(base_dir, "settings.json")  # Changed to .json

    def create_model(self):
        # Load the pretrained .pt weights
        self.pt_model = YOLO(f"./assets/{self.model_name}.pt")
        # Warm up with a sample prediction
        self.pt_model.predict("https://ultralytics.com/images/bus.jpg", imgsz=(self.img_size[1], self.img_size[0]))

    def export_model(self):
        # Export the model to a TensorRT engine with FP16 (half) precision
        self.pt_model.export(format="engine", device="cuda", imgsz=(self.img_size[1], self.img_size[0]), half=True)

    def move_the_model(self):
        os.makedirs(base_dir, exist_ok=True)
        # (Training checkpoints land in runs/detect/train/weights by default)
        # shutil.copy(f"runs/detect/train/weights/{self.model_name}.pt", self.best_model_path)
        # The export writes the .onnx and .engine files next to the source weights in ./assets
        shutil.move(f"./assets/{self.model_name}.engine", os.path.join(base_dir, f"{self.model_name}.engine"))
        shutil.move(f"./assets/{self.model_name}.onnx", os.path.join(base_dir, f"{self.model_name}.onnx"))
        print(f"Moved the files to {base_dir} 🤝🔥🔥")
        self.save_settings()

    def save_settings(self):
        settings = {
            "model_name": self.model_name,
            "img_size": self.img_size,
            "training_data": "coco.yaml",
            "epochs": 1,
            "classes": [0],
            "device": "cuda",
            "dynamic": True,
            "batch": -1,
        }
        with open(self.settings_path, "w") as file:
            json.dump(settings, file)
        print(f"Saved settings to {self.settings_path}")


if __name__ == "__main__":
    tester = ModelTester("yolov8_20240717_coco(imgsz480x640)")
    tester.create_model()
    tester.export_model()
    tester.move_the_model()
    time.sleep(60000)  # Arbitrary delay to simulate extended operation
```
This ensures the final inference runs entirely in TensorRT, bypassing slower PyTorch C++ ops.
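Loading the exported engine at runtime goes through the same Ultralytics API; here is a minimal sketch (the engine path is hypothetical):

```python
from ultralytics import YOLO

# Passing a .engine file makes Ultralytics run inference through TensorRT
# rather than PyTorch. The path below is a placeholder.
model = YOLO("./assets/yolov8_coco.engine")
results = model.predict("https://ultralytics.com/images/bus.jpg")
```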
# Results
| Metric | Improvement |
|---|---|
| Total inference time (300 images) | –41.97% |
| Avg. inference time per image | –43.33% |
| Frames per second (FPS) | +72.29% |
| Images per minute | +72.31% |
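Numbers like these can be reproduced with a simple timing loop over a fixed image set. A minimal sketch follows; the engine path and image directory are placeholders:

```python
import time

from ultralytics import YOLO

# Placeholders: point these at your exported engine and a local test set.
model = YOLO("./assets/yolov8_coco.engine")
images = [f"./test_images/{i:03d}.jpg" for i in range(300)]

start = time.perf_counter()
for img in images:
    model.predict(img, verbose=False)
elapsed = time.perf_counter() - start

print(f"total: {elapsed:.2f} s | "
      f"avg: {elapsed / len(images) * 1000:.1f} ms/img | "
      f"FPS: {len(images) / elapsed:.1f}")
```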
# Conclusion
By crafting a tailored Docker image and integrating a robust ONNX → TensorRT export pipeline, I delivered substantial performance gains on legacy Jetson devices. This patch has been merged into the Ultralytics main repo and is actively used by over 30,000 developers and projects.
- Repo & utilities: MWLCDev/Yolo-Export
- Training data: Freely available on RoboFlow
- Pull request discussion: ultralytics/ultralytics#13100
Feel free to dive into the code, reproduce the benchmarks, or adapt this image for your own Jetson-based edge deployments!