<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>A100 on Rik Kisnah - Blog</title><link>https://www.rik-kisnah.ai/tags/a100/</link><description>Recent content in A100 on Rik Kisnah - Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 15 Nov 2025 00:00:00 -0700</lastBuildDate><atom:link href="https://www.rik-kisnah.ai/tags/a100/feed.xml" rel="self" type="application/rss+xml"/><item><title>The Complete NCCL Reference Guide: Commands, Errors, and Troubleshooting for OCI GPU Infrastructure</title><link>https://www.rik-kisnah.ai/posts/nccl-complete-reference-guide/</link><pubDate>Sat, 15 Nov 2025 00:00:00 -0700</pubDate><guid>https://www.rik-kisnah.ai/posts/nccl-complete-reference-guide/</guid><description>Disclaimer: This article reflects my personal research and analysis based on publicly available information and is not representative of my employer&amp;rsquo;s official position.
Executive Summary: NCCL (NVIDIA Collective Communications Library) is the cornerstone of distributed GPU computing, enabling efficient communication between GPUs in multi-node clusters. This guide covers the NCCL commands, parameters, error messages, and troubleshooting techniques you need for successful deployment on Oracle Cloud Infrastructure (OCI).
Table of Contents: Why NCCL Exists; Understanding Collective Communications; NCCL Fundamentals; Complete NCCL Commands Reference; All NCCL Environment Variables; NCCL Error Messages and Solutions; OCI GPU-Specific Configurations; Advanced Troubleshooting Scenarios; Performance Tuning Reference; Quick Reference Tables. Why NCCL Exists: The Distributed Training Challenge: Modern AI models have grown exponentially in size and complexity.</description></item></channel></rss>