Inside the blueprint on center software

3/21/2024

How to Manage Congestion Efficiently in AI/ML Cluster Networks
Using ECN and PFC Together to Build Lossless Ethernet Networks
How Visibility into Network Behavior Improves Transport and Troubleshooting
Network Design to Accommodate the Best Performance of an AI/ML Cluster
Using Cisco Nexus Dashboard Fabric Controller to Automate Your AI/ML Network

Artificial intelligence and machine learning (AI/ML) applications are becoming increasingly commonplace in data centers. Machine learning, a subset of AI, is one of the most common applications. ML is the ability of computer systems to learn to make decisions and predictions from observations and data.

Today, widely available GPU-accelerated servers create the flexibility to design and train custom deep neural networks. The availability of better server hardware, along with commonly used programming languages such as Python and C/C++ and frameworks such as PyTorch, TensorFlow, and JAX, which are built to take advantage of GPUs natively, has simplified the building of GPU-accelerated ML applications. These applications can be used for many purposes, such as advanced medical research, computer-aided drug discovery, natural language processing, self-driving vehicles, making shopping recommendations, and recognizing images in a video stream.

In many cases, building ML applications starts with training deep neural networks with large datasets across multiple iterations. Neural networks take advantage of GPU clusters that can be made up of thousands of GPUs, usually with several GPUs per server. These servers often have dual 100Gb network interface cards (NICs) connected to separate switches, with strict networking requirements.

Deep learning models have highly flexible architectures that allow them to learn directly from raw data. Training deep learning models with large data sets can increase their predictive accuracy. As expected, these applications generate high volumes of data that must be collected and processed in real time and are shared across multiple devices, sometimes numbering in the thousands.

Inference frameworks take the knowledge from trained neural network models and apply it to new data to predict outcomes. Inference clusters can have different requirements and are optimized for performance. Deep learning systems are optimized to handle large amounts of data to process and re-evaluate results. Inference systems may use smaller data sets but be hyper-scaled to many devices. Examples of this are applications in smartphones or self-driving cars.

Some of the learning cycles discussed above can take days, or even weeks, to complete with very large data sets. When communication between the server clusters involved in learning cycles has high latency or packet drops, the learning job can take much longer to complete, or in some cases fail. Because of this, AI workloads have stringent infrastructure requirements.

AI applications take advantage of, and expect, low-latency, lossless networks. To achieve this, network administrators need to deploy the right hardware and software features, along with a configuration that supports AI application needs. AI applications also need networks that can provide visibility into hot spots so they can be tuned as necessary. Finally, AI applications should take advantage of automation frameworks to make sure the entire network fabric is configured correctly and there is no configuration drift.

The Cisco Nexus 9000 switches have the hardware and software capabilities available today to provide the right latency, congestion management mechanisms, and telemetry to meet the requirements of AI/ML applications. Coupled with tools such as Cisco Nexus Dashboard Insights for visibility and Nexus Dashboard Fabric Controller for automation, Cisco Nexus 9000 switches become ideal platforms to build a high-performance AI/ML network fabric. This document provides a best practice blueprint for building a modern network environment that will allow AI/ML workloads to run at their best using shipped hardware and software features.
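The claim that high latency or packet drops can make a learning job take much longer can be illustrated with a toy model. The sketch below is not taken from the blueprint: it assumes synchronous training, where every iteration waits for the slowest server to finish exchanging data, and every name and number in it (StepModel, the timing values, the penalty per drop) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepModel:
    """Hypothetical per-iteration timings, in seconds (illustrative only)."""
    compute_s: float             # GPU compute time per iteration
    transfer_s: float            # data exchange time on a healthy network
    retransmit_penalty_s: float  # assumed extra delay per loss-recovery event


def iteration_time(model, worker_drops):
    """Time for one synchronous iteration.

    worker_drops[i] is the number of loss-recovery events worker i hit.
    The iteration finishes only when the slowest worker has finished.
    """
    slowest = max(
        model.transfer_s + drops * model.retransmit_penalty_s
        for drops in worker_drops
    )
    return model.compute_s + slowest


def job_time(model, drops_per_iteration):
    """Total job time: the sum of all iteration times."""
    return sum(iteration_time(model, drops) for drops in drops_per_iteration)


model = StepModel(compute_s=0.100, transfer_s=0.020, retransmit_penalty_s=0.200)

# Four workers, 1000 iterations, no drops anywhere.
clean = [[0, 0, 0, 0]] * 1000

# Same job, but a single worker sees one drop in every tenth iteration.
lossy = [[1, 0, 0, 0] if i % 10 == 0 else [0, 0, 0, 0] for i in range(1000)]

clean_total = job_time(model, clean)
lossy_total = job_time(model, lossy)
```

With these made-up numbers the clean run totals about 120 seconds and the lossy run about 140 seconds, even though only one of four workers ever sees a drop. The model ignores overlap of compute and communication, so it should be read as an intuition aid and not as a performance prediction.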
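The point about using automation to make sure there is no configuration drift can be sketched as a comparison between intended and running configuration. This is a minimal illustration under stated assumptions, not the Nexus Dashboard Fabric Controller API: the function name find_drift, the switch names, and the configuration strings are all hypothetical placeholders, and the sketch treats a configuration as a flat set of lines, which ignores the hierarchy of a real switch configuration.

```python
def find_drift(intended, running):
    """Compare intended and running config lines per switch.

    Both arguments map a switch name to a list of config lines.
    Returns a dict containing only the switches that have drifted.
    """
    report = {}
    for switch, want in intended.items():
        have = running.get(switch, [])
        # Normalize whitespace and ignore blank lines (simplifying assumption).
        want_set = {line.strip() for line in want if line.strip()}
        have_set = {line.strip() for line in have if line.strip()}
        missing = sorted(want_set - have_set)
        unexpected = sorted(have_set - want_set)
        if missing or unexpected:
            report[switch] = {"missing": missing, "unexpected": unexpected}
    return report


# Placeholder data: these strings are not real switch commands.
intended = {
    "leaf1": ["setting-a on", "setting-b 9216"],
    "leaf2": ["setting-a on"],
    "leaf3": ["setting-a on"],
}
running = {
    "leaf1": ["setting-a on"],
    "leaf2": ["setting-a on", "setting-c off"],
    "leaf3": ["  setting-a on  "],
}

drift = find_drift(intended, running)
```

Here leaf1 is reported as missing a line, leaf2 as carrying an unexpected one, and leaf3 is absent from the report because its configuration matches once whitespace is normalized. A production tool would compare structured configuration and would also handle remediation.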
