Training modern large-scale models and running High-Performance Computing (HPC) workloads both rely on thousands of GPUs and compute nodes that exchange data continuously. Network switches interconnect these nodes and carry the traffic between GPUs and servers. Without high-performance switches, even systems equipped with top-tier hardware cannot operate at full speed. In short, switches strongly influence the execution speed and stability of AI training and HPC tasks.
What Challenges Do Large Model Training and HPC Face?
Large model training and HPC have strict network requirements. Both workloads share two main challenges.
First, they need high bandwidth. Large models exchange gradients, weights, and activation data continuously during training, and HPC simulations exchange intermediate results in real time. Together this can mean moving terabytes of data quickly. Insufficient bandwidth leads to congestion.
Second, they need ultra-low latency. Latency is the time data takes to travel from sender to receiver. Even a small delay makes GPUs wait, and over a job that runs for hours or days the waiting accumulates, which can stretch the overall run time to two or even three times the original.
Legacy switches cannot solve these problems. They become the main bottleneck for AI and HPC clusters.
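To see how small per-step stalls add up, here is a back-of-the-envelope sketch in Python. The step count, compute time, and stall values are illustrative assumptions, not measurements from any real cluster.

```python
def total_training_hours(steps: int, compute_s: float, stall_s: float) -> float:
    """Wall-clock hours when every step pays compute time plus a network stall."""
    return steps * (compute_s + stall_s) / 3600.0


# Hypothetical job: 1,000,000 training steps with 0.20 s of pure compute per step.
STEPS = 1_000_000
COMPUTE_S = 0.20

baseline_hours = total_training_hours(STEPS, COMPUTE_S, stall_s=0.0)

# Try a few assumed per-step network stalls and compare against the baseline.
for stall_s in (0.05, 0.20, 0.40):
    hours = total_training_hours(STEPS, COMPUTE_S, stall_s)
    ratio = hours / baseline_hours
    print(f"stall {stall_s:.2f} s/step: {hours:6.1f} h ({ratio:.2f}x baseline)")
```

Under these assumptions, a stall equal to the compute time doubles the run, and a stall twice as long as the compute time triples it.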

What Is an NDR 400Gb/s InfiniBand Switch?
An NDR 400Gb/s InfiniBand switch is a high-speed network device. It follows the InfiniBand standard and supports the NDR (Next Data Rate) transmission rate of 400Gb/s. It is mainly used to connect thousands of GPUs and compute nodes while keeping data transmission latency extremely low.
The main features include a non-blocking architecture, an on-board subnet manager (SM), and in-network computing technology (SHARP). The Mellanox MQM9700-HS2F (NVIDIA Quantum-2 QM9700) is a typical 1U NDR switch designed for artificial intelligence training and high-performance computing (HPC) environments. It delivers ultra-low latency and high bandwidth.
Why NDR 400Gb/s Speed Matters for AI and HPC
NDR 400Gb/s doubles the per-port bandwidth of the previous generation, HDR 200Gb/s. This change can improve cluster performance.
For AI training, distributed jobs need communication between GPUs. The higher the bandwidth, the faster the data exchange, and the shorter the training time.
For HPC, scientific simulations need high-throughput, low-latency data flow. NDR 400Gb/s helps keep compute nodes supplied with data so that jobs run without stalling.
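The effect of doubling link bandwidth can be sketched with simple arithmetic. The 10 GB payload below is a hypothetical figure, and the model is idealized: it ignores latency, protocol overhead, and congestion.

```python
def transfer_seconds(payload_gigabytes: float, link_gbps: float) -> float:
    """Ideal time to move a payload over one link (bandwidth term only)."""
    payload_gigabits = payload_gigabytes * 8  # bytes to bits
    return payload_gigabits / link_gbps


HDR_GBPS = 200  # HDR InfiniBand per-port rate
NDR_GBPS = 400  # NDR InfiniBand per-port rate
PAYLOAD_GB = 10.0  # hypothetical amount of data exchanged per synchronization

hdr_s = transfer_seconds(PAYLOAD_GB, HDR_GBPS)
ndr_s = transfer_seconds(PAYLOAD_GB, NDR_GBPS)

print(f"HDR: {hdr_s:.2f} s, NDR: {ndr_s:.2f} s, speedup: {hdr_s / ndr_s:.1f}x")
```

In practice the gain is usually smaller than 2x, because only the bandwidth-bound part of communication scales with link speed.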
Core Technologies of High-Performance InfiniBand Switches
NVIDIA Quantum-2 ASIC and SHARP v3
The core of a good InfiniBand switch is its ASIC. The Mellanox MQM9700-HS2F uses the NVIDIA Quantum-2 ASIC. It supports NDR 400Gb/s speed.
It also includes SHARP v3. SHARP v3 is an in-network computing technology that offloads collective communication operations from servers to the switches. This reduces the amount of data that has to cross the network and accelerates artificial intelligence training tasks; NVIDIA describes it as offering 32 times the AI acceleration power of the previous generation.
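One way to see why in-network reduction helps is to compare it with a host-based ring all-reduce. The sketch below uses a simplified textbook model; it is not a description of how SHARP v3 is implemented, and the node count and message size are assumptions.

```python
def ring_allreduce_bytes_sent(n_nodes: int, message_bytes: float) -> float:
    """Bytes each node sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_nodes - 1) / n_nodes * message_bytes


def ring_allreduce_steps(n_nodes: int) -> int:
    """Sequential communication steps in a ring all-reduce."""
    return 2 * (n_nodes - 1)


def in_network_bytes_sent(message_bytes: float) -> float:
    """Bytes each node sends when switches do the reduction: one copy upstream."""
    return message_bytes


N_NODES = 1024          # hypothetical cluster size
MESSAGE_BYTES = 1e9     # hypothetical 1 GB gradient buffer

ring_bytes = ring_allreduce_bytes_sent(N_NODES, MESSAGE_BYTES)
tree_bytes = in_network_bytes_sent(MESSAGE_BYTES)

print(f"ring:       {ring_bytes / 1e9:.3f} GB sent per node, "
      f"{ring_allreduce_steps(N_NODES)} steps")
print(f"in-network: {tree_bytes / 1e9:.3f} GB sent per node")
```

In this model each node sends roughly half as much data, and the number of sequential steps no longer grows linearly with the node count, which matters most for latency-sensitive collectives.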
Ultra-Low Latency Design
Latency directly affects GPU efficiency. The Mellanox MQM9700-HS2F delivers <100ns port-to-port latency. This is one of the lowest levels in the industry. It minimizes waiting time between nodes.
Onboard Fabric Orchestration
Large clusters need simplified management. The Mellanox MQM9700-HS2F has a built-in subnet manager (SM) that can control the entire InfiniBand fabric and support up to 2000 nodes without a separate management server. The switch runs the MLNX-OS operating system, which can be managed through the command line interface (CLI), the web user interface, or the Simple Network Management Protocol (SNMP).
Deep Dive: Mellanox MQM9700-HS2F (NVIDIA Quantum-2 QM9700)
The Mellanox MQM9700-HS2F is an NDR 400Gb/s InfiniBand switch designed for AI training and HPC clusters.
Core Features
64 × 400Gb/s NDR InfiniBand ports (32 OSFP connectors)
51.2Tb/s aggregate bidirectional non-blocking switching capacity
66.5B+ packets per second forwarding rate
1U rack-mount form factor
P2C (power-to-connector) airflow
1+1 redundant 2000W AC power supplies
N+1 redundant hot-swap fans
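The headline figures in this list are consistent with each other, as a quick check shows. The only assumption is that the 51.2Tb/s capacity counts traffic in both directions.

```python
PORTS = 64        # NDR ports on the switch
PORT_GBPS = 400   # per-port rate in Gb/s

# Capacity in one direction, converted from Gb/s to Tb/s.
one_direction_tbps = PORTS * PORT_GBPS / 1000

# Assumption: the quoted switching capacity counts both directions.
bidirectional_tbps = 2 * one_direction_tbps

# 64 ports on 32 OSFP connectors means two ports per connector.
ports_per_osfp = PORTS // 32

print(f"one direction: {one_direction_tbps} Tb/s")
print(f"bidirectional: {bidirectional_tbps} Tb/s")
print(f"ports per OSFP connector: {ports_per_osfp}")
```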
Technical Specifications
| Specification | Details |
| --- | --- |
| Model | MQM9700-HS2F (Quantum-2 QM9700) |
| Ports | 64 × 400Gb/s NDR InfiniBand |
| Latency | <100ns (port-to-port) |
| CPU | x86 Coffee Lake i3 |
| Memory | 8GB DDR4 |
| Airflow | P2C (power-to-connector) |
| Power | 2×2000W AC (1+1 redundant) |
How to Choose the Right NDR InfiniBand Switch
Step 1: Assess workload scale
Count the GPUs and compute nodes to establish how many ports you need and whether a Spine-Leaf or Fat-Tree topology is required.
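As a rough illustration of this step, the following Python sketch estimates switch counts for a non-blocking fabric built from 64-port switches. The function name and the sizing rules are assumptions for illustration: one port per node, identical switches, and leaf switches that split their ports evenly between nodes and uplinks.

```python
import math


def size_fat_tree(nodes: int, radix: int = 64) -> dict:
    """Estimate switch counts for a non-blocking fabric of `nodes` endpoints."""
    if nodes < 1:
        raise ValueError("nodes must be positive")
    if nodes <= radix:
        # Everything fits on a single switch, so no spine tier is needed.
        return {"tiers": 1, "leaf_switches": 1, "spine_switches": 0}

    down_ports = radix // 2          # leaf ports facing nodes
    max_nodes = radix * down_ports   # 64 * 32 = 2048 for 64-port switches
    if nodes > max_nodes:
        raise ValueError(f"{nodes} nodes exceed the two-tier limit of {max_nodes}")

    leaf_switches = math.ceil(nodes / down_ports)
    # Each leaf needs as many uplinks as node-facing ports to stay non-blocking.
    spine_switches = math.ceil(leaf_switches * down_ports / radix)
    return {
        "tiers": 2,
        "leaf_switches": leaf_switches,
        "spine_switches": spine_switches,
    }


for n in (48, 512, 2048):
    print(n, size_fat_tree(n))
```

A real design would round the spine count up so that uplinks divide evenly across spines, and would reserve ports for storage and management, so treat the result as a starting estimate only.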
Step 2: Verify hardware compatibility
Check whether the switch supports the InfiniBand speeds of your existing equipment. The MQM9700-HS2F supports HDR, EDR, FDR, and QDR rates, so it works with both new and older devices.
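A compatibility check of this kind can be sketched in a few lines of Python. The rate list follows this article, the Gb/s values are the nominal 4x link rates for each generation, and the helper name is hypothetical. The sketch assumes a link runs at the fastest rate both ends support.

```python
# Nominal 4x link rates in Gb/s for each InfiniBand generation.
NOMINAL_GBPS = {"QDR": 40, "FDR": 56, "EDR": 100, "HDR": 200, "NDR": 400}

# Rates the switch is assumed to support, per the list in this article.
SWITCH_RATES = set(NOMINAL_GBPS)


def best_common_rate(device_rates):
    """Return the fastest rate shared by the device and the switch, or None."""
    common = {rate.upper() for rate in device_rates} & SWITCH_RATES
    if not common:
        return None
    return max(common, key=NOMINAL_GBPS.__getitem__)


# Hypothetical inventory of existing adapters and the rates they support.
inventory = {
    "gpu-node-adapter": ["NDR", "HDR"],
    "storage-adapter": ["EDR", "FDR"],
    "legacy-adapter": ["SDR"],
}

for name, rates in inventory.items():
    rate = best_common_rate(rates)
    status = f"{rate} ({NOMINAL_GBPS[rate]} Gb/s)" if rate else "no common rate"
    print(f"{name}: {status}")
```

Always confirm the result against the vendor's compatibility documentation, since cables and transceivers also constrain which rates a link can use.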
Step 3: Evaluate form factor and cooling
Choose a 1U form factor to save rack space, and make sure the airflow direction matches your data center layout. P2C (power-to-connector) airflow is common in modern facilities.
Step 4: Plan for redundancy and uptime
Select switches equipped with redundant power supplies and fans. A hot-swap design allows maintenance without interrupting cluster operations. The Mellanox MQM9700-HS2F switch meets all of these requirements.
Real-World Use Cases
AI Training Clusters
Clusters that train large language models use the Mellanox MQM9700-HS2F. SHARP v3 reduces data movement, which shortens training time and improves GPU utilization.
HPC Scientific Computing
Weather forecasting and engineering simulation both rely on this kind of switch. High bandwidth and low latency ensure continuous data transmission, so that simulation tasks can be completed faster.
Hyperscale Data Centers
Cloud service providers use this switch for large-scale clusters. Its 64 ports connect a large number of nodes, and the 1U form factor saves rack space.
Hybrid Legacy Fabrics
Data centers with old HDR devices can add this switch. Backward compatibility allows smooth upgrades. No full replacement is needed.
FAQ
Q: What is the difference between NDR and HDR InfiniBand?
A: NDR InfiniBand operates at 400Gb/s, while HDR InfiniBand runs at 200Gb/s. NDR offers twice the bandwidth of HDR and supports in-network computing with SHARP v3.
Q: How many nodes can the MQM9700-HS2F support?
A: The MQM9700-HS2F's built-in subnet manager can support up to 2000 nodes. The switching architecture is suitable for small, medium, and large clusters.
Q: Is the MQM9700-HS2F backward compatible?
A: Yes. The MQM9700-HS2F is backward compatible with the HDR, EDR, FDR, and QDR generations of InfiniBand, so it can be integrated with existing InfiniBand hardware.
Conclusion
Network switches are core devices for large-scale model training and supercomputing, responsible for controlling the data flow between all nodes. High-performance switches can eliminate performance bottlenecks and significantly improve operating speed and scalability.
The Mellanox MQM9700-HS2F (NVIDIA Quantum-2 QM9700) is a strong choice. It offers 400Gb/s NDR transmission, ultra-low latency, and SHARP v3 support, and it suits the deployment needs of both large clusters and hybrid environments.
Selecting and deploying the right switches can not only improve the overall performance of the cluster, but also effectively save time and costs.