Cloud Infrastructure Considerations for Artificial Intelligence

Jun 28, 2024

The Value of Cloud Computing for AI

Generative AI deployments are becoming a top priority for many enterprises. Expanding infrastructure and resources to support those workloads may be the largest challenge IT leaders are facing. While modernizing existing on-premises data centers can be a great way to optimize space, power, and OPEX for new AI deployments, it isn’t the only option. Many organizations use cloud computing as their primary resource and need to determine the best way to leverage it for AI deployment.

Deploying new workloads like AI in the cloud can minimize CAPEX but can also lead to a significant increase in OPEX. Because of this, choosing virtual machines (VMs) with the right processor for a given workload is as important as choosing the right server platform in the data center. Newer, higher-performance VMs, though more expensive on a per-VM basis, can deliver performance-driven cost savings by doing more work per VM and lowering the total number needed to meet demand. Optimizing VM selection for existing workloads can streamline OPEX spending and allow the resulting savings to be directed toward AI deployment in the cloud.
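The trade-off between per-VM price and fleet size can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the demand, per-VM throughput, and hourly prices are hypothetical placeholders, not measured or published figures, and the function names are invented for this example.

```python
import math


def vms_needed(demand_per_sec: float, throughput_per_vm: float) -> int:
    """Smallest whole number of VMs whose combined throughput covers demand."""
    return math.ceil(demand_per_sec / throughput_per_vm)


def fleet_cost_per_hour(demand_per_sec: float,
                        throughput_per_vm: float,
                        price_per_vm_hour: float) -> float:
    """Hourly cost of the fleet required to meet the given demand."""
    return vms_needed(demand_per_sec, throughput_per_vm) * price_per_vm_hour


# Hypothetical inputs, chosen only to illustrate the arithmetic.
demand = 100_000  # requests per second the service must sustain

# Older VM type: cheaper per VM, lower throughput per VM.
older = fleet_cost_per_hour(demand, throughput_per_vm=10_000, price_per_vm_hour=0.80)

# Newer VM type: pricier per VM, but 1.4x the throughput per VM.
newer = fleet_cost_per_hour(demand, throughput_per_vm=14_000, price_per_vm_hour=0.90)

print(f"older fleet: {vms_needed(demand, 10_000)} VMs, ${older:.2f}/hour")
print(f"newer fleet: {vms_needed(demand, 14_000)} VMs, ${newer:.2f}/hour")
```

With these made-up inputs the newer fleet needs 8 VMs instead of 10 and costs less per hour despite the higher per-VM price. Real results depend on measured throughput for the actual workload and on current provider pricing.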

Organizations looking to implement AI in their data centers can also benefit from cloud solutions. With the widespread demand for AI applications, acquiring the physical hardware to scale on-premises AI resources can be difficult, expensive, and prone to supply-chain delays. Organizations that use cloud computing may have an easier time getting AI projects up and running quickly, even if they intend to run AI in the data center long term. This is especially true when working with the largest cloud providers, which often get quicker access to new hardware at scale.

Also, organizations with a hybrid-cloud model, running compute-intensive workloads on-premises while running general purpose applications in the cloud, may alleviate the on-premises computing burden by shifting more work to the cloud.

Challenges in the Cloud

Deploying AI in the cloud can seem simple, but a successful deployment depends on some important considerations. Cloud providers offer hundreds of different VMs with varying CPU and memory configurations, each with strengths and weaknesses. Finding the right VM for AI workloads is critical, but so is finding the best VM for pre-existing enterprise workloads, like data management for AI training/inference. This holistic approach allows cloud-based organizations to get the most from their cloud footprint across all aspects of their business.

Often, this means moving from older VM types, based on previous generations of CPU, to newer instances that maximize performance-driven cost efficiency. Beyond this, it also means identifying the volume of work for each key workload, AI and non-AI, and finding the right balance of different instances to optimally meet those needs. For example, teams may find that general purpose instances featuring AMD EPYC™ 9004 series processors are ideal for handling administrative workloads while minimizing the total number of VMs utilized. Meanwhile, GPU-accelerated instances can be better suited to training AI models, and memory-optimized instances perform best for running the high-performance databases that underpin any successful AI deployment. Without taking this approach, organizations can run into operational inefficiencies and spend more to do less.

Choosing the Right Cloud VM for AI

GCP C3D Standard-16 vs GCP N2 Standard-16

Web/App Tier
Workload           Performance uplift               Savings
NGINX              ~1.2x Requests/Sec               23%
Server-side Java   ~1.7x Max Ops, multi-instance    45%
FFmpeg             ~1.4x Frames/Sec                 32%

Data Tier
Workload           Performance uplift               Savings
MySQL              ~1.2x Transactions/Min           22%
Redis              ~1.4x Requests/Sec               32%

In many situations, choosing the lowest-priced VMs that meet performance requirements can end up costing more than choosing VMs running on the latest CPUs. Instead, the right cloud VM is often the one with the best performance per dollar for the specific workload. In this regard, AI applications should be no different from general purpose, high-performance computing, or big data analytics workloads. For example, customers running popular cloud workloads in Google Cloud can opt for the C3D VM series, built with AMD EPYC™ 9004 processors. These new VMs provide a 37% performance uplift on average across a selection of popular workloads when compared to the previous generation competitive offering.¹
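As a rough sanity check, the five approximate multipliers reported for the C3D versus N2 comparison can be averaged directly. The sketch below assumes a simple arithmetic mean, which the article does not confirm as the method behind the quoted average, and it uses the rounded "~" values, so it can only approximate the published figure.

```python
from statistics import mean

# Approximate multipliers as reported for C3D Standard-16 vs N2 Standard-16.
# Each value is rounded in the source (shown with a "~" prefix).
uplift = {
    "NGINX": 1.2,
    "Server-side Java": 1.7,
    "FFmpeg": 1.4,
    "MySQL": 1.2,
    "Redis": 1.4,
}

# Assumption: arithmetic mean; the averaging method is not stated in the source.
average_multiplier = mean(uplift.values())
average_uplift_pct = (average_multiplier - 1) * 100

print(f"average multiplier: {average_multiplier:.2f}x")
print(f"average uplift: {average_uplift_pct:.0f}%")
```

The rounded inputs give roughly 1.38x, or about 38%, which is in line with the 37% quoted once rounding of the individual results is taken into account.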

When considering cloud computing OPEX, understanding the cost associated with each ‘unit of work’ is critical to choosing the right VMs. With a performance uplift, each VM purchased does more work. This can drive down costs, shrink the VM pool, and free up resources to expand AI projects in the cloud.
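One way to make "cost per unit of work" concrete is to divide a VM's hourly price by the work it completes in an hour, then compare two VM types. The sketch below is illustrative only: the function names are invented for this example, and the prices and throughput figures are hypothetical rather than taken from any provider's price list.

```python
def cost_per_unit(price_per_vm_hour: float, units_per_vm_hour: float) -> float:
    """Price paid for one unit of work (a request, transaction, frame, ...)."""
    return price_per_vm_hour / units_per_vm_hour


def opex_savings(speedup: float, price_ratio: float) -> float:
    """Fractional saving per unit of work when moving to a new VM type.

    speedup     = new throughput / old throughput
    price_ratio = new hourly price / old hourly price
    """
    return 1.0 - price_ratio / speedup


# Hypothetical example: the new VM costs 10% more per hour and does 1.4x the work.
old_cost = cost_per_unit(price_per_vm_hour=1.00, units_per_vm_hour=1000)
new_cost = cost_per_unit(price_per_vm_hour=1.10, units_per_vm_hour=1400)

savings = opex_savings(speedup=1.4, price_ratio=1.10)

print(f"old cost per unit: {old_cost:.6f}")
print(f"new cost per unit: {new_cost:.6f}")
print(f"savings per unit of work: {savings:.1%}")
```

The same relationship shows why a VM with a higher hourly price can still lower spending: savings stay positive whenever the speedup exceeds the price ratio.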

By choosing cloud VMs built with AMD EPYC™ processors, IT leaders can give their teams fantastic value and create more opportunities for AI development and deployment in the cloud.

To learn about how enterprise organizations can approach AI in the cloud, read this article from the Wall Street Journal.

To learn more about how AMD is optimizing cloud computing, visit the Cloud Computing Page.

Article By

Ram Peddibhotla

Vice President and General Manager, Cloud Business, AMD

SP5C-006: MySQL™, Redis®, NGINX®, server-side Java multi-instances, and FFmpeg™ comparison of Google Cloud C3D-standard 16 vCPU to N2-standard 16 vCPU based on AMD testing on 11/02/23. Cloud OPEX savings calculated based on on-demand pricing at https://cloud.google.com/compute/vm-instance-pricing for us-central1 (Iowa) as of 11/01/2023. Configurations both with 64GB running Ubuntu 22.04.3 LTS.
Comparisons:
MySQL 8.0.28 HammerDB 4.2 TPROC-C: (~1.2x tpm, 22% Cloud OPEX savings),
Redis 7.2 get/set: (~1.4x rps, 32% Cloud OPEX savings),
NGINX 1.1.9-2 WRK 4.2: (~1.2x ops/sec, 23% Cloud OPEX savings),
server-side Java® multi instances max-OPS: (~1.7x OPS, 45% Cloud OPEX savings),
FFmpeg 4.4.2.0 Ubuntu 22.04.1 h264-vp9, raw_h264, raw_vp9, vp9_h264 at 1080p: (~1.4x frames/hr, 32% Cloud OPEX savings).
Cloud performance results presented are based on the test date in the configuration. Results may vary due to changes to the underlying configuration, and other conditions such as the placement of the VM and its resources, optimizations by the cloud service provider, accessed cloud regions, co-tenants, and the types of other workloads exercised at the same time on the system.
