Cloud storage for AI: Options, pros and cons


IT architects tasked with the design of storage systems for artificial intelligence (AI) need to balance capacity, performance and cost.

AI systems, especially those based on large language models (LLMs), consume vast amounts of data. In fact, LLMs or generative AI (GenAI) models often work better the more data they have. The training phase of AI in particular is very data hungry.

The inference phase of AI, however, needs high performance if systems are not to feel unresponsive or fail to work at all. That means high throughput and low latency.

So, a key question is: to what extent can we use a mix of on-premise and cloud storage? On-premise storage brings higher performance and greater security. Cloud storage offers the ability to scale, lower costs and, potentially, better integration with cloud-based AI models and cloud data sources.

In this article, we look at the pros and cons of each and how best to optimise them for storage for AI.

AI storage: On-premise vs cloud?

Enterprises typically look to on-premise storage for the best speed, performance and security – and AI workloads are no exception. Local storage can also be easier to fine-tune to the needs of AI models, and will likely suffer less from network bottlenecks.

Then there are the advantages of keeping AI models close to source data. For enterprise applications, this is often a relational database that runs on block storage.

As a result, systems designers need to consider the impact of AI on the performance of a system of record. The business will not want key packages such as ERP or CRM slowed down because they also feed data into an AI system. There are also strong security, privacy and compliance reasons for keeping core data records on site rather than moving them to the cloud.

Even so, cloud storage also offers advantages for AI projects. Cloud storage is easy to scale, and customers only pay for what they use. For some AI use cases, source data will already be in the cloud – in a data lake or a cloud-based SaaS application, for example.

Cloud storage is largely based around object storage, which is well suited to the unstructured data that makes up the bulk of information consumed by large language models.

At the same time, the growth of storage systems that can run object storage on-premise makes it easier for enterprises to have a single storage layer – even a single global namespace – to serve on-premise and cloud infrastructure, including AI. This is especially relevant for firms that expect to move workloads between local and cloud infrastructure, or operate “hybrid” systems.

AI storage and cloud options

Cloud storage is often the first choice for enterprises that want to run AI proofs-of-concept (PoCs). It removes the need for up-front capital investment and can be spun down at the end of the project.

In other cases, firms have designed AI systems to “burst” from the datacentre to the cloud. This makes use of public cloud resources for compute and storage to cover peaks in demand. Bursting is most effective for AI projects with relatively short peak workloads, such as those that run on a seasonal business cycle.

But the arrival of generative AI based on large language models has tipped the balance more towards cloud storage simply because of the data volumes involved.

At the same time, cloud providers now offer a wider range of dedicated data storage options focused on AI workloads. This includes storage provision tailored to different stages of an AI workload, namely: prepare, train, serve and archive.

As Google’s engineers put it: “Each stage in the ML [machine learning] lifecycle has different storage requirements. For example, when you upload the training dataset, you might prioritise storage capacity for training and high throughput for large datasets. Similarly, the training, tuning, serving and archiving stages have different requirements.”

Although this is written for Google Cloud Platform, the same principles apply to Microsoft Azure and Amazon Web Services. All three hyperscalers, plus vendors such as IBM and Oracle, offer cloud-based storage suitable for the bulk storage requirements of AI. For the most part, unstructured data used by AI, including source material and training data, will likely be held in object storage.

This could be AWS S3, Azure Blob Storage, or Google Cloud’s Cloud Storage. In addition, third-party software platforms, such as NetApp’s ONTAP, are also available from the hyperscalers and can improve data portability between cloud and on-premise operations.
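As a rough illustration, the sketch below uses the boto3 SDK for AWS S3 – the other object stores have equivalent client libraries – to stage a training corpus in the standard tier and write an archived checkpoint straight to a colder storage class. The bucket, file names and choice of storage class are illustrative assumptions, not recommendations.

    import boto3  # AWS SDK for Python; Azure and Google Cloud offer equivalent clients

    s3 = boto3.client("s3")
    BUCKET = "example-ai-training-data"  # placeholder bucket name

    # Bulk, unstructured training data goes to the default (standard) tier,
    # which favours capacity and throughput for the prepare/train stages.
    s3.upload_file("corpus/shard-0001.parquet", BUCKET, "training/shard-0001.parquet")

    # Archived outputs, such as old checkpoints, can be written straight to a
    # colder, cheaper storage class for the archive stage of the lifecycle.
    with open("checkpoints/epoch-10.pt", "rb") as f:
        s3.put_object(
            Bucket=BUCKET,
            Key="archive/checkpoint-epoch-10.pt",
            Body=f,
            StorageClass="GLACIER",
        )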

For the production, or inference, stage of AI operations, the choices are often even more complex. IT architects can specify NVMe and SSD storage with different performance tiers for critical parts of the AI workflow. Older “spinning disk” storage remains on offer for tasks such as initial data ingest and preparation, or for archiving AI system outputs.
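To give a hedged sense of how that tiering looks in practice, the sketch below uses boto3 to provision two AWS EBS volumes: a provisioned-IOPS SSD volume for the latency-sensitive inference path, and a throughput-optimised hard-disk volume for ingest or archive. The sizes, IOPS figure and availability zone are placeholder values.

    import boto3  # AWS SDK; other clouds expose similar disk-type choices

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Fast SSD tier: provisioned-IOPS volume for the inference data path.
    ec2.create_volume(
        AvailabilityZone="eu-west-1a",   # placeholder zone
        Size=500,                        # GiB
        VolumeType="io2",                # provisioned-IOPS SSD
        Iops=16000,
    )

    # Cheaper "spinning disk" tier: throughput-optimised HDD for ingest or archive.
    ec2.create_volume(
        AvailabilityZone="eu-west-1a",
        Size=2048,                       # GiB
        VolumeType="st1",                # throughput-optimised HDD
    )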

This type of storage is also application neutral: IT architects can specify their performance parameters and budget for AI as they can for any other workload. But a new generation of cloud storage is designed from the ground up for AI.

Advanced cloud storage for AI

The specific demands of AI have prompted storage vendors to design dedicated infrastructure that avoids bottlenecks in AI workflows – bottlenecks found in on-premise systems but also in the cloud. Two approaches are key here: parallelism and direct GPU memory access.

Parallelism allows storage systems to handle what storage supplier Cloudian describes as “the concurrent data requests characteristic of AI and ML workloads”. By handling multiple data streams in parallel, the storage system speeds up both model training and inference.

An example here is Google’s Parallelstore, which launched last year to provide a managed parallel file storage service aimed at intensive input/output for artificial intelligence applications.
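At the application level, the same principle shows up as issuing many reads concurrently rather than one at a time. The minimal sketch below, using boto3 and a thread pool with placeholder bucket and object names, overlaps object fetches in flight; a managed parallel file system such as Parallelstore does the equivalent work transparently and at far greater scale.

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-ai-training-data"                     # placeholder bucket
    keys = [f"training/shard-{i:04d}.parquet" for i in range(64)]

    def fetch(key):
        # Each worker issues its own GET, so requests overlap in flight.
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

    # 16 concurrent readers keep the data pipeline busy instead of
    # waiting on one object at a time.
    with ThreadPoolExecutor(max_workers=16) as pool:
        shards = list(pool.map(fetch, keys))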

Direct GPU memory access, meanwhile, sets out to remove bottlenecks between the storage cache and GPUs, which are expensive and can be scarce. According to John Woolley, chief commercial officer at vendor Insurgo Media, storage must deliver at least 10GBps of sustained throughput to prevent “GPU starvation”.
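A back-of-the-envelope calculation shows why a figure like that matters. The sketch below uses purely illustrative, assumed numbers for batch size and training speed to estimate the sustained read bandwidth a GPU node would need from storage to avoid sitting idle.

    # Rough sizing sketch with illustrative, assumed numbers
    gpus = 8                      # GPUs in the node
    batch_gb_per_gpu = 0.5        # data read per GPU per training step (GB)
    steps_per_second = 3          # training steps each GPU completes per second

    required_gbps = gpus * batch_gb_per_gpu * steps_per_second
    print(f"Sustained read bandwidth needed: {required_gbps:.0f} GBps")
    # -> 12 GBps with these assumptions; fall much below roughly 10GBps and
    #    the GPUs start to sit idle waiting on data ("GPU starvation")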

Protocols such as GPUDirect – developed by Nvidia – allow GPUs to access NVMe drive memory directly, much as RDMA allows direct access between systems without involving the CPU or the operating system. The approach also goes by the name Direct GPU Support (DGS).
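In Python, one route to this is Nvidia’s cuFile/GPUDirect Storage interface as exposed by the KvikIO library. The sketch below – assuming a CUDA-capable system with KvikIO and CuPy installed, and using a placeholder file path – reads data from an NVMe-backed file straight into a GPU buffer rather than staging it in host memory first.

    import cupy
    import kvikio  # Python bindings for Nvidia's cuFile / GPUDirect Storage

    # Allocate the destination buffer directly in GPU memory.
    buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)

    # Read from the NVMe-backed file into GPU memory; with GPUDirect Storage
    # enabled, the transfer bypasses the CPU bounce buffer.
    f = kvikio.CuFile("/data/shard-0001.bin", "r")  # placeholder path
    f.read(buf)
    f.close()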

Local cache layers between the GPU and shared storage can use block storage on NVMe SSDs to provide “bandwidth saturation” to each GPU, at 60GBps or more. As a result, cloud suppliers plan a new generation of SSDs optimised for DGS and likely to be based on SLC (single-level cell) NAND.

“Inference workloads require a combination of traditional enterprise bulk storage and AI-optimised DGS storage,” says Sebastien Jean, CTO at Phison US, a maker of NAND flash controllers and SSDs. “The new GPU-centric workload requires small I/O access and very low latency.”

As a result, the market is likely to see more AI-optimised storage systems, including those with Nvidia DGX BasePod and SuperPod certification, and AI integration.

Options include Nutanix Enterprise AI, Pure Storage’s Evergreen//One for AI, Dell PowerScale, Vast’s Vast Data Platform, cloud hybrid NAS provider Weka, and offerings from HPE, Hitachi Vantara, IBM and NetApp.


