This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI’s impact.

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for both enterprises running large AI workloads and AI clouds, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
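The three scenarios above can be read as a simple routing rule from an observed event to a migration type. The sketch below is illustrative only: the event names, the `MigrationKind` enum, and the `classify` function are hypothetical and are not part of any TorchPass or FleetIQ API described in this release.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # sudden hard failure
    PREEMPTIVE = "pre-emptive"   # early warning before a hard failure
    PLANNED = "planned"          # operator-initiated work


# Hypothetical event names, grouped the way the list above groups them.
HARD_FAULTS = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_REQUESTS = {"maintenance", "patching", "rebalancing"}


def classify(event: str) -> MigrationKind:
    """Map an event name to the migration scenario it would fall under."""
    if event in HARD_FAULTS:
        return MigrationKind.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationKind.PREEMPTIVE
    if event in OPERATOR_REQUESTS:
        return MigrationKind.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


if __name__ == "__main__":
    for name in ("gpu_fault", "ecc_memory_error", "patching"):
        print(name, "->", classify(name).value)
```

A real system would presumably weigh signals over time rather than match on a single event name; the point here is only the three-way split the release describes.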

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
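As a rough consistency check, the figures quoted in this release can be combined in a few lines of arithmetic. The sketch below assumes that daily lost time is approximately the number of failures per day multiplied by the time lost per failure; the release does not state how its numbers were derived, so this is one plausible reading rather than the company's method. The function names are illustrative.

```python
HOURS_PER_DAY = 24.0


def failures_per_day(mttf_hours: float) -> float:
    """Expected failures per day for a given mean time to failure."""
    return HOURS_PER_DAY / mttf_hours


def lost_minutes_per_day(mttf_hours: float, minutes_per_failure: float) -> float:
    """Assumed model: lost time = failures per day x minutes lost per failure."""
    return failures_per_day(mttf_hours) * minutes_per_failure


def reduction_percent(before_minutes: float, after_minutes: float) -> float:
    """Percentage reduction going from `before_minutes` to `after_minutes`."""
    return 100.0 * (before_minutes - after_minutes) / before_minutes


# Figures quoted in the release.
MTTF_1024_GPUS_HOURS = 7.9       # mean time to failure at 1,024 GPUs
RECOVERY_MINUTES = 3.0           # approximate recovery time per failure
BASELINE_LOST_MINUTES = 180.0    # approximately three hours per day

daily_failures = failures_per_day(MTTF_1024_GPUS_HOURS)
with_migration = lost_minutes_per_day(MTTF_1024_GPUS_HOURS, RECOVERY_MINUTES)
implied_restart_cost = BASELINE_LOST_MINUTES / daily_failures

print(f"failures per day:           {daily_failures:.2f}")
print(f"lost minutes with recovery: {with_migration:.1f}")
print(f"implied minutes per restart: {implied_restart_cost:.1f}")
print(f"reduction: {reduction_percent(BASELINE_LOST_MINUTES, with_migration):.1f}%")
```

Under this assumption, roughly three failures per day at about three minutes each comes to a little over nine minutes, which lines up with the "under ten minutes" and 95% figures, and the three-hour baseline implies about an hour lost per checkpoint restart.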

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

“When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

“Using TorchPass also has a downstream effect: it gives users an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
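For readers unfamiliar with the MFU metric cited above, it is commonly computed as the model FLOPs actually achieved per second divided by the hardware's aggregate peak. The sketch below uses the widely used approximation of about 6 FLOPs per active parameter per token for training, which ignores attention cost. All input values are hypothetical placeholders, not results from the SemiAnalysis test, and the peak throughput constant is an assumption that should be replaced with the figure from the relevant hardware datasheet.

```python
def training_flops_per_token(active_params: float) -> float:
    """Approximate training FLOPs per token: ~6 per active parameter
    (forward plus backward pass), ignoring attention FLOPs."""
    return 6.0 * active_params


def model_flops_utilization(
    tokens_per_second: float,
    flops_per_token: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """MFU = achieved model FLOPs per second / aggregate peak FLOPs per second."""
    achieved = tokens_per_second * flops_per_token
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical inputs for illustration only.
ASSUMED_ACTIVE_PARAMS = 5.0e9         # placeholder, not a measured model property
ASSUMED_TOKENS_PER_SECOND = 4.0e5     # placeholder cluster-wide throughput
ASSUMED_PEAK_FLOPS_PER_GPU = 1.0e15   # placeholder; use the datasheet value
NUM_GPUS = 64                         # cluster size mentioned in the quote

mfu = model_flops_utilization(
    ASSUMED_TOKENS_PER_SECOND,
    training_flops_per_token(ASSUMED_ACTIVE_PARAMS),
    NUM_GPUS,
    ASSUMED_PEAK_FLOPS_PER_GPU,
)
print(f"MFU: {mfu:.1%}")
```

Because time spent stalled or restarting contributes zero achieved FLOPs, any recovery scheme that shortens stalls raises average MFU over the run and shortens job completion time for the same amount of training.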

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

