A mathematical precipice for scalable computing?

Written by David Standingford PhD CEng FIET FIMA MAustMS, lead technologist, CFMS.

To date, much of the rhetoric from the modelling and simulation community assumes that computing issues are “handled by someone else” and that we can assume ever faster processing, large and homogeneous storage plus constant (and low) memory latency. 

Computers are actually quite unlike this. Perhaps we could cast the hierarchical properties of a large computing system as a positive rather than a strictly negative feature? Latency, for example, is used to great effect in electromagnetic delay lines, and was even exploited in early computer memory systems. Might there be classes of solution algorithm that benefit from asynchronous information transfer, or could there be a trade-off in terms of cost and robustness when scalability increases the certainty of component failure?

I proposed the topic above a couple of weeks ago at the European Study Group for Industrial Mathematics in Industry, organised by the Smith Institute with The KTN at Manchester University and sponsored by Innovate UK. At this event, industrial organisations propose challenges from their own sectors to a large, general audience of professional and student mathematicians. Audience members then choose which challenge they want to take on for the rest of the week, and actively contribute. Debate, discussion and lots of writing on the board are all strongly encouraged!

The topic that I proposed was the notion that the mathematics underpinning advanced modelling and simulation doesn’t really acknowledge modern computing architecture, and that there’s a presumption about the way that computers work in the minds of people creating mathematical algorithms, which is no longer appropriate.

There is a common presumption among mathematical algorithm developers that a high performance computer is a very large, fast, mathematical processing box with an ever-increasing amount of memory on each of its computational nodes, all connected together with a super-fast network. The processing speed, the available memory and the network latency and bandwidth are all understood to be improved year-on-year by hardware manufacturers towards an ideal platform model that will meet the needs of the developer without much extra work to achieve high performance at extreme scale, possibly with some clever compilers. Current levels of latency, low bandwidth, memory cache sizes and processing speed are seen as temporary blockers to using the full power of computing, and preferably somebody else’s responsibility!

If you think about it, the amount of memory on a computer, its processing power, and its network bandwidth and latency are actually features of the computing space that you need to consider when you’re thinking about an algorithm. We established a basic test case and a top-level description of how a modern High Performance Computer (HPC) works, so that we could find a mapping between the two and stimulate the community to create a kind of new mathematics for describing that relationship.

My view is that we probably need to accept - within a mathematical framework - that 100 percent reliability is going to be unachievable (or at least unaffordable) as an HPC system is scaled up towards exascale, and that the mathematics therefore needs to be tolerant of failure. Messages will be lost, nodes will fail and data will become corrupted - yet the system should soldier on.
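As a toy illustration of what "tolerant" mathematics might look like, here is a minimal Python sketch of a Jacobi-style relaxation in which a fraction of the component updates are simply lost at each sweep, so the affected unknowns carry stale values forward. It is not the test case from the study group: the function name `lossy_jacobi`, the `drop_prob` parameter and the small 3x3 system are all assumptions made for this example. For a strictly diagonally dominant matrix such as this one, the iteration still settles on the right answer despite the losses.

```python
import random


def lossy_jacobi(A, b, drop_prob=0.3, sweeps=200, seed=0):
    """Jacobi-style relaxation where each component update may be lost.

    A lost update leaves the old (stale) value in place, mimicking a dropped
    message or a temporarily failed node. Illustrative sketch only.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        new_x = list(x)
        for i in range(n):
            if rng.random() < drop_prob:
                continue  # update lost: component i keeps its stale value
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_x[i] = (b[i] - off_diag) / A[i][i]
        x = new_x
    return x


# Hypothetical strictly diagonally dominant system; exact solution is [1, 2, 3].
A = [[10.0, 1.0, 2.0],
     [1.0, 12.0, 3.0],
     [2.0, 1.0, 9.0]]
b = [18.0, 34.0, 31.0]

solution = lossy_jacobi(A, b)
print([round(v, 6) for v in solution])
```

The price of tolerance here is extra sweeps rather than a crash and restart, which is the kind of cost-versus-robustness trade-off the question is about.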

There’s always going to be a trade-off between performance and cost. Traditionally in the HPC world of modelling and simulation, we take it for granted that higher performance and greater reliability - even at relatively high cost - are fundamental to progress.

However, our colleagues in the “big data” community generally accept (arguably because their problems are very highly parallel) that they can utilise cheaper, commodity hardware to achieve vast amounts of processing power.  If a piece of information isn’t available or is corrupted, it simply gets recalculated somewhere else and the whole system moves forward.
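The "recalculate it somewhere else" pattern can be sketched in a few lines of Python. This is a toy under stated assumptions, not any particular big data framework: `run_with_recompute`, `fail_prob` and `max_attempts` are hypothetical names, and a worker failure is simulated with a random draw. It works because the tasks are independent, so losing one result costs only that one task's rerun.

```python
import random


def run_with_recompute(tasks, fail_prob=0.2, max_attempts=20, seed=1):
    """Run independent tasks, recomputing any whose result is lost.

    `tasks` maps a key to a zero-argument function. A simulated failure
    discards that attempt only; every other result is unaffected.
    """
    rng = random.Random(seed)
    results = {}
    attempts = 0
    for key, fn in tasks.items():
        for _ in range(max_attempts):
            attempts += 1
            if rng.random() < fail_prob:
                continue  # simulated worker failure: try this task again
            results[key] = fn()
            break
        else:
            raise RuntimeError(f"task {key!r} failed {max_attempts} times")
    return results, attempts


# Eight independent pieces of work (squares stand in for real computation).
tasks = {i: (lambda i=i: i * i) for i in range(8)}
results, attempts = run_with_recompute(tasks)
print(results, "attempts:", attempts)
```

A tightly coupled simulation does not decompose this way, because each step depends on the results of its neighbours.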

This isn’t generally the case with large simulation tasks. The mathematics underpinning simulation has been so highly tuned that algorithms are intolerant of data delays and errors. If something falls over, the whole simulation crashes and it needs to be restarted, often at great cost. This is why it could be an exciting time for mathematics. We’re possibly at a computational precipice for existing methods and there are opportunities for researchers out there to push the boundaries in new and disruptive ways.
