Parallel Solution of Partial Differential Equations
Parallelism in Time
Recently, several new methods have been designed to exploit “time-parallelism” in a problem. These algorithms are unified in the sense that they typically first construct a low-order approximation to the desired solution in time, which can then be concurrently refined via high-order methods. Three well-known algorithms are parareal, PFASST, and RIDC.
Of the three algorithms, parareal can be considered the most mature, with the seminal work presented in 2001 by Lions, Maday, and Turinici. The parareal approach employs a fast, coarse solver, \(G\), in serial to compute an initial solution. It then partitions the time domain into slices and applies a high-resolution solver, \(F\), concurrently on each slice.
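A minimal sketch of this structure in Python follows, assuming a scalar test problem \(y' = \lambda y\), a single forward-Euler step for the coarse propagator \(G\), and many forward-Euler substeps standing in for the expensive fine propagator \(F\); the function names and parameters are illustrative choices, not taken from any parallel-in-time library.

```python
import numpy as np

def coarse(y, t0, t1):
    """Coarse propagator G: a single forward-Euler step (illustrative)."""
    lam = -1.0          # assumed test problem y' = lam * y
    return y + (t1 - t0) * lam * y

def fine(y, t0, t1, substeps=100):
    """Fine propagator F: many forward-Euler substeps, standing in for the
    expensive, high-resolution solver that parareal runs concurrently."""
    lam = -1.0
    dt = (t1 - t0) / substeps
    for _ in range(substeps):
        y = y + dt * lam * y
    return y

def parareal(y0, t_slices, iterations=5):
    """Parareal: serial coarse prediction, then the correction
    U[n+1] = G(U[n]) + F_prev[n] - G_prev[n].  The F evaluations inside each
    iteration are independent across slices and could run in parallel."""
    N = len(t_slices) - 1
    U = np.empty(N + 1)
    U[0] = y0
    # initial serial coarse sweep
    for n in range(N):
        U[n + 1] = coarse(U[n], t_slices[n], t_slices[n + 1])
    for _ in range(iterations):
        F_prev = [fine(U[n], t_slices[n], t_slices[n + 1]) for n in range(N)]   # parallelizable
        G_prev = [coarse(U[n], t_slices[n], t_slices[n + 1]) for n in range(N)]
        for n in range(N):  # serial correction sweep
            U[n + 1] = coarse(U[n], t_slices[n], t_slices[n + 1]) + F_prev[n] - G_prev[n]
    return U

t = np.linspace(0.0, 2.0, 11)                # ten time slices
print(parareal(1.0, t)[-1], np.exp(-2.0))    # parareal result vs exact solution
```

In a genuinely parallel implementation the two list comprehensions would be distributed over processors, one time slice per processor, while the short coarse sweeps remain serial.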
The Parallel Full Approximation Scheme in Space and Time (PFASST) can be considered a successor to parareal; it further improves the algorithm by employing spectral deferred corrections and full approximation scheme corrections for non-linear problems.
Revisionist Integral Deferred Correction (RIDC), on the other hand, focuses on the concurrency exposed in the integral deferred correction scheme by successively applying corrections to a coarse predictor, where the predictor is allowed to proceed without synchronizing feedback from the finely resolved corrector steps. In this sense, RIDC might be more accurately referred to as “parallel-in-order-accuracy”, as the concurrency of the method can be no greater than the order of the method.
Parallelism in Space
Parallel-in-time algorithms have not been widely implemented, largely because the exploitation of parallelism in space for scientific codes has been so successful. Indeed, the numerical solution of partial differential equations in parallel is so prevalent that an entire field, Domain Decomposition, has largely arisen in response to the development of this model.
The key idea of parallelism in space is to decompose the spatial domain of the problem in such a way that the problem can be solved concurrently by many processes working together in parallel. We also include in this field particle-based decompositions, where the system being modeled comprises a finite number of particles that are partitioned over the processors. When the problem being solved is hyperbolic in nature, the processors need only communicate information at the boundaries of their subdomains for every update in time. For elliptic and parabolic problems, the techniques involved can be significantly more complicated, as information can propagate with infinite speed across the entire problem domain at each time step.
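As a concrete illustration of the boundary communication described above, here is a minimal sketch in Python with NumPy that advances the 1D advection equation with an upwind update on several subdomains, each holding one ghost cell that receives its left neighbor's boundary value every step. The layout, sizes, and names are illustrative assumptions; in a distributed-memory code each subdomain would live on its own MPI rank and the ghost update would be a message exchange.

```python
import numpy as np

def upwind_step(u, c, dx, dt):
    """One upwind update for u_t + c u_x = 0 (c > 0) on a padded subdomain.
    u[0] is the ghost cell holding the left neighbor's last interior value."""
    unew = u.copy()
    unew[1:] = u[1:] - c * dt / dx * (u[1:] - u[:-1])
    return unew

# illustrative setup: a periodic domain split into 4 subdomains
nproc, n_local = 4, 50
dx = 1.0 / (nproc * n_local)
c, dt = 1.0, 0.4 * dx                      # CFL-stable time step
x = np.linspace(0.0, 1.0, nproc * n_local, endpoint=False)
full = np.exp(-200.0 * (x - 0.5) ** 2)     # initial condition: a Gaussian pulse

# each "process" owns one interior block plus one left ghost cell
subdomains = [np.empty(n_local + 1) for _ in range(nproc)]
for p in range(nproc):
    subdomains[p][1:] = full[p * n_local:(p + 1) * n_local]

for step in range(200):
    # halo exchange: every subdomain receives its left neighbor's rightmost value
    # (in an MPI code this would be a send/receive pair per boundary)
    for p in range(nproc):
        subdomains[p][0] = subdomains[(p - 1) % nproc][-1]
    # local updates, fully concurrent across subdomains
    for p in range(nproc):
        subdomains[p] = upwind_step(subdomains[p], c, dx, dt)

result = np.concatenate([s[1:] for s in subdomains])
print(result.max())   # pulse has been advected to the right
```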
Parallelism in Space and Time
A basic understanding of relativity, light cones, and causality suggests that spatial parallelism is inherently more efficient than time parallelism, particularly for hyperbolic partial differential equations, where time integration updates are always local. Still, proponents of parallelism in time point to a common situation where parallelism in time can supplement parallelism in space: the spatially strong scaling-limited regime of computation. We will discuss this further in the section on spatial strong scaling limits, but first we must review some performance models and techniques for numerical algorithms.
Performance Models and Techniques for Numerical Algorithms
Early beginnings: the Turing and Random Access Machines
Assessing the theoretical performance of an algorithm for a given problem and architecture requires a formal analytical model that is abstract enough to be applied to a wide variety of problems and architectures, yet concrete enough to provide practically useful bounds and estimates.
The earliest and most common analytical tool for algorithm analysis is the Turing machine, a hypothetical “automated machine” originally conceptualized by Alan Turing in 1936 and first introduced in 1937 in the seminal work “On Computable Numbers, with an Application to the Entscheidungsproblem” Turing1937OCNA. The four principal components of a Turing machine are the following (a minimal simulator built from them is sketched after the list):
- An infinite tape, representing the storage and communication channels of the device, divided into a set of squares which contain a symbol from some finite alphabet.
- The ‘head’, which reads what Turing referred to as the scanned symbol. The Turing machine operates on only a single symbol of the tape at a time, and this represents the symbol “in the machine”.
- A state register which stores the current status of the machine, which is one of a finite number of states.
- The transition table, which specifies the action of the Turing machine (shift left/shift right, write/erase symbol, next state) given the current state and symbol under the head.
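To make the roles of the tape, head, state register, and transition table concrete, here is a minimal Turing machine simulator in Python; the example transition table, which flips the bits of a binary word and then halts, is an illustrative choice rather than anything from the literature.

```python
def run_turing_machine(tape, transitions, state="start", blank="_", max_steps=1000):
    """Simulate a single-tape Turing machine.

    tape        -- dict mapping integer positions to symbols (the 'infinite' tape)
    transitions -- dict: (state, symbol) -> (new_symbol, move, new_state),
                   where move is -1 (shift left), 0 (stay), or +1 (shift right)
    """
    head = 0
    for _ in range(max_steps):
        symbol = tape.get(head, blank)            # the scanned symbol under the head
        if (state, symbol) not in transitions:    # no applicable rule: halt
            break
        new_symbol, move, state = transitions[(state, symbol)]
        tape[head] = new_symbol                   # write/erase the scanned square
        head += move                              # shift the head
    return tape

# Illustrative machine: walk right over a binary word, flipping 0 <-> 1.
flip = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    # no rule for the blank past the end of the word, so the machine halts there
}

tape = {i: s for i, s in enumerate("010011")}
result = run_turing_machine(tape, flip)
print("".join(result[i] for i in sorted(result)))   # -> 101100
```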
Although the Turing machine is a powerful theoretical concept that allows us to prove many fundamental results about the computability of algorithms and their general categorization, it is clearly a very abstract model of modern computers, and provides very little help in actually predicting or optimizing performance for today’s algorithms and machines.
As machines evolved, our concepts and models for them evolved as well. Register machines, which contain a finite set of integer registers that can be accessed and modified at each state, replace the Turing machine’s head and tape. The best-known of the register machine models is the Random Access Machine (RAM), based on a Harvard architecture and introduced by Cook and Reckhow in 1973.
First theory on I/O complexity (Valiant/Hopcroft), shell games
Turing’s model was followed forty years later by the work of Hopcroft, Paul, and Valiant on time versus space Hopcroft1977OTVS.
- I/O complexity bounds as a partition
- Introduction of the polyhedral model
- Tiling and loop skewing as compiler optimizations
- Introduction of memory bandwidth as a bottleneck, cache-oblivious models
- The importance of latency models in improving computational performance
- The era of motifs, microkernels, and auto-tuning
Spatial strong scaling limits
A computational code is said to be spatially strong scaling-limited for a particular problem and computer architecture when increasing the number of processors solving the problem increases, rather than decreases, the time to solution.
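As a rough illustration of how this limit arises, the following Python sketch evaluates a simple performance model (an assumed form, not measured data from any machine) in which per-processor work shrinks as \(1/P\) while a communication term grows with \(P\); past some processor count the total time to solution begins to rise again.

```python
import numpy as np

def time_to_solution(P, work=1.0, latency=1e-4, bandwidth_term=1e-4):
    """Illustrative strong-scaling model (assumed, not measured):
    compute time divides among P processors, while communication adds a
    latency term growing with log2(P) (e.g. a global reduction per step)
    and a surface-to-volume bandwidth term."""
    compute = work / P
    comm = latency * np.log2(P) + bandwidth_term / P ** (2.0 / 3.0)
    return compute + comm

procs = 2 ** np.arange(0, 16)                  # 1, 2, 4, ..., 32768 processors
times = time_to_solution(procs.astype(float))
best = procs[np.argmin(times)]
print(f"minimum time at P = {best}; beyond this, adding processors slows the run")
```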
Function call overhead
Communication limits
References
Turing, A. M., “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society, 1937.
Turing, A. M., “Correction to: On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society, 1938.
Hopcroft, J., Paul, W., and Valiant, L., “On Time Versus Space.” Journal of the ACM, 1977.