The vision of this project is to address our limited understanding of stragglers, a manifestation of a complex transient failure type within large-scale distributed systems, and to conduct in-depth analysis and modelling to quantify the precise relationship between straggler occurrence and system behaviour. The study will analyse and model stragglers within real systems through comprehensive experimentation, identifying and extracting key system parameters from virtual and physical sub-system operation across the entire distributed system architecture. This will determine the ‘perfect storm’ of system behaviour that causes this complex failure type to occur.
By working with leading international industrialists who operate large Cloud datacenters, this work represents a significant step change towards solving the straggler problem by providing much sought-after knowledge of straggler behaviour. As this problem is systemic across every type of large-scale distributed system, this work will have far-reaching implications for both academia and industry, and will directly benefit the competitiveness of the UK's digital economy in both the short and long term. This grant represents the first step towards realizing the research ambition of scientifically understanding the operation of massive-scale Internet infrastructure, enabling the design of fault-tolerant techniques for future systems at unprecedented scale, a crucial objective towards realizing key emergent technologies for the future.
This project is funded by the EPSRC, and includes project partners from Microsoft Research, STFC Rutherford, and CIATEQ.
Peter Garraghan (PI, Lancaster)