Computer scientists have long known about a phenomenon in cloud datacentres called ‘stragglers’ that significantly slows down tasks.
However, for the first time, research conducted by Lancaster University, the University of Leeds and Beihang University has measured the true cause and impact of stragglers, and offers a way of detecting them at an early stage in order to reduce their disruption.
Huge modern cloud computing datacentres break tasks down into small pieces in order to crunch the information more quickly across thousands of server nodes. However, a small number of these pieces can take much longer to process (hence the term stragglers), slowing down the whole task.
Stragglers can delay cloud-based tasks by many minutes, frustrating users and producing expensive, energy-intensive inefficiencies at giant datacentres. As datacentres grow larger, so does the problem of stragglers.
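To illustrate why a handful of slow pieces can hold up an entire task, here is a minimal sketch in Python. It is not the method used in the research: the function names, the sample durations and the cut-off of 1.5 times the median duration are all assumptions chosen purely for illustration.

```python
from statistics import median


def job_completion_time(task_durations):
    """A parallel job finishes only when its slowest piece finishes."""
    return max(task_durations)


def find_stragglers(task_durations, threshold=1.5):
    """Return the indices of pieces slower than threshold x the median.

    The 1.5x-median cut-off is an illustrative assumption, not the
    definition used in the research described in this article.
    """
    cutoff = threshold * median(task_durations)
    return [i for i, d in enumerate(task_durations) if d > cutoff]


# Hypothetical durations in seconds: 19 normal pieces and one straggler.
durations = [10.0] * 19 + [60.0]

print(job_completion_time(durations))  # 60.0
print(find_stragglers(durations))      # [19]
```

In this made-up case, 19 of the 20 pieces finish in 10 seconds, yet the task as a whole takes 60 seconds because it has to wait for the single slow piece.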
Stragglers could also hold back the full potential of emerging technologies such as driverless vehicles that would rely on cloud connectivity for real-time navigation and collision avoidance.
This research is an important step towards eliminating the straggler problem, leading to faster and more reliable cloud services.
Dr Peter Garraghan, Lecturer in Distributed Systems at Lancaster University’s School of Computing and Communications, said: “Researchers looked at two large-scale commercial cloud datacentres – Google’s cluster and a China-based centre. Their results show that four to six per cent of stragglers affect up to half (37-49 per cent) of total jobs, delaying them by up to 14 minutes. They also found that more than half (53 per cent) of stragglers were caused by overloaded processors and servers.”
In addition, the researchers found that, through a combination of offline and online analytics, they are able to detect 95 per cent of stragglers when small tasks are only 11 per cent complete – enabling mitigation techniques to be triggered that speed up the tasks.
The work is presented in the paper ‘Straggler root-cause and impact analysis for massive-scale virtualised cloud datacentres’, which has been published in IEEE Transactions on Services Computing, Special Issue on Virtualisation and Services.
The paper’s authors are Peter Garraghan (Lancaster University); Jie Xu, Xue Ouyang and David McKee (University of Leeds); and Renyu Yang (Beihang University).