ECE/CS 6960/5960 Fundamentals of Cloud Systems
Paper List
Part I: Cloud Computing
Coded Distributed Computing (Coded MapReduce)
Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51, no. 1 (January 2008): 107-113.
Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Spark: Cluster computing with working sets.” In Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2010.
Li, Songze, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “Coded mapreduce.” In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pp. 964-971. IEEE, 2015.
Li, Songze, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and Salman Avestimehr. “Coded TeraSort.” In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pp. 389-398. IEEE, 2017. (Coded TeraSort implementations: here)
Li, Songze, Mohammad Ali Maddah-Ali, Qian Yu, and A. Salman Avestimehr. “A fundamental tradeoff between computation and communication in distributed computing.” IEEE Transactions on Information Theory 64, no. 1 (2018): 109-128.
Ji, Mingyue, Giuseppe Caire, and Andreas F. Molisch. “Fundamental limits of caching in wireless D2D networks.” IEEE Transactions on Information Theory 62, no. 2 (2016): 849-869.
Woolsey, Nicholas, Rong-Rong Chen and Mingyue Ji, “A New Combinatorial Design of Coded Distributed Computing,” 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, 2018, pp. 726-730.
Li, Songze, Qian Yu, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “A scalable framework for wireless distributed computing.” IEEE/ACM Transactions on Networking 25, no. 5 (2017): 2643-2654.
Li, Songze, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “Coding for distributed fog computing.” IEEE Communications Magazine 55, no. 4 (2017): 34-40.
Reisizadeh, Amirhossein, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. “Coded computation over heterogeneous clusters.” arXiv preprint arXiv:1701.05973 (2017).
Kiamari, Mehrdad, Chenwei Wang, and A. Salman Avestimehr. “On heterogeneous coded distributed computing.” In GLOBECOM 2017-2017 IEEE Global Communications Conference, pp. 1-7. IEEE, 2017.
Ezzeldin, Yahya H., Mohammed Karmoose, and Christina Fragouli. “Communication vs distributed computation: an alternative trade-off curve.” In Information Theory Workshop (ITW), 2017 IEEE, pp. 279-283. IEEE, 2017.
Konstantinidis, Konstantinos, and Aditya Ramamoorthy. “Leveraging Coding Techniques for Speeding up Distributed Computing.” arXiv preprint arXiv:1802.03049 (2018).
Prakash, Saurav, Amirhossein Reisizadeh, Ramtin Pedarsani, and Salman Avestimehr. “Coded Computing for Distributed Graph Analytics.” arXiv preprint arXiv:1801.05522 (2018).
Srinivasavaradhan, Sundara Rajan, Linqi Song, and Christina Fragouli. “Distributed Computing Trade-offs with Random Connectivity.” In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1281-1285. IEEE, 2018.
Song, Linqi, Sundara Rajan Srinivasavaradhan, and Christina Fragouli. “The benefit of being flexible in distributed computation.” In Information Theory Workshop (ITW), 2017 IEEE, pp. 289-293. IEEE, 2017.
Li, Songze, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “Compressed Coded Distributed Computing.” arXiv preprint arXiv:1805.01993 (2018).
Yang, Yaoqing, Matteo Interlandi, Pulkit Grover, Soummya Kar, Saeed Amizadeh, and Markus Weimer. “Coded Elastic Computing.” arXiv preprint arXiv:1812.06411 (2018).
Woolsey, Nicholas, Rong-Rong Chen, and Mingyue Ji. “Cascaded Coded Distributed Computing on Heterogeneous Networks.” arXiv preprint arXiv:1901.07670 (2019).
Straggler Mitigation via Coding
Dean, Jeffrey, and Luiz André Barroso. “The tail at scale.” Communications of the ACM 56, no. 2 (2013): 74-80.
Weinberg, Jonathan. “Job Scheduling on Parallel Systems.” In Job Scheduling Strategies for Parallel Processing. 2002.
Harchol-Balter, Mor. “Task Assignment Policies for Server Farms.” Book chapter in Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.
Zaharia, Matei, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing.” In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2-2. USENIX Association, 2012.
Ananthanarayanan, Ganesh, Ali Ghodsi, Scott Shenker, and Ion Stoica. “Effective Straggler Mitigation: Attack of the Clones.” In NSDI, vol. 13, pp. 185-198. 2013.
Wang, Da, Gauri Joshi, and Gregory Wornell. “Using straggler replication to reduce latency in large-scale parallel computing.” ACM SIGMETRICS Performance Evaluation Review 43, no. 3 (2015): 7-11.
Lee, Kangwook, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. “Speeding up distributed machine learning using codes.” IEEE Transactions on Information Theory 64, no. 3 (2018): 1514-1529.
Lee, Kangwook, Changho Suh, and Kannan Ramchandran. “High-dimensional coded matrix multiplication.” In Information Theory (ISIT), 2017 IEEE International Symposium on, pp. 2418-2422. IEEE, 2017.
Yu, Qian, Mohammad Maddah-Ali, and Salman Avestimehr. “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication.” In Advances in Neural Information Processing Systems, pp. 4403-4413. 2017.
Yu, Qian, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding.” arXiv preprint arXiv:1801.07487 (2018).
Gardner, Kristen, Mor Harchol-Balter, and Alan Scheller-Wolf. “A better model for job redundancy: Decoupling server slowdown and job size.” In Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016 IEEE 24th International Symposium on, pp. 1-10. IEEE, 2016.
Joshi, Gauri, Emina Soljanin, and Gregory Wornell. “Efficient replication of queued tasks for latency reduction in cloud systems.” In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pp. 107-114. IEEE, 2015.
Ousterhout, Kay, Patrick Wendell, Matei Zaharia, and Ion Stoica. “Sparrow: distributed, low latency scheduling.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 69-84. ACM, 2013.
Dutta, Sanghamitra, Mohammad Fahim, Farzin Haddadpour, Haewon Jeong, Viveck Cadambe, and Pulkit Grover. “On the optimal recovery threshold of coded matrix multiplication.” arXiv preprint arXiv:1801.10292 (2018).
Dutta, Sanghamitra, Viveck Cadambe, and Pulkit Grover. “Short-dot: Computing large linear transforms distributedly using coded short dot products.” In Advances In Neural Information Processing Systems, pp. 2100-2108. 2016.
Dutta, Sanghamitra, Viveck Cadambe, and Pulkit Grover. “Coded convolution for parallel and distributed computing within a deadline.” In Information Theory (ISIT), 2017 IEEE International Symposium on, pp. 2403-2407. IEEE, 2017.
Sheth, Utsav, Sanghamitra Dutta, Malhar Chaudhari, Haewon Jeong, Yaoqing Yang, Jukka Kohonen, Teemu Roos, and Pulkit Grover. “An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation.” arXiv preprint arXiv:1811.11811 (2018).
Dutta, Sanghamitra, Ziqian Bai, Haewon Jeong, Tze Meng Low, and Pulkit Grover. “A unified coded deep neural network training strategy based on generalized polydot codes.” In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1585-1589. IEEE, 2018.
Yu, Qian, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. “Coded fourier transform.” In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pp. 494-501. IEEE, 2017.
Tandon, Rashish, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis. “Gradient coding: Avoiding stragglers in distributed learning.” In International Conference on Machine Learning, pp. 3368-3376. 2017.
Raviv, Netanel, Itzhak Tamo, Rashish Tandon, and Alexandros G. Dimakis. “Gradient coding from cyclic MDS codes and expander graphs.” arXiv preprint arXiv:1707.03858 (2017).
Ye, Min, and Emmanuel Abbe. “Communication-computation efficient gradient coding.” arXiv preprint arXiv:1802.03475 (2018).
Yang, Yaoqing, Pulkit Grover, and Soummya Kar. “Coded distributed computing for inverse problems.” In Advances in Neural Information Processing Systems, pp. 709-719. 2017.
Maity, Raj Kumar, Ankit Singh Rawat, and Arya Mazumdar. “Robust gradient descent via moment encoding with LDPC codes.” arXiv preprint arXiv:1805.08327 (2018).
Severinson, Albin, Alexandre Graell i Amat, and Eirik Rosnes. “Block-diagonal and LT codes for distributed computing with straggling servers.” IEEE Transactions on Communications (2018).
Yang, Heecheol, and Jungwoo Lee. “Secure distributed computing with straggling servers using polynomial codes.” IEEE Transactions on Information Forensics and Security 14, no. 1 (2019): 141-150.
Aliasgari, Malihe, Osvaldo Simeone, and Joerg Kliewer. “Distributed and Private Coded Matrix Computation with Flexible Communication Load.” arXiv preprint arXiv:1901.07705 (2019).
Haddadpour, Farzin, Yaoqing Yang, Malhar Chaudhari, Viveck R. Cadambe, and Pulkit Grover. “Straggler-resilient and communication-efficient distributed iterative linear solver.” arXiv preprint arXiv:1806.06140 (2018).
Kiani, Shahrzad, Nuwan Ferdinand, and Stark C. Draper. “Exploitation of stragglers in coded computation.” In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1988-1992. IEEE, 2018.
Ferdinand, Nuwan, and Stark C. Draper. “Hierarchical coded computation.” In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1620-1624. IEEE, 2018.
Data Shuffling
Lee, Kangwook, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. “Speeding up distributed machine learning using codes.” IEEE Transactions on Information Theory 64, no. 3 (2018): 1514-1529.
Attia, Mohamed A., and Ravi Tandon. “Near Optimal Coded Data Shuffling for Distributed Learning.” arXiv preprint arXiv:1801.01875 (2018).
Elmahdy, Adel, and Soheil Mohajer. “On the Fundamental Limits of Coded Data Shuffling for Distributed Learning Systems.” arXiv preprint arXiv:1807.04255 (2018).
Wan, Kai, Daniela Tuninetti, Mingyue Ji, and Pablo Piantanida. “Fundamental limits of distributed data shuffling.” arXiv preprint arXiv:1807.00056 (2018).
Chung, Jichan, Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. “UberShuffle: Communication-efficient Data Shuffling for SGD via Coding Theory.”
Tolerance to Adversarial Computing Nodes
Chen, Lingjiao, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. “DRACO: Byzantine-resilient Distributed Training via Redundant Gradients.” In International Conference on Machine Learning, pp. 902-911. 2018.
Kadhe, Swanand, O. Ozan Koyluoglu, and Kannan Ramchandran. “Gradient Coding Based on Block Designs for Mitigating Adversarial Stragglers.” arXiv preprint arXiv:1904.13373 (2019).
Secure Coded Computation
Bitar, Rawad, and Salim El Rouayheb. “Staircase codes for secret sharing with optimal communication and read overheads.” IEEE Transactions on Information Theory 64, no. 2 (2018): 933-943.
Bitar, Rawad, Parimal Parag, and Salim El Rouayheb. “Minimizing latency for secure coded computing using secret sharing via staircase codes.” arXiv preprint arXiv:1802.02640 (2018).
D'Oliveira, Rafael GL, Salim El Rouayheb, and David Karpuk. “GASP Codes for Secure Distributed Matrix Multiplication.” arXiv preprint arXiv:1812.09962 (2018).
Bitar, Rawad, Yuxuan Xing, Yasaman Keshtkarjahromi, Venkat Dasari, Salim El Rouayheb, and Hulya Seferoglu. “PRAC: Private and Rateless Adaptive Coded Computation at the Edge.” (2019).
Using Efficient Redundancy to Reduce Latency and Computing Cost in Cloud Systems
Wang, Da, Gauri Joshi, and Gregory Wornell. “Using straggler replication to reduce latency in large-scale parallel computing.” ACM SIGMETRICS Performance Evaluation Review 43, no. 3 (2015): 7-11.
Joshi, Gauri, Emina Soljanin, and Gregory Wornell. “Efficient redundancy techniques for latency reduction in cloud systems.” ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS) 2, no. 2 (2017): 12.
Jiang, Zhiyuan, Sheng Zhou, Xueying Guo, and Zhisheng Niu. “Task replication for deadline-constrained vehicular cloud computing: Optimal policy, performance analysis, and implications on road traffic.” IEEE Internet of Things Journal 5, no. 1 (2018): 93-107.
Sun, Yin, C. Emre Koksal, and Ness B. Shroff. “On delay-optimal scheduling in queueing systems with replications.” arXiv preprint arXiv:1603.07322 (2016).
Aktas, Mehmet Fatih, Pei Peng, and Emina Soljanin. “Effective straggler mitigation: Which clones should attack and when?.” arXiv preprint arXiv:1710.00748 (2017).
Dutta, Sanghamitra, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. “Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD.” arXiv preprint arXiv:1803.01113 (2018).
Aktas, Mehmet Fatih, Pei Peng, and Emina Soljanin. “Straggler mitigation by delayed relaunch of tasks.” arXiv preprint arXiv:1710.00414 (2017).
van der Boor, Mark, Sem C. Borst, Johan SH van Leeuwaarden, and Debankur Mukherjee. “Scalable load balancing in networked systems: A survey of recent advances.” arXiv preprint arXiv:1806.05444 (2018).
Xu, Maotong, Sultan Alamro, Tian Lan, and Suresh Subramaniam. “Chronos: A unifying optimization framework for speculative execution of deadline-critical mapreduce jobs.” In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 718-729. IEEE, 2018.
Beaumont, Olivier, Lionel Eyraud-Dubois, and Yihong Gao. “Influence of Tasks Duration Variability on Task-Based Runtime Schedulers.” (2018).
Behrouzi-Far, Amir, and Emina Soljanin. “On the Effect of Task-to-Worker Assignment in Distributed Computing Systems with Stragglers.” In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 560-566. IEEE, 2018.
Zeng, Yun, Jian Tan, and Cathy H. Xia. “Fork and Join Queueing Networks with Heavy Tails: Scaling Dimension and Throughput Limit.” ACM SIGMETRICS Performance Evaluation Review 46, no. 1 (2019): 122-124.
Chen, Lixing, and Jie Xu. “Task Offloading and Replication for Vehicular Cloud Computing: A Multi-Armed Bandit Approach.” arXiv preprint arXiv:1812.04575 (2018).
Qiu, Zhan, Juan F. Pérez, and Peter G. Harrison. “Tackling latency via replication in distributed systems.” In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, pp. 197-208. ACM, 2016.
Joshi, Gauri. “Synergy via redundancy: Boosting service capacity with adaptive replication.” ACM SIGMETRICS Performance Evaluation Review 45, no. 2 (2018): 21-28.
Wang, Weina, Mor Harchol-Balter, Haotian Jiang, Alan Scheller-Wolf, and R. Srikant. “Delay asymptotics and bounds for multi-task parallel jobs.” ACM SIGMETRICS Performance Evaluation Review 46, no. 3 (2019): 2-7.
Kaler, Tim, Yuxiong He, and Sameh Elnikety. “Optimal Reissue Policies for Reducing Tail Latency.” In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 195-206. ACM, 2017.
Aktaş, Mehmet Fatih, and Emina Soljanin. “Heuristics for Analyzing Download Time in MDS Coded Storage Systems.” In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1929-1933. IEEE, 2018.
Zaryadov, Ivan, Andrey Kradenyh, and Anastasiya Gorbunova. “The Analysis of Cloud Computing System as a Queueing System with Several Servers and a Single Buffer.” In International Conference on Analytical and Computational Methods in Probability Theory, pp. 11-22. Springer, Cham, 2017.
Wang, Huajin, Jianhui Li, Zhihong Shen, and Yuanchun Zhou. “Approximations and Bounds for (n, k) Fork-Join Queues: A Linear Transformation Approach.” In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 422-431. IEEE, 2018.
Anderson, Sarah E., Ann Johnston, Gauri Joshi, Gretchen L. Matthews, Carolyn Mayer, and Emina Soljanin. “Service Rate Region of Content Access from Erasure Coded Storage.” In 2018 IEEE Information Theory Workshop (ITW), pp. 1-5. IEEE, 2018.
Mukherjee, Debankur. “Scalable load balancing algorithms in networked systems.” arXiv preprint arXiv:1809.02018 (2018).
Cloud Network Control
Nahir, Amir, Ariel Orda, and Danny Raz. “Resource allocation and management in cloud computing.” In Integrated Network Management (IM), 2015 IFIP/IEEE International Symposium on, pp. 1078-1084. IEEE, 2015.
Feng, Hao, Jaime Llorca, Antonia M. Tulino, and Andreas F. Molisch. “Optimal Control of Wireless Computing Networks.” IEEE Transactions on Wireless Communications 17, no. 12 (2018): 8283-8298.
Feng, Hao, Jaime Llorca, Antonia M. Tulino, and Andreas F. Molisch. “Optimal dynamic cloud network control.” IEEE/ACM Transactions on Networking (TON) 26, no. 5 (2018): 2118-2131.
Zhang, Jianan, Abhishek Sinha, Jaime Llorca, Antonia Tulino, and Eytan Modiano. “Optimal Control of Distributed Computing Networks with Mixed-Cast Traffic Flows.” arXiv preprint arXiv:1805.10527 (2018).
Wang, Chang-Heng, Jaime Llorca, Antonia M. Tulino, and Tara Javidi. “Dynamic Cloud Network Control under Reconfiguration Delay and Cost.” arXiv preprint arXiv:1802.06581 (2018).
Jiao, Lei, Antonia Maria Tulino, Jaime Llorca, Yue Jin, and Alessandra Sala. “Smoothed online resource allocation in multi-tier distributed cloud networks.” IEEE/ACM Transactions on Networking (TON) 25, no. 4 (2017): 2556-2570.
Mukherjee, Debankur. “Scalable load balancing algorithms in networked systems.” arXiv preprint arXiv:1809.02018 (2018).
Atomicity and Consistency
Cadambe, Viveck R., Nancy Lynch, Muriel Médard, and Peter Musial. “A coded shared atomic memory algorithm for message passing architectures.” Distributed Computing 30, no. 1 (2017): 49-73.
Cadambe, Viveck, Nicolas Nicolaou, Kishori M. Konwar, N. Prakash, Nancy Lynch, and Muriel Medard. “ARES: Adaptive, Reconfigurable, Erasure coded, atomic Storage.” arXiv preprint arXiv:1805.03727 (2018).
Konwar, Kishori M., N. Prakash, Nancy Lynch, and Muriel Médard. “A layered architecture for erasure-coded consistent distributed storage.” arXiv preprint arXiv:1703.01286 (2017).
Ali, Ramy E., and Viveck R. Cadambe. “Multi-version Coding for Consistent Distributed Storage of Correlated Data Updates.” arXiv preprint arXiv:1708.06042 (2017).
Ali, Ramy E., and Viveck Cadambe. “Harnessing Correlations in Distributed Erasure Coded Key-Value Stores.” arXiv preprint arXiv:1810.01527 (2018).
Wang, Zhiying, and Viveck Cadambe. “Multi-version coding in distributed storage.” In Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 871-875. IEEE, 2014.
Ali, Ramy E., Viveck Cadambe, Jaime Llorca, and Antonia Tulino. “Multi-version Coding with Side Information.” arXiv preprint arXiv:1805.04337 (2018).
Wang, Zhiying, and Viveck R. Cadambe. “Multi-Version Coding—An Information-Theoretic Perspective of Consistent Distributed Storage.” IEEE Transactions on Information Theory 64, no. 6 (2018): 4540-456
Part II: Distributed Storage Theory (not exhaustive)
Part III: Distributed Machine Learning (not exhaustive)
Synchronous (Stochastic) Gradient Descent
Yin, Dong, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. “Gradient diversity: a key ingredient for scalable distributed learning.” arXiv preprint arXiv:1706.05699 (2017).
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. “Optimization methods for large-scale machine learning.” Siam Review 60, no. 2 (2018): 223-311.
Bousquet, Olivier, and André Elisseeff. “Stability and generalization.” Journal of machine learning research 2, no. Mar (2002): 499-526.
Chen, Jianmin, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. “Revisiting distributed synchronous SGD.” arXiv preprint arXiv:1604.00981 (2016).
Cotter, Andrew, Ohad Shamir, Nati Srebro, and Karthik Sridharan. “Better mini-batch algorithms via accelerated gradient methods.” In Advances in neural information processing systems, pp. 1647-1655. 2011.
De, Soham, Abhay Yadav, David Jacobs, and Tom Goldstein. “Big batch SGD: Automated inference using adaptive batch sizes.” arXiv preprint arXiv:1610.05792 (2016).
Karimi, Hamed, Julie Nutini, and Mark Schmidt. “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795-811. Springer, Cham, 2016.
Lee, Jason D., Qihang Lin, Tengyu Ma, and Tianbao Yang. “Distributed stochastic variance reduced gradient methods by sampling extra data with replacement.” The Journal of Machine Learning Research 18, no. 1 (2017): 4404-4446.
Li, Mu, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. “Efficient mini-batch training for stochastic optimization.” In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661-670. ACM, 2014.
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent.” In Advances in Neural Information Processing Systems, pp. 5330-5340. 2017.
Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. “Parallelized stochastic gradient descent.” In Advances in neural information processing systems, pp. 2595-2603. 2010.
Dekel, Ofer, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. “Optimal distributed online prediction using mini-batches.” Journal of Machine Learning Research 13, no. Jan (2012): 165-202.
Friedlander, Michael P., and Mark Schmidt. “Hybrid deterministic-stochastic methods for data fitting.” SIAM Journal on Scientific Computing 34, no. 3 (2012): A1380-A1405.
Takáč, Martin, Avleen Singh Bijral, Peter Richtárik, and Nathan Srebro. “Mini-batch primal and dual methods for SVMs.” In International Conference on Machine Learning (ICML), pp. 1022-1030. 2013.
Jain, Prateek, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. “Parallelizing stochastic approximation through mini-batching and tail-averaging.” arXiv preprint arXiv:1610.03774 (2016).
Straggler Mitigation via Asynchronous (Stochastic) Gradient Descent
Mania, Horia, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. “Perturbed iterate analysis for asynchronous stochastic optimization.” arXiv preprint arXiv:1507.06970 (2015).
Pan, Xinghao, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, and Christopher Ré. “Cyclades: Conflict-free asynchronous machine learning.” In Advances in Neural Information Processing Systems, pp. 2568-2576. 2016.
Recht, Benjamin, Christopher Re, Stephen Wright, and Feng Niu. “Hogwild: A lock-free approach to parallelizing stochastic gradient descent.” In Advances in neural information processing systems, pp. 693-701. 2011.
Dutta, Sanghamitra, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. “Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD.” arXiv preprint arXiv:1803.01113 (2018).
Lian, Xiangru, Yijun Huang, Yuncheng Li, and Ji Liu. “Asynchronous parallel stochastic gradient for nonconvex optimization.” In Advances in Neural Information Processing Systems, pp. 2737-2745. 2015.
Zheng, Shuxin, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. “Asynchronous stochastic gradient descent with delay compensation.” In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4120-4129. JMLR. org, 2017.
Lian, Xiangru, Wei Zhang, Ce Zhang, and Ji Liu. “Asynchronous decentralized parallel stochastic gradient descent.” arXiv preprint arXiv:1710.06952 (2017).
Straggler Mitigation via Coding
Dutta, Sanghamitra, Ziqian Bai, Tze Meng Low, and Pulkit Grover. “CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors.” arXiv preprint arXiv:1903.01042 (2019).
Dutta, Sanghamitra, Ziqian Bai, Haewon Jeong, Tze Meng Low, and Pulkit Grover. “A unified coded deep neural network training strategy based on generalized polydot codes for matrix multiplication.” arXiv preprint arXiv:1811.10751 (2018).
Sheth, Utsav, Sanghamitra Dutta, Malhar Chaudhari, Haewon Jeong, Yaoqing Yang, Jukka Kohonen, Teemu Roos, and Pulkit Grover. “An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation.” In 2018 IEEE International Conference on Big Data (Big Data), pp. 1113-1120. IEEE, 2018.
So, Jinhyun, Basak Guler, A. Salman Avestimehr, and Payman Mohassel. “CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning.” arXiv preprint arXiv:1902.00641 (2019).
Li, Songze, Seyed Mohammadreza Mousavi Kalan, Qian Yu, Mahdi Soltanolkotabi, and A. Salman Avestimehr. “Polynomially coded regression: Optimal straggler mitigation via data encoding.” arXiv preprint arXiv:1805.09934 (2018).
Avestimehr, A. Salman, Seyed Mohammadreza Mousavi Kalan, and Mahdi Soltanolkotabi. “Fundamental resource trade-offs for encoded distributed optimization.” arXiv preprint arXiv:1804.00217 (2018).
Communication Bottleneck and Gradient Quantization
Tsitsiklis, John N., and Zhi-Quan Luo. “Communication complexity of convex optimization.” Journal of Complexity 3, no. 3 (1987): 231-243.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In INTERSPEECH, 2014.
Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, 2015.
Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of hogwild-style algorithms. In NIPS, 2015.
Alistarh, Dan, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. “QSGD: Communication-efficient SGD via gradient quantization and encoding.” In Advances in Neural Information Processing Systems, pp. 1709-1720. 2017.
Wen, Wei, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. “Terngrad: Ternary gradients to reduce communication in distributed deep learning.” In Advances in neural information processing systems, pp. 1509-1519. 2017.
Federated Learning
Konečný, Jakub, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. “Federated learning: Strategies for improving communication efficiency.” arXiv preprint arXiv:1610.05492 (2016).
Wang, Shiqiang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. “When edge meets learning: Adaptive control for resource-constrained distributed machine learning.” In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pp. 63-71. IEEE, 2018.
Tuor, Tiffany, Shiqiang Wang, Theodoros Salonidis, Bong Jun Ko, and Kin K. Leung. “Demo abstract: Distributed machine learning at resource-limited edge nodes.” In IEEE INFOCOM 2018-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1-2. IEEE, 2018.
Wang, Shiqiang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. “Adaptive federated learning in resource constrained edge computing systems.” IEEE Journal on Selected Areas in Communications (2019). Available: https://arxiv.org/abs/1804.05271
Tuor, Tiffany, Shiqiang Wang, Kin K. Leung, and Kevin Chan. “Distributed machine learning in coalition environments: overview of techniques.” In 2018 21st International Conference on Information Fusion (FUSION), pp. 814-821. IEEE, 2018.
Yousefpour, Ashkan, Caleb Fung, Tam Nguyen, Krishna Kadiyala, Fatemeh Jalali, Amirreza Niakanlahiji, Jian Kong, and Jason P. Jue. “All one needs to know about fog computing and related edge computing paradigms: a complete survey.” Journal of Systems Architecture (2019).
Yang, Qiang, Yang Liu, Tianjian Chen, and Yongxin Tong. “Federated Machine Learning: Concept and Applications.” ACM Transactions on Intelligent Systems and Technology (TIST) 10, no. 2 (2019): 12.
B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized training data,” Google AI Blog, Apr. 2017. Online. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
Wang, Jianyu, and Gauri Joshi. “Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD.” arXiv preprint arXiv:1810.08313 (2018).
Wang, Jianyu, and Gauri Joshi. “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms.” arXiv preprint arXiv:1808.07576 (2018).
K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for federated learning on user-held data,” in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” 2016. Online. Available: https://arxiv.org/abs/1610.02527
T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” arXiv preprint arXiv:1804.08333, 2018.
Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in AAAI Conference on Artificial Intelligence, Jan.–Feb. 2019.
C. Ma, J. Konečný, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, “Distributed optimization with arbitrary local solvers,” Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, 2017.