2018 |
63. | Skach, Matt; Arora, Manish; Tullsen, Dean; Tang, Lingjia; Mars, Jason: Virtual melting temperature: managing server load to minimize cooling overhead with phase change materials. Inproceedings: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 15–28, IEEE, 2018. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/08416815.pdf
Abstract: As the power density and power consumption of large-scale datacenters continue to grow, the challenge of removing heat from these datacenters and keeping them cool is increasingly urgent and costly. With the largest datacenters now exceeding 200 MW of power, the cooling systems that prevent overheating cost on the order of tens of millions of dollars. Prior work proposed deploying phase change materials (PCM) and using Thermal Time Shifting (TTS) to reshape the thermal load of a datacenter by storing heat during peak hours of high utilization and releasing it during off hours when utilization is low, enabling a smaller cooling system to handle the same peak load. The peak cooling load reduction enabled by TTS is greatly beneficial; however, TTS is a passive system that cannot handle many workload mixtures or adapt to changing load or environmental characteristics. In this work we propose VMT, a thermal-aware job placement technique that adds an active, tunable component to enable greater control over datacenter thermal output. We propose two different job placement algorithms for VMT and perform a scale-out study of VMT in a simulated server cluster. We provide analysis of the use cases and trade-offs of each algorithm, and show that VMT reduces peak cooling load by up to 12.8% to provide over two million dollars in cost savings when a smaller cooling system is installed, or allows over 7,000 additional servers to be added in scenarios where TTS is ineffective. |
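The abstract does not detail VMT's two placement algorithms, so the sketch below is only a generic illustration of thermal-aware job placement of this flavor; the server model, the PCM-as-headroom accounting, and the greedy scoring rule are assumptions, not the authors' design.

```python
# Hypothetical sketch of thermal-aware job placement in the spirit of VMT.
# The server model, heat estimates, and scoring rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    heat_w: float        # current heat output (W)
    cooling_cap_w: float # sustainable cooling capacity (W)
    pcm_budget_j: float  # remaining PCM heat-absorption budget (J)

def place_job(servers, job_heat_w, window_s=3600):
    """Greedily pick the server that keeps projected heat furthest under its limit,
    counting PCM as extra short-term headroom over the placement window."""
    def headroom(s):
        effective_cap = s.cooling_cap_w + s.pcm_budget_j / window_s
        return effective_cap - (s.heat_w + job_heat_w)
    best = max(servers, key=headroom)
    if headroom(best) < 0:
        return None  # no placement keeps the cluster within its cooling envelope
    best.heat_w += job_heat_w
    return best.name

servers = [Server("s1", 250, 300, 1.8e6), Server("s2", 180, 300, 0.2e6)]
print(place_job(servers, job_heat_w=60))  # -> "s1"
```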
62. | Lin, Shih-Chieh; Hsu, Chang-Hong; Talamonti, Walter; Zhang, Yunqi; Oney, Steve; Mars, Jason; Tang, Lingjia: Adasa: A Conversational In-Vehicle Digital Assistant for Advanced Driver Assistance Features. Inproceedings: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pp. 531–542, 2018. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/lin2018adasa.pdf
Abstract: Advanced Driver Assistance Systems (ADAS) come equipped on most modern vehicles and are intended to assist the driver and enhance the driving experience through features such as lane keeping and adaptive cruise control. However, recent studies show that few people utilize these features, for several reasons. First, ADAS features were not common until recently. Second, most users are unfamiliar with these features and do not know what to expect. Finally, the interface for operating these features is not intuitive. To help drivers understand ADAS features, we present a conversational in-vehicle digital assistant that responds to drivers' questions and commands in natural language. With the system prototyped herein, drivers can ask questions or issue commands using unconstrained natural language in the vehicle, and the assistant, trained using advanced machine learning techniques and coupled with access to vehicle signals, responds in real time based on conversational context. Results of our system prototyped on a production vehicle are presented, demonstrating its effectiveness in improving driver understanding and usability of ADAS. |
61. | Jain, Animesh; Laurenzano, Michael A; Pokam, Gilles A; Mars, Jason; Tang, Lingjia: Architectural support for convolutional neural networks on modern CPUs. Inproceedings: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1–13, 2018. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3243176.3243177.pdf
Abstract: A key focus of recent work in our community has been on devising increasingly sophisticated acceleration devices for deep neural network (DNN) computation, especially for networks driven by convolution layers. Yet, despite the promise of substantial improvements in performance and energy consumption offered by these approaches, general purpose computing is not going away because of its traditional, well-understood programming model and continued wide deployment. Therefore, the question arises as to what can be done, if anything, to evolve conventional CPUs to accommodate efficient deep neural network computation. This work focuses on the challenging problem of identifying and alleviating the performance bottlenecks of convolution layer computation on conventional CPU platforms. We begin by performing a detailed study of a range of CNN-based applications on a modern CPU microarchitecture, finding that designing a physical register file (PRF) capable of feeding the computational units is the primary barrier that prevents the addition of more compute units to the CPU, limiting the performance improvements that the CPU can achieve on convolution layers. We present the design of a novel, minimally intrusive set of microarchitectural and ISA extensions that address this problem and describe the code generation support needed to take advantage of our design. Through a detailed evaluation that covers 5 state-of-the-art neural network applications, we observe that applying these extensions allows packing more compute into the CPU while keeping PRF energy in check, achieving a 2× performance improvement and a 2.7× energy-delay product improvement over a popular Intel Haswell server processor baseline. |
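As a back-of-the-envelope illustration of the register-file pressure the abstract describes (the port accounting below is an assumption, not taken from the paper), this sketch counts the physical register file ports a core would need as vector FMA units are added.

```python
# Back-of-the-envelope illustration (not from the paper) of why the physical
# register file (PRF) limits adding more vector FMA units to a CPU core:
# every FMA needs 3 source reads and 1 destination write per cycle.
def prf_ports_needed(fma_units, srcs_per_fma=3, dsts_per_fma=1):
    return {"read_ports": fma_units * srcs_per_fma,
            "write_ports": fma_units * dsts_per_fma}

for units in (2, 4, 8):
    print(units, "FMA units ->", prf_ports_needed(units))
# 2 FMA units -> {'read_ports': 6, 'write_ports': 2}
# 4 FMA units -> {'read_ports': 12, 'write_ports': 4}
# 8 FMA units -> {'read_ports': 24, 'write_ports': 8}
```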
2017 |
60. | Hundt, Robert; Tang, Lingjia; Mars, Jason: Allocation of tasks in large scale computing systems. Miscellaneous: 2017 (US Patent 9,563,532). URL: https://www.jasonmars.org/wp-content/uploads/2020/04/pat9563532.pdf
Abstract: Aspects of the invention may be used to allocate tasks among computing machines in large scale computing systems. In one aspect, the method includes executing a first task in the plurality of tasks on a first computing machine and determining a performance degradation threshold for the first task. The method further includes calculating a predicted performance degradation of the first task when a second task is executed on the first computing machine, wherein the predicted performance degradation is determined by comparing a performance interference score of the second task with a performance sensitivity curve of the first task. The method further includes executing the second task on the first computing machine when the predicted performance degradation of the first task is below the performance degradation threshold. |
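A minimal sketch of the co-location check described in this abstract, assuming the sensitivity curve is represented as sorted (interference score, degradation) points and interpolated linearly; the curve data and helper names are hypothetical.

```python
# Minimal sketch of the co-location check described in the patent abstract.
# The curve representation and interpolation are illustrative assumptions.
import bisect

def predicted_degradation(sensitivity_curve, interference_score):
    """sensitivity_curve: sorted (interference_score, degradation_pct) points
    for the first task; returns degradation at the second task's score via
    simple linear interpolation."""
    xs = [x for x, _ in sensitivity_curve]
    ys = [y for _, y in sensitivity_curve]
    i = bisect.bisect_left(xs, interference_score)
    if i == 0:
        return ys[0]
    if i == len(xs):
        return ys[-1]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (interference_score - x0) / (x1 - x0)

def can_colocate(sensitivity_curve, interference_score, degradation_threshold):
    return predicted_degradation(sensitivity_curve, interference_score) < degradation_threshold

curve = [(0.0, 0.0), (0.5, 4.0), (1.0, 15.0)]   # hypothetical profile of the first task
print(can_colocate(curve, interference_score=0.6, degradation_threshold=10.0))  # True (~6.2%)
```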
59. | Hsu, Chang-Hong; Zhang, Yunqi; Laurenzano, Michael A; Meisner, David; Wenisch, Thomas; Dreslinski, Ronald G; Mars, Jason; Tang, Lingjia: Reining in long tails in warehouse-scale computers with quick voltage boosting using Adrenaline. Journal Article: ACM Transactions on Computer Systems (TOCS), 35 (1), pp. 1–33, ACM, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3054742.pdf
Abstract: Reducing the long tail of the query latency distribution in modern warehouse-scale computers is critical for improving performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor's voltage and frequency during a coarse-grained sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach that leverages finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer-granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; second, per-query characteristics can be used to design indicators for proactively pinpointing these queries and triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50× tail latency improvement for Memcached and up to 3.03× for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS) given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to 1.99× for Web Search over coarse-grained DVFS. By using carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS. |
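A hedged sketch of the query-level boosting idea, not Adrenaline's actual controller: boost a query when a per-query indicator crosses a threshold, subject to a fixed boosting power budget. The indicator, threshold, and cost model are assumptions.

```python
# Illustrative sketch (not the paper's controller) of query-level boosting:
# boost a query when a per-query indicator suggests it will land in the tail,
# subject to a fixed boosting power budget. Names and thresholds are assumptions.
def boost_decisions(queries, indicator_threshold, power_budget, boost_cost):
    """queries: list of (query_id, indicator) pairs, e.g. predicted service time.
    Returns the set of query ids to run at the boosted voltage/frequency level."""
    boosted = set()
    for qid, indicator in queries:
        if indicator >= indicator_threshold and power_budget >= boost_cost:
            boosted.add(qid)
            power_budget -= boost_cost
    return boosted

queries = [("q1", 0.2), ("q2", 1.7), ("q3", 2.4), ("q4", 0.4)]  # hypothetical indicators
print(boost_decisions(queries, indicator_threshold=1.5, power_budget=2.0, boost_cost=1.0))
# -> {'q2', 'q3'}
```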
58. | Kang, Yiping; Hauswald, Johann; Gao, Cao; Rovinski, Austin; Mudge, Trevor; Mars, Jason; Tang, Lingjia: Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. Journal Article: ACM SIGARCH Computer Architecture News, 45 (1), pp. 615–629, ACM, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3037697.3037698.pdf
Abstract: The computation for today's intelligent personal assistants, such as Apple Siri, Google Now, and Microsoft Cortana, is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational resources in mobile devices become more powerful and energy efficient, questions arise as to whether this cloud-only processing is desirable moving forward, and what the implications are of pushing some or all of this compute to the mobile devices on the edge. In this paper, we examine the status quo approach of cloud-only processing and investigate computation partitioning strategies that effectively leverage both the cycles in the cloud and on the mobile device to achieve low latency, low energy consumption, and high datacenter throughput for this class of intelligent applications. Our study uses 8 intelligent applications spanning computer vision, speech, and natural language domains, all employing state-of-the-art Deep Neural Networks (DNNs) as the core machine learning technique. We find that given the characteristics of DNN algorithms, a fine-grained, layer-level computation partitioning strategy based on the data and computation variations of each layer within a DNN has significant latency and energy advantages over the status quo approach. Using this insight, we design Neurosurgeon, a lightweight scheduler to automatically partition DNN computation between mobile devices and datacenters at the granularity of neural network layers. Neurosurgeon does not require per-application profiling. It adapts to various DNN architectures, hardware platforms, wireless networks, and server load levels, intelligently partitioning computation for best latency or best mobile energy. We evaluate Neurosurgeon on a state-of-the-art mobile development platform and show that it improves end-to-end latency by 3.1X on average and up to 40.7X, reduces mobile energy consumption by 59.5% on average and up to 94.7%, and improves datacenter throughput by 1.5X on average and up to 6.7X. |
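The layer-granularity partitioning described above can be illustrated with a simple split-point search; the latency and transfer numbers below are hypothetical, and this is only a sketch of the idea, not Neurosurgeon's prediction models.

```python
# Rough sketch of layer-level partitioning in the spirit of Neurosurgeon
# (not the paper's implementation): run layers 0..k-1 on the mobile device,
# upload layer k-1's output, and run layers k..N-1 in the cloud; pick the
# split point k with the lowest estimated end-to-end latency.
def best_split(mobile_lat, cloud_lat, out_bytes, uplink_bps, input_bytes):
    """mobile_lat[i], cloud_lat[i]: per-layer latency estimates (s);
    out_bytes[i]: size of layer i's output; k = 0 means cloud-only."""
    n = len(mobile_lat)
    best_k, best_t = None, float("inf")
    for k in range(n + 1):
        upload = (input_bytes if k == 0 else out_bytes[k - 1]) / uplink_bps
        total = sum(mobile_lat[:k]) + upload + sum(cloud_lat[k:])
        if total < best_t:
            best_k, best_t = k, total
    return best_k, best_t

# Hypothetical 4-layer network: large early feature maps, tiny late ones.
mobile = [0.010, 0.015, 0.080, 0.120]
cloud  = [0.002, 0.003, 0.008, 0.010]
outs   = [2_000_000, 400_000, 50_000, 4_000]   # bytes per layer output
print(best_split(mobile, cloud, outs, uplink_bps=1_000_000, input_bytes=600_000))
# -> (3, 0.165): split after layer 3 beats both cloud-only and device-only
```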
57. | Chen, Quan; Yang, Hailong; Guo, Minyi; Kannan, Ram Srivatsa; Mars, Jason; Tang, Lingjia: Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers. Inproceedings: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 17–32, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3093336.3037700.pdf
Abstract: Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to the application QoS. Although prior work has proposed techniques to identify "safe" co-locations where application QoS is satisfied by predicting the performance interference on multicores, no such prediction technique exists for accelerators such as GPUs. In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed the performance interference on accelerators through a real-system investigation and found that, unlike on multicores where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable precise prediction of processing element, accelerator memory bandwidth, and PCIe bandwidth contention on real hardware. By using a novel technique to forecast solo-run execution traces of the co-located applications using interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet can predict the performance degradation with an average prediction error of 5.47% on real systems. Meanwhile, based on the prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications. |
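A crude stand-in for the interference models described above (Prophet's actual models are far more precise): treat the three contended resources named in the abstract as capacities, estimate slowdown from the most oversubscribed one, and admit a co-location only if the latency-sensitive application stays within its QoS slack.

```python
# Crude illustration (not Prophet's actual models) of admitting a GPU co-location
# only when the predicted slowdown of the latency-sensitive app stays within QoS.
# Resources mirror the abstract: processing elements (PE), device memory
# bandwidth, and PCIe bandwidth; all demand numbers are hypothetical.
def predicted_slowdown(lc_demand, be_demand, capacity):
    """Proxy model: slowdown is driven by the most oversubscribed resource."""
    worst = 1.0
    for r in capacity:
        total = lc_demand[r] + be_demand[r]
        if total > capacity[r]:
            worst = max(worst, total / capacity[r])
    return worst

def safe_to_colocate(lc_demand, be_demand, capacity, qos_slowdown_limit=1.10):
    return predicted_slowdown(lc_demand, be_demand, capacity) <= qos_slowdown_limit

capacity  = {"pe": 1.0, "mem_bw": 1.0, "pcie_bw": 1.0}     # normalized
lc_app    = {"pe": 0.55, "mem_bw": 0.40, "pcie_bw": 0.30}  # latency-sensitive app
batch_app = {"pe": 0.35, "mem_bw": 0.45, "pcie_bw": 0.20}
print(safe_to_colocate(lc_app, batch_app, capacity))  # True at a 10% slack budget
```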
56. | Yang, Hailong; Chen, Quan; Riaz, Moeiz; Luan, Zhongzhi; Tang, Lingjia; Mars, Jason: PowerChief: Intelligent power allocation for multi-stage applications to improve responsiveness on power constrained CMP. Inproceedings: Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 133–146, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3079856.3080224.pdf
Abstract: Modern user-facing applications consist of multiple processing stages with a number of service instances in each stage. The latency profile of these multi-stage applications is intrinsically variable, making it challenging to provide satisfactory responsiveness. Given a limited power budget, improving the end-to-end latency requires intelligently boosting the bottleneck service across stages using multiple boosting techniques. However, prior work fails to acknowledge the multi-stage nature of user-facing applications and performs poorly in improving responsiveness on a power-constrained CMP, as it is unable to accurately identify the bottleneck service and apply the boosting techniques adaptively. In this paper, we present PowerChief, a runtime framework that 1) provides a joint design of service and query to monitor the latency statistics across service stages and accurately identify the bottleneck service at runtime; 2) adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness; 3) dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique. Evaluated with real-world multi-stage applications, PowerChief improves the average latency by 20.3x and 32.4x (99% tail latency by 13.3x and 19.4x) for the Sirius and Natural Language Processing applications respectively, compared to stage-agnostic power allocation. In addition, for a given QoS target, PowerChief reduces the power consumption of the Sirius and Web Search applications by 23% and 33% respectively over prior work. |
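A simplified sketch of the bottleneck-then-reallocate loop the abstract outlines; the stage names, step size, and donor selection rule are assumptions rather than PowerChief's algorithm.

```python
# Simplified sketch (names and the reallocation rule are assumptions, not
# PowerChief's algorithm): find the bottleneck stage from per-stage latency
# statistics, then shift spare power from the fastest stage to the bottleneck.
def rebalance_power(stage_latency_ms, stage_power_w, total_budget_w, step_w=5.0):
    assert sum(stage_power_w.values()) <= total_budget_w
    bottleneck = max(stage_latency_ms, key=stage_latency_ms.get)
    donor = min(stage_latency_ms, key=stage_latency_ms.get)
    if bottleneck != donor and stage_power_w[donor] > step_w:
        stage_power_w[donor] -= step_w       # take power from the slack stage
        stage_power_w[bottleneck] += step_w  # give it to the bottleneck stage
    return bottleneck, stage_power_w

latency = {"asr": 42.0, "nlp": 11.0, "ranker": 18.0}      # measured per query batch
power   = {"asr": 40.0, "nlp": 35.0, "ranker": 35.0}      # current allocation (W)
print(rebalance_power(latency, power, total_budget_w=110.0))
# -> ('asr', {'asr': 45.0, 'nlp': 30.0, 'ranker': 35.0})
```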
55. | Skach, Matt; Arora, Manish; Hsu, Chang-Hong; Li, Qi; Tullsen, Dean; Tang, Lingjia; Mars, Jason: Thermal time shifting: Decreasing datacenter cooling costs with phase change materials. Journal Article: IEEE Internet Computing, IEEE, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/2749469.2749474.pdf
Abstract: Datacenters, or warehouse-scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase change materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours. |
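A toy charge/discharge model of thermal time shifting (all capacities and loads below are made up, and this is not the paper's validated methodology): heat above the cooling system's capacity is absorbed by PCM during peak hours and released off-peak.

```python
# Toy model (assumptions throughout, not the paper's methodology) of thermal
# time shifting: when server heat exceeds the cooling system's capacity, the
# excess is absorbed by PCM; when load drops, the stored heat is released.
def simulate_pcm(heat_w_by_hour, cooling_cap_w, pcm_capacity_j):
    stored_j, overheated_hours = 0.0, 0
    for heat_w in heat_w_by_hour:
        excess_w = heat_w - cooling_cap_w
        if excess_w > 0:                       # peak: charge the PCM
            stored_j += excess_w * 3600
            if stored_j > pcm_capacity_j:      # PCM saturated -> thermal limit hit
                stored_j = pcm_capacity_j
                overheated_hours += 1
        else:                                  # off-peak: discharge the PCM
            stored_j = max(0.0, stored_j + excess_w * 3600)
    return overheated_hours

# A diurnal load that peaks above a 100 kW cooling system for six hours.
load = [80e3] * 8 + [120e3] * 6 + [90e3] * 10
print(simulate_pcm(load, cooling_cap_w=100e3, pcm_capacity_j=500e6))  # 0 hours over the limit
```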
54. | Hill, Parker; Jain, Animesh; Hill, Mason; Zamirai, Babak; Hsu, Chang-Hong; Laurenzano, Michael A; Mahlke, Scott; Tang, Lingjia; Mars, Jason: DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission. Inproceedings: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 786–799, 2017. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/3123939.3123970.pdf
Abstract: Deep neural networks (DNNs) are key computational building blocks for emerging classes of web services that interact in real time with users via voice, images, and video inputs. Although GPUs have gained popularity as a key accelerator platform for deep learning workloads, the increasing demand for DNN computation leaves a significant gap between the compute capabilities of GPU-enabled datacenters and the compute needed to service demand. The state-of-the-art techniques to improve DNN performance have significant limitations in bridging the gap on real systems. Current network pruning techniques remove computation, but the resulting networks map poorly to GPU architectures, yielding no performance benefit or even slowdowns. Meanwhile, current bandwidth optimization techniques focus on reducing off-chip bandwidth while overlooking on-chip bandwidth, a key DNN bottleneck. To address these limitations, this work introduces DeftNN, a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance. DeftNN is composed of two novel optimization techniques: (1) synapse vector elimination, a technique that identifies non-contributing synapses in the DNN and carefully transforms data and removes the computation and data movement of these synapses while fully utilizing the GPU to improve performance, and (2) near-compute data fission, a mechanism for scaling down the on-chip data movement requirements within DNN computations. Our evaluation of DeftNN spans 6 state-of-the-art DNNs. By applying both optimizations in concert, DeftNN is able to achieve an average speedup of 2.1X on real GPU hardware. We also introduce a small additional hardware unit per GPU core to facilitate efficient data fission operations, increasing the speedup achieved by DeftNN to 2.6X. |
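A conceptual sketch of synapse vector elimination: dropping whole weight columns (and the matching input rows) keeps the remaining GEMM dense and GPU-friendly. The magnitude-based selection criterion and the shapes used here are illustrative assumptions, not DeftNN's mechanism.

```python
# Conceptual sketch of synapse vector elimination (selection criterion is an
# assumption, not DeftNN's): drop whole weight columns with negligible magnitude
# and the matching input rows, so the remaining GEMM stays dense and GPU-friendly.
import numpy as np

def eliminate_synapse_vectors(weights, inputs, keep_fraction=0.75):
    """weights: (out_features, in_features); inputs: (in_features, batch).
    Keeps the columns of `weights` (and rows of `inputs`) with the largest L2 norm."""
    col_norms = np.linalg.norm(weights, axis=0)
    k = max(1, int(keep_fraction * weights.shape[1]))
    keep = np.sort(np.argsort(col_norms)[-k:])       # indices of retained synapses
    return weights[:, keep], inputs[keep, :]

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256))
X = rng.normal(size=(256, 32))
W_small, X_small = eliminate_synapse_vectors(W, X, keep_fraction=0.75)
print(W_small.shape, X_small.shape)          # (128, 192) (192, 32): a dense, smaller GEMM
approx = W_small @ X_small                   # approximate layer output
print(approx.shape)                          # (128, 32)
```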
2016 |
53. | Jain, Animesh; Hill, Parker; Laurenzano, Michael A; Haque, Md E; Khan, Muneeb; Mahlke, Scott; Tang, Lingjia; Mars, Jason: CPSA: Compute precisely store approximately. Inproceedings: Workshop on Approximate Computing Across the Stack, 2016. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/jain.pdf
Abstract: We propose a new approximate-computing paradigm, where computations are performed precisely while the data is stored approximately in the memory using data packing. This lets us reduce the memory traffic, improving application memory behavior. It achieves 85% memory savings for an accuracy target of 90%. |
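A minimal sketch of the compute-precisely/store-approximately idea; the particular packing scheme (float32 stored as float16 in memory) is an illustrative assumption, not necessarily the packing used in the paper.

```python
# Minimal sketch of the compute-precisely/store-approximately idea (the packing
# scheme here, float32 -> float16 in memory, is an illustrative assumption).
import numpy as np

def store_approximately(values_f32):
    return values_f32.astype(np.float16)          # packed: half the memory traffic

def load_and_compute(packed):
    x = packed.astype(np.float32)                 # unpack before computing
    return np.dot(x, x)                           # the computation itself stays precise

data = np.linspace(0.0, 1.0, 1024, dtype=np.float32)
packed = store_approximately(data)
print(packed.nbytes / data.nbytes)                # 0.5 -> 50% memory savings
print(load_and_compute(packed), np.dot(data, data))  # approximate vs exact result
```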
52. | Hauswald, Johann; Laurenzano, Michael A; Zhang, Yunqi; Yang, Hailong; Kang, Yiping; Li, Cheng; Rovinski, Austin; Khurana, Arjun; Dreslinski, Ronald G; Mudge, Trevor; et al.: Designing future warehouse-scale computers for Sirius, an end-to-end voice and vision personal assistant. Journal Article: ACM Transactions on Computer Systems (TOCS), 34 (1), pp. 1–32, ACM, 2016. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/2870631.pdf
Abstract: As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively. |
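A toy TCO-per-throughput comparison in the spirit of the analysis described above; every cost, power, and speedup number below is a placeholder, and the paper's TCO model is more detailed.

```python
# Illustrative TCO-per-throughput comparison (all costs and speedups are made-up
# placeholders; the paper's TCO model is more detailed).
def tco_per_qps(server_capex, accel_capex, power_w, energy_cost_per_kwh,
                years, baseline_qps, speedup):
    energy = power_w / 1000.0 * 24 * 365 * years * energy_cost_per_kwh
    return (server_capex + accel_capex + energy) / (baseline_qps * speedup)

baseline = tco_per_qps(5000, 0,    300, 0.10, 3, baseline_qps=100, speedup=1.0)
gpu      = tco_per_qps(5000, 3000, 600, 0.10, 3, baseline_qps=100, speedup=8.5)
print(round(baseline / gpu, 2))   # ~5.14: rough factor by which the GPU server is cheaper per unit throughput
```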
51. | Chen, Quan; Yang, Hailong; Mars, Jason; Tang, Lingjia: Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. Journal Article: ACM SIGPLAN Notices, 51 (4), pp. 681–696, ACM, 2016. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/2872362.2872368.pdf
Abstract: Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multi-core CPUs and introduces a new set of challenges to reduce QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution. |
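A simplified admission check in the spirit of Baymax's orchestration (the durations, names, and policy below are assumptions): on a non-preemptive accelerator, a best-effort kernel is launched only if it leaves enough slack for the next user-facing task to meet its latency target.

```python
# Simplified admission check in the spirit of Baymax (durations, names, and the
# policy are assumptions): on a non-preemptive accelerator, only launch a
# best-effort kernel if its predicted duration leaves enough slack for the
# next user-facing task to still meet its latency target.
def can_launch_best_effort(be_duration_ms, queued_work_ms,
                           lc_arrival_in_ms, lc_duration_ms, lc_deadline_ms):
    # Once launched, the best-effort kernel cannot be preempted, so the
    # user-facing task may have to wait for it plus already-queued work.
    lc_start = max(lc_arrival_in_ms, queued_work_ms + be_duration_ms)
    return lc_start + lc_duration_ms <= lc_deadline_ms

print(can_launch_best_effort(be_duration_ms=8, queued_work_ms=2,
                             lc_arrival_in_ms=5, lc_duration_ms=12,
                             lc_deadline_ms=30))   # True: 10 + 12 <= 30
print(can_launch_best_effort(be_duration_ms=25, queued_work_ms=2,
                             lc_arrival_in_ms=5, lc_duration_ms=12,
                             lc_deadline_ms=30))   # False: 27 + 12 > 30
```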
50. | Zhang, Yunqi; Meisner, David; Mars, Jason; Tang, Lingjia: Treadmill: Attributing the source of tail latency through precise load testing and statistical inference. Inproceedings: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 456–468, IEEE, 2016. URL: https://www.jasonmars.org/wp-content/uploads/2020/04/ISCA.2016.47.pdf
Abstract: Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies, including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high-quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes the pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%. |
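A toy example of quantile-regression-based attribution on synthetic data (this is not Treadmill itself): estimating how toggling a hardware feature shifts the 99th-percentile latency rather than only the mean.

```python
# Toy example of quantile-regression-based attribution (synthetic data; not
# Treadmill itself): estimate how enabling a hardware feature shifts the
# 99th-percentile latency, rather than only the mean.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
feature_on = rng.integers(0, 2, size=n)                    # 0 = disabled, 1 = enabled
base = rng.lognormal(mean=1.0, sigma=0.4, size=n)          # ms, heavy-tailed service time
latency = base + 2.0 * feature_on * (base > np.quantile(base, 0.95))  # feature mostly hurts the tail

X = sm.add_constant(feature_on.astype(float))
tail_fit = sm.QuantReg(latency, X).fit(q=0.99)
mean_fit = sm.OLS(latency, X).fit()
print("99th-pct shift (ms):", round(tail_fit.params[1], 2))
print("mean shift (ms):    ", round(mean_fit.params[1], 2))
```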
49. | Laurenzano, Michael A; Zhang, Yunqi; Chen, Jiang; Tang, Lingjia; Mars, Jason Powerchop: Identifying and managing non-critical units in hybrid processor architectures Inproceedings 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 140–152, IEEE 2016. @inproceedings{laurenzano2016powerchop, title = {Powerchop: Identifying and managing non-critical units in hybrid processor architectures}, author = {Michael A Laurenzano and Yunqi Zhang and Jiang Chen and Lingjia Tang and Jason Mars}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3007787.3001152.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)}, pages = {140--152}, organization = {IEEE}, abstract = {On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by powering gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by powering gating units that are not needed for performant execution. 
Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown. |
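PowerChop's core decision, as described in the entry above, is whether a unit earns its power cost in the current application phase. The snippet below is a minimal software-only sketch of that bookkeeping, not the paper's HW/SW co-designed mechanism; the phase signatures, cycle counts, and slowdown budget are illustrative stand-ins.

```python
# Illustrative sketch: per-phase tracking of a unit's benefit and a gating decision.
from collections import defaultdict

SLOWDOWN_BUDGET = 0.02  # assumed budget; the paper reports roughly 2% slowdown overall

class UnitGovernor:
    def __init__(self):
        # phase signature -> observed speedups contributed by the unit
        self.benefit = defaultdict(list)

    def observe(self, phase_sig, cycles_with_unit, cycles_without_unit):
        # Record how much slower this phase runs when the unit is unavailable.
        self.benefit[phase_sig].append(cycles_without_unit / cycles_with_unit - 1.0)

    def should_gate(self, phase_sig):
        history = self.benefit.get(phase_sig)
        if not history:
            return False  # unknown phase: keep the unit powered
        avg_benefit = sum(history) / len(history)
        return avg_benefit < SLOWDOWN_BUDGET

gov = UnitGovernor()
gov.observe("loop:A", cycles_with_unit=100, cycles_without_unit=101)  # ~1% benefit
gov.observe("loop:B", cycles_with_unit=100, cycles_without_unit=150)  # 50% benefit
print(gov.should_gate("loop:A"), gov.should_gate("loop:B"))  # True False
```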
48. | Laurenzano, Michael A; Hill, Parker; Samadi, Mehrzad; Mahlke, Scott; Mars, Jason; Tang, Lingjia Input responsiveness: using canary inputs to dynamically steer approximation Inproceedings Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 161–176, 2016. @inproceedings{laurenzano2016input, title = {Input responsiveness: using canary inputs to dynamically steer approximation}, author = {Michael A Laurenzano and Parker Hill and Mehrzad Samadi and Scott Mahlke and Jason Mars and Lingjia Tang}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2908080.2908087.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation}, pages = {161--176}, abstract = {This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. 
We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs. |
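The IRA entry above selects an approximation per input by testing candidates on a small canary derived from that input. Below is a minimal sketch of that selection loop under invented assumptions: a toy blur kernel, a single subsampling approximation, a mean-relative-error metric, and a 5% accuracy target stand in for the paper's four techniques and quality constraints.

```python
# Illustrative sketch of canary-based approximation selection (not IRA itself).
import time
import numpy as np

def exact_blur(img):
    # Toy "exact" kernel: 5-point average stencil with wraparound.
    return (img + np.roll(img, 1, 0) + np.roll(img, -1, 0)
            + np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 5.0

def subsampled_blur(img):
    # Toy approximation: blur a 2x-subsampled image, then upsample by replication.
    return np.repeat(np.repeat(exact_blur(img[::2, ::2]), 2, axis=0), 2, axis=1)

APPROXIMATIONS = {"subsampled": subsampled_blur, "exact": exact_blur}

def relative_error(approx, exact):
    return np.abs(approx - exact).mean() / (np.abs(exact).mean() + 1e-12)

def choose_approximation(full_input, target_error=0.05, canary_side=64):
    canary = full_input[:canary_side, :canary_side]   # small canary built from the real input
    reference = exact_blur(canary)
    best_name, best_time = "exact", float("inf")
    for name, fn in APPROXIMATIONS.items():
        start = time.perf_counter()
        out = fn(canary)
        elapsed = time.perf_counter() - start
        if relative_error(out, reference) <= target_error and elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

x = np.linspace(0, 1, 2048)
image = np.add.outer(np.sin(6 * x), np.cos(6 * x))   # smooth synthetic input
name = choose_approximation(image)
print("selected for this input:", name)
result = APPROXIMATIONS[name](image)                 # apply the chosen variant to the full input
```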
47. | Jain, Animesh; Laurenzano, Michael A; Tang, Lingjia; Mars, Jason Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting Inproceedings 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016. @inproceedings{jain2016continuous, title = {Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting}, author = {Animesh Jain and Michael A Laurenzano and Lingjia Tang and Jason Mars}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195666.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, pages = {1--12}, organization = {IEEE}, abstract = {The class of optimizations characterized by manipulating a loop's iteration space for improved cache locality and reuse (i.e., cache tiling / blocking / strip mine and interchange) are static optimizations requiring a priori information about the microarchitectural and runtime environment of an application binary. However, particularly in datacenter environments, deployed applications face numerous dynamic environments over their lifetimes. As a result, this class of optimizations can result in sub-optimal performance due to the inability to flexibly adapt iteration spaces as cache conditions change at runtime. This paper introduces continuous shape shifting, a compilation approach that removes the risks of cache tiling optimizations by dynamically rewriting (and reshaping) deployed, running application code. To realize continuous shape shifting, we present ShapeShifter, a framework for continuous monitoring of co-running applications and their runtime environments to reshape loop iteration spaces and pinpoint near-optimal loop tile configurations. Upon identifying a need for reshaping, a new tiling approach is quickly constructed for the application, new code is dynamically generated and is then seamlessly stitched into the running application with near-zero overhead. Our evaluation on a wide spectrum of runtime scenarios demonstrates that ShapeShifter achieves an average of 10--40% performance improvement (up to 2.4X) on real systems depending on the runtime environment compared to an oracle static loop tiling baseline.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The class of optimizations characterized by manipulating a loop's iteration space for improved cache locality and reuse (i.e., cache tiling / blocking / strip mine and interchange) are static optimizations requiring a priori information about the microarchitectural and runtime environment of an application binary. However, particularly in datacenter environments, deployed applications face numerous dynamic environments over their lifetimes. As a result, this class of optimizations can result in sub-optimal performance due to the inability to flexibly adapt iteration spaces as cache conditions change at runtime. This paper introduces continuous shape shifting, a compilation approach that removes the risks of cache tiling optimizations by dynamically rewriting (and reshaping) deployed, running application code. To realize continuous shape shifting, we present ShapeShifter, a framework for continuous monitoring of co-running applications and their runtime environments to reshape loop iteration spaces and pinpoint near-optimal loop tile configurations. 
Upon identifying a need for reshaping, a new tiling approach is quickly constructed for the application, new code is dynamically generated and is then seamlessly stitched into the running application with near-zero overhead. Our evaluation on a wide spectrum of runtime scenarios demonstrates that ShapeShifter achieves an average of 10--40% performance improvement (up to 2.4X) on real systems depending on the runtime environment compared to an oracle static loop tiling baseline. |
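ShapeShifter, per the entry above, rewrites running native code to re-tile loops as cache conditions change. The sketch below only captures the policy side under assumed numbers: pick a tile size from the currently available share of a hypothetical 32 MB last-level cache and re-run a blocked matrix multiply with it; there is no dynamic code generation here.

```python
# Illustrative sketch: runtime re-selection of a loop tile size (policy only).
import numpy as np

LLC_BYTES = 32 * 1024 * 1024  # assumed 32 MB last-level cache

def pick_tile(available_cache_bytes, dtype_bytes=8):
    # Three T x T blocks (from A, B, and C) should fit in our share of the cache.
    t = max(16, int((available_cache_bytes / (3 * dtype_bytes)) ** 0.5))
    return 1 << (t.bit_length() - 1)   # round down to a power of two

def tiled_matmul(a, b, tile):
    n = a.shape[0]
    c = np.zeros_like(a)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return c

n = 512
a, b = np.random.rand(n, n), np.random.rand(n, n)
for corunner_share in (0.0, 0.75):                 # a co-runner appears and takes 75% of the LLC
    tile = pick_tile(LLC_BYTES * (1.0 - corunner_share))
    assert np.allclose(tiled_matmul(a, b, tile), a @ b)
    print(f"co-runner LLC share {corunner_share:.0%} -> tile {tile}x{tile}")
```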
46. | Jain, Animesh; Hill, Parker; Lin, Shih-Chieh; Khan, Muneeb; Haque, Md E; Laurenzano, Michael A; Mahlke, Scott; Tang, Lingjia; Mars, Jason Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation Inproceedings 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, IEEE 2016. @inproceedings{jain2016concise, title = {Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation}, author = {Animesh Jain and Parker Hill and Shih-Chieh Lin and Muneeb Khan and Md E Haque and Michael A Laurenzano and Scott Mahlke and Lingjia Tang and Jason Mars}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195688.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, pages = {1--13}, organization = {IEEE}, abstract = {Cache capacity and memory bandwidth play critical roles in application performance, particularly for data-intensive applications from domains that include machine learning, numerical analysis, and data mining. Many of these applications are also tolerant to imprecise inputs and have loose constraints on the quality of output, making them ideal candidates for approximate computing. This paper introduces a novel approximate computing technique that decouples the format of data in the memory hierarchy from the format of data in the compute subsystem to significantly reduce the cost of storing and moving bits throughout the memory hierarchy and improve application performance. This asymmetric compute-memory extension to conventional architectures, ACME, adds two new instruction classes to the ISA - load-concise and store-concise - along with three small functional units to the micro-architecture to support these instructions. ACME does not affect exact execution of applications and comes into play only when concise memory operations are used. Through detailed experimentation we find that ACME is very effective at trading result accuracy for improved application performance. Our results show that ACME achieves a 1.3X speedup (up to 1.8X) while maintaining 99% accuracy, or a 1.1X speedup while maintaining 99.999% accuracy. Moreover, our approach incurs negligible area and power overheads, adding just 0.005% area and 0.1% power to a conventional modern architecture.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Cache capacity and memory bandwidth play critical roles in application performance, particularly for data-intensive applications from domains that include machine learning, numerical analysis, and data mining. Many of these applications are also tolerant to imprecise inputs and have loose constraints on the quality of output, making them ideal candidates for approximate computing. This paper introduces a novel approximate computing technique that decouples the format of data in the memory hierarchy from the format of data in the compute subsystem to significantly reduce the cost of storing and moving bits throughout the memory hierarchy and improve application performance. This asymmetric compute-memory extension to conventional architectures, ACME, adds two new instruction classes to the ISA - load-concise and store-concise - along with three small functional units to the micro-architecture to support these instructions. ACME does not affect exact execution of applications and comes into play only when concise memory operations are used. 
Through detailed experimentation we find that ACME is very effective at trading result accuracy for improved application performance. Our results show that ACME achieves a 1.3X speedup (up to 1.8X) while maintaining 99% accuracy, or a 1.1X speedup while maintaining 99.999% accuracy. Moreover, our approach incurs negligible area and power overheads, adding just 0.005% area and 0.1% power to a conventional modern architecture. |
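ACME's concise loads and stores, described in the entry above, shrink values as they move through the memory hierarchy while compute remains full-precision. As a rough software-only illustration (the paper's mechanism is new ISA instructions plus three small functional units), the snippet below emulates a concise store by zeroing low-order mantissa bits of float32 data and reports the resulting error on a dot product; the bit widths and workload are arbitrary choices.

```python
# Illustrative sketch: emulate a "store-concise" by truncating float32 mantissas.
import numpy as np

def store_concise(x, mantissa_bits):
    """Keep only the top `mantissa_bits` of the 23-bit float32 mantissa."""
    drop = 23 - mantissa_bits
    bits = x.astype(np.float32).view(np.uint32)
    return ((bits >> drop) << drop).view(np.float32)  # zero the discarded low-order bits

data = np.random.rand(1_000_000).astype(np.float32)
exact = float(np.dot(data, data))
for kept in (16, 8, 4):
    concise = store_concise(data, kept)
    approx = float(np.dot(concise, concise))
    rel_err = abs(approx - exact) / exact
    print(f"{kept:2d} mantissa bits kept -> relative error {rel_err:.2e}")
```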
45. | Zekany, Stephen; Rings, Daniel; Harada, Nathan; Laurenzano, Michael A; Tang, Lingjia; Mars, Jason CrystalBall: Statically analyzing runtime behavior via deep sequence learning Inproceedings 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016. @inproceedings{zekany2016crystalball, title = {CrystalBall: Statically analyzing runtime behavior via deep sequence learning}, author = {Stephen Zekany and Daniel Rings and Nathan Harada and Michael A Laurenzano and Lingjia Tang and Jason Mars}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195667.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, pages = {1--12}, organization = {IEEE}, abstract = {Understanding dynamic program behavior is critical in many stages of the software development lifecycle, for purposes as diverse as optimization, debugging, testing, and security. This paper focuses on the problem of predicting dynamic program behavior statically. We introduce a novel technique to statically identify hot paths that leverages emerging deep learning techniques to take advantage of their ability to learn subtle, complex relationships between sequences of inputs. This approach maps well to the problem of identifying the behavior of sequences of basic blocks in program execution. Our technique is also designed to operate on the compiler's intermediate representation (IR), as opposed to the approaches taken by prior techniques that have focused primarily on source code, giving our approach language-independence. We describe the pitfalls of conventional metrics used for hot path prediction such as accuracy, and motivate the use of Area Under the Receiver Operating Characteristic curve (AUROC). Through a thorough evaluation of our technique on complex applications that include the SPEC CPU2006 benchmarks, we show that our approach achieves an AUROC of 0.85.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Understanding dynamic program behavior is critical in many stages of the software development lifecycle, for purposes as diverse as optimization, debugging, testing, and security. This paper focuses on the problem of predicting dynamic program behavior statically. We introduce a novel technique to statically identify hot paths that leverages emerging deep learning techniques to take advantage of their ability to learn subtle, complex relationships between sequences of inputs. This approach maps well to the problem of identifying the behavior of sequences of basic blocks in program execution. Our technique is also designed to operate on the compiler's intermediate representation (IR), as opposed to the approaches taken by prior techniques that have focused primarily on source code, giving our approach language-independence. We describe the pitfalls of conventional metrics used for hot path prediction such as accuracy, and motivate the use of Area Under the Receiver Operating Characteristic curve (AUROC). Through a thorough evaluation of our technique on complex applications that include the SPEC CPU2006 benchmarks, we show that our approach achieves an AUROC of 0.85. |
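The CrystalBall entry above argues that plain accuracy is a misleading metric for hot-path prediction because hot paths are rare, and reports AUROC instead. The sketch below makes that point on synthetic labels and scores (no IR or sequence model involved): a trivial "never hot" predictor gets high accuracy, while AUROC reflects actual ranking quality.

```python
# Illustrative sketch: accuracy vs. AUROC on an imbalanced hot-path labeling task.
import numpy as np

rng = np.random.default_rng(1)
n_paths = 10_000
is_hot = rng.random(n_paths) < 0.05                       # assume ~5% of static paths are hot
score = is_hot * 0.4 + rng.random(n_paths) * 0.6          # scores from an imperfect synthetic model

def auroc(labels, scores):
    """Probability that a randomly chosen hot path outscores a randomly chosen cold one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(f"accuracy of predicting 'never hot': {(~is_hot).mean():.3f}")
print(f"AUROC of the synthetic model:       {auroc(is_hot, score):.3f}")
```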
44. | Hauswald, Johann; Laurenzano, Michael A; Zhang, Yunqi; Li, Cheng; Rovinski, Austin; Khurana, Arjun; Dreslinski, Ronald G; Mudge, Trevor; Petrucci, Vinicius; Tang, Lingjia; others, Sirius implications for future warehouse-scale computers Journal Article IEEE Micro, 36 (3), pp. 42–53, 2016. @article{hauswald2016sirius, title = {Sirius implications for future warehouse-scale computers}, author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and Vinicius Petrucci and Lingjia Tang and others}, url = {https://www.jasonmars.org/wp-content/uploads/2020/04/07478443.pdf}, year = {2016}, date = {2016-01-01}, journal = {IEEE Micro}, volume = {36}, number = {3}, pages = {42--53}, publisher = {IEEE}, abstract = {Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple's Siri. If the prediction of the trend is correct, these types of applications will likely consume most of the world's computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple's Siri. If the prediction of the trend is correct, these types of applications will likely consume most of the world's computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it. |