2017
Yiping Kang; Johann Hauswald; Cao Gao; Austin Rovinski; Trevor Mudge; Jason Mars; Lingjia Tang
Neurosurgeon: Collaborative intelligence between the cloud and mobile edge Journal Article
In: ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
@article{kang2017neurosurgeon,
title = {Neurosurgeon: Collaborative intelligence between the cloud and mobile edge},
author = {Yiping Kang and Johann Hauswald and Cao Gao and Austin Rovinski and Trevor Mudge and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3037697.3037698.pdf},
year = {2017},
date = {2017-01-01},
journal = {ACM SIGARCH Computer Architecture News},
volume = {45},
number = {1},
pages = {615--629},
publisher = {ACM New York, NY, USA},
abstract = {The computation for today's intelligent personal assistants such as Apple Siri, Google Now, and Microsoft Cortana, is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational resources in mobile devices become more powerful and energy efficient, questions arise as to whether this cloud-only processing is desirable moving forward, and what are the implications of pushing some or all of this compute to the mobile devices on the edge.
In this paper, we examine the status quo approach of cloud-only processing and investigate computation partitioning strategies that effectively leverage both the cycles in the cloud and on the mobile device to achieve low latency, low energy consumption, and high datacenter throughput for this class of intelligent applications. Our study uses 8 intelligent applications spanning computer vision, speech, and natural language domains, all employing state-of-the-art Deep Neural Networks (DNNs) as the core machine learning technique. We find that given the characteristics of DNN algorithms, a fine-grained, layer-level computation partitioning strategy based on the data and computation variations of each layer within a DNN has significant latency and energy advantages over the status quo approach.
Using this insight, we design Neurosurgeon, a lightweight scheduler to automatically partition DNN computation between mobile devices and datacenters at the granularity of neural network layers. Neurosurgeon does not require per-application profiling. It adapts to various DNN architectures, hardware platforms, wireless networks, and server load levels, intelligently partitioning computation for best latency or best mobile energy. We evaluate Neurosurgeon on a state-of-the-art mobile development platform and show that it improves end-to-end latency by 3.1X on average and up to 40.7X, reduces mobile energy consumption by 59.5% on average and up to 94.7%, and improves datacenter throughput by 1.5X on average and up to 6.7X.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
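To make the layer-level partitioning idea in the abstract above concrete, here is a minimal Python sketch that picks the partition point with the lowest predicted end-to-end latency, given assumed per-layer mobile and cloud latencies, layer output sizes, and uplink bandwidth. All numbers are hypothetical; the actual Neurosurgeon system predicts these quantities with regression models and measures the wireless link at runtime.

```python
# Minimal sketch of layer-level DNN partitioning in the spirit of Neurosurgeon.
# All numbers are hypothetical illustrations, not measurements from the paper.

def best_partition(mobile_ms, cloud_ms, out_bytes, uplink_bytes_per_ms, input_bytes):
    """Return (k, latency): layers [0, k) run on the mobile device, layers [k, n)
    run in the cloud, and the output of layer k-1 (or the raw input when k == 0)
    is shipped over the wireless link."""
    n = len(mobile_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):                      # k = number of layers kept on the device
        mobile = sum(mobile_ms[:k])
        transfer_bytes = input_bytes if k == 0 else out_bytes[k - 1]
        transfer = 0.0 if k == n else transfer_bytes / uplink_bytes_per_ms
        cloud = sum(cloud_ms[k:])
        latency = mobile + transfer + cloud
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency

if __name__ == "__main__":
    # Hypothetical 5-layer network: early layers are cheap but produce large outputs.
    mobile_ms = [12.0, 10.0, 8.0, 30.0, 25.0]
    cloud_ms  = [ 1.5,  1.2, 1.0,  3.5,  3.0]
    out_bytes = [800_000, 400_000, 50_000, 20_000, 4_000]
    k, lat = best_partition(mobile_ms, cloud_ms, out_bytes,
                            uplink_bytes_per_ms=1_000, input_bytes=600_000)
    print(f"run layers 0..{k-1} on the device, rest in the cloud: ~{lat:.1f} ms")
```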
Quan Chen; Hailong Yang; Minyi Guo; Ram Srivatsa Kannan; Jason Mars; Lingjia Tang
Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers Proceedings Article
In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 17–32, 2017.
@inproceedings{chen2017prophet,
title = {Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers},
author = {Quan Chen and Hailong Yang and Minyi Guo and Ram Srivatsa Kannan and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3093336.3037700.pdf},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {17--32},
abstract = {Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to the application QoS. Although prior work has proposed techniques to identify "safe" co-locations where application QoS is satisfied by predicting the performance interference on multicores, no such prediction technique exists for accelerators such as GPUs.
In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed the performance interference on accelerators through a real system investigation and found that unlike on multicores where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth and PCIe bandwidth. Based on this observation, we designed interference models that enable the precise prediction for processing element, accelerator memory bandwidth and PCIe bandwidth contention on real hardware. By using a novel technique to forecast solo-run execution traces of the co-located applications using interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet can predict the performance degradation with an average prediction error 5.47% on real systems. Meanwhile, based on the prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
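The following sketch illustrates the "safe co-location" decision that Prophet enables. The additive pressure model and all utilization numbers are assumptions made purely for illustration; the paper instead builds calibrated interference models for processing elements, accelerator memory bandwidth, and PCIe bandwidth and forecasts solo-run execution traces.

```python
# Illustrative sketch of a QoS-aware co-location check on a non-preemptive accelerator.
# The contention model and numbers below are hypothetical stand-ins, not Prophet's models.
from dataclasses import dataclass

@dataclass
class KernelProfile:
    solo_ms: float       # solo-run duration of the kernel
    pe_util: float       # fraction of processing elements needed (0..1)
    mem_bw_util: float   # fraction of accelerator memory bandwidth (0..1)
    pcie_bw_util: float  # fraction of PCIe bandwidth (0..1)

def predict_colocated_ms(ls: KernelProfile, be: KernelProfile) -> float:
    """Hypothetical model: the latency-sensitive kernel is stretched in
    proportion to the most oversubscribed shared resource."""
    pressure = max(ls.pe_util + be.pe_util,
                   ls.mem_bw_util + be.mem_bw_util,
                   ls.pcie_bw_util + be.pcie_bw_util)
    return ls.solo_ms * max(1.0, pressure)

def is_safe_colocation(ls: KernelProfile, be: KernelProfile, qos_ms: float) -> bool:
    return predict_colocated_ms(ls, be) <= qos_ms

if __name__ == "__main__":
    search = KernelProfile(solo_ms=8.0, pe_util=0.6, mem_bw_util=0.5, pcie_bw_util=0.3)
    batch  = KernelProfile(solo_ms=50.0, pe_util=0.7, mem_bw_util=0.3, pcie_bw_util=0.2)
    print("safe to co-locate:", is_safe_colocation(search, batch, qos_ms=10.0))
```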
Hailong Yang; Quan Chen; Moeiz Riaz; Zhongzhi Luan; Lingjia Tang; Jason Mars
PowerChief: Intelligent power allocation for multi-stage applications to improve responsiveness on power constrained CMP Proceedings Article
In: Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 133–146, 2017.
@inproceedings{yang2017powerchief,
title = {PowerChief: Intelligent power allocation for multi-stage applications to improve responsiveness on power constrained CMP},
author = {Hailong Yang and Quan Chen and Moeiz Riaz and Zhongzhi Luan and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3079856.3080224.pdf},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 44th Annual International Symposium on Computer Architecture},
pages = {133--146},
abstract = {Modern user-facing applications consist of multiple processing stages with a number of service instances in each stage. The latency profile of these multi-stage applications is intrinsically variable, making it challenging to provide satisfactory responsiveness. Given a limited power budget, improving the end-to-end latency requires intelligently boosting the bottleneck service across stages using multiple boosting techniques. However, prior work fails to acknowledge the multi-stage nature of user-facing applications and performs poorly in improving responsiveness on power constrained CMP, as it is unable to accurately identify the bottleneck service and apply the boosting techniques adaptively.
In this paper, we present PowerChief, a runtime framework that 1) provides joint design of service and query to monitor the latency statistics across service stages and accurately identifies the bottleneck service during runtime; 2) adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness; 3) dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique. Evaluated with real world multi-stage applications, PowerChief improves the average latency by 20.3x and 32.4x (99% tail latency by 13.3x and 19.4x) for Sirius and Natural Language Processing applications respectively compared to stage-agnostic power allocation. In addition, for the given QoS target, PowerChief reduces the power consumption of Sirius and Web Search applications by 23% and 33% respectively over prior work.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
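A simple sketch of the stage-aware power reallocation idea follows. The latency model (stage latency shrinking with allocated power, with a floor) is a deliberately crude stand-in; PowerChief instead monitors real per-stage latency statistics and chooses among multiple boosting techniques.

```python
# Sketch of stage-aware power reallocation in the spirit of PowerChief.
# The power-to-latency model and all numbers are hypothetical.

def stage_latency_ms(work_ms, watts):
    # Hypothetical: more power shortens a stage, with diminishing returns.
    return work_ms * max(0.3, 10.0 / watts)

def reallocate(work_ms, budget_watts, steps=50, delta=1.0):
    n = len(work_ms)
    alloc = [budget_watts / n] * n                           # start with an even split
    for _ in range(steps):
        lat = [stage_latency_ms(w, p) for w, p in zip(work_ms, alloc)]
        bottleneck = max(range(n), key=lambda i: lat[i])     # slowest stage gets power
        donor = min(range(n), key=lambda i: lat[i])          # fastest stage gives it up
        if donor == bottleneck or alloc[donor] <= delta:
            break
        alloc[donor] -= delta
        alloc[bottleneck] += delta
    total = sum(stage_latency_ms(w, p) for w, p in zip(work_ms, alloc))
    return alloc, total

if __name__ == "__main__":
    alloc, total = reallocate(work_ms=[5.0, 20.0, 8.0], budget_watts=60.0)
    print("watts per stage:", [round(a, 1) for a in alloc], "end-to-end ms:", round(total, 1))
```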
Matt Skach; Manish Arora; Chang-Hong Hsu; Qi Li; Dean Tullsen; Lingjia Tang; Jason Mars
Thermal time shifting: Decreasing datacenter cooling costs with phase change materials Journal Article
In: IEEE Internet Computing, 2017.
@article{skach2017thermal,
title = {Thermal time shifting: Decreasing datacenter cooling costs with phase change materials},
author = {Matt Skach and Manish Arora and Chang-Hong Hsu and Qi Li and Dean Tullsen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2749469.2749474.pdf},
year = {2017},
date = {2017-01-01},
journal = {IEEE Internet Computing},
publisher = {IEEE},
abstract = {Datacenters, or warehouse scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
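A toy energy-balance sketch of the thermal time shifting idea: when server heat output exceeds the cooling plant's capacity, the phase change material absorbs the surplus up to its latent heat capacity, and releases it later when there is cooling slack. Units, capacities, and the hourly load curve below are hypothetical; the paper validates a far more detailed thermal model.

```python
# Toy energy-balance sketch of thermal time shifting with a phase change material (PCM).
# All capacities and the load curve are hypothetical illustrations.

def simulate(heat_kw, cooling_kw, pcm_capacity_kwh, hours_per_step=1.0):
    """Track PCM stored thermal energy over time and any heat nobody could remove."""
    stored, trace, overflow = 0.0, [], 0.0
    for q in heat_kw:
        surplus_kwh = (q - cooling_kw) * hours_per_step
        if surplus_kwh > 0:                          # cooling undersized: PCM absorbs heat
            absorbed = min(surplus_kwh, pcm_capacity_kwh - stored)
            stored += absorbed
            overflow += surplus_kwh - absorbed
        else:                                        # cooling slack: PCM releases heat
            stored = max(0.0, stored + surplus_kwh)
        trace.append(stored)
    return trace, overflow

if __name__ == "__main__":
    # Diurnal load: heat peaks above the 90 kW cooling capacity for a few hours.
    heat = [70, 75, 85, 100, 110, 105, 95, 80, 70, 65]
    trace, overflow = simulate(heat, cooling_kw=90, pcm_capacity_kwh=60)
    print("stored kWh per hour:", [round(s, 1) for s in trace], "| uncooled kWh:", overflow)
```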
Parker Hill; Animesh Jain; Mason Hill; Babak Zamirai; Chang-Hong Hsu; Michael A Laurenzano; Scott Mahlke; Lingjia Tang; Jason Mars
DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission Proceedings Article
In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 786–799, 2017.
@inproceedings{hill2017deftnn,
title = {DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission},
author = {Parker Hill and Animesh Jain and Mason Hill and Babak Zamirai and Chang-Hong Hsu and Michael A Laurenzano and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3123939.3123970.pdf},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
pages = {786--799},
abstract = {Deep neural networks (DNNs) are key computational building blocks for emerging classes of web services that interact in real time with users via voice, images and video inputs. Although GPUs have gained popularity as a key accelerator platform for deep learning workloads, the increasing demand for DNN computation leaves a significant gap between the compute capabilities of GPU-enabled datacenters and the compute needed to service demand.
The state-of-the-art techniques to improve DNN performance have significant limitations in bridging the gap on real systems. Current network pruning techniques remove computation, but the resulting networks map poorly to GPU architectures, yielding no performance benefit or even slowdowns. Meanwhile, current bandwidth optimization techniques focus on reducing off-chip bandwidth while overlooking on-chip bandwidth, a key DNN bottleneck.
To address these limitations, this work introduces DeftNN, a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance. DeftNN is composed of two novel optimization techniques - (1) synapse vector elimination, a technique that identifies non-contributing synapses in the DNN and carefully transforms data and removes the computation and data movement of these synapses while fully utilizing the GPU to improve performance, and (2) near-compute data fission, a mechanism for scaling down the on-chip data movement requirements within DNN computations. Our evaluation of DeftNN spans 6 state-of-the-art DNNs. By applying both optimizations in concert, DeftNN is able to achieve an average speedup of 2.1X on real GPU hardware. We also introduce a small additional hardware unit per GPU core to facilitate efficient data fission operations, increasing the speedup achieved by DeftNN to 2.6X.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
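The NumPy sketch below illustrates the intuition behind synapse vector elimination: drop entire weight columns and the matching input rows, so what remains is still a dense, GPU-friendly matrix multiply rather than an irregular sparse one. The column-magnitude heuristic and the 20% elimination ratio are assumptions for illustration; DeftNN additionally transforms the data layout to preserve accuracy and performance.

```python
# Sketch of the intuition behind synapse vector elimination: remove whole weight
# columns (and the matching input rows) so the remaining work stays a dense GEMM.
# The magnitude heuristic and elimination ratio are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))     # output_features x input_features
x = rng.standard_normal((512, 64))      # input_features x batch

keep_ratio = 0.8
col_scores = np.linalg.norm(W, axis=0)                  # contribution proxy per input feature
keep = np.argsort(col_scores)[-int(keep_ratio * W.shape[1]):]
keep.sort()

W_small, x_small = W[:, keep], x[keep, :]               # both operands shrink together
y_exact, y_approx = W @ x, W_small @ x_small

rel_err = np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact)
print(f"dense GEMM shrunk from {W.shape} to {W_small.shape}, relative error {rel_err:.3f}")
```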
2016
Animesh Jain; Parker Hill; Michael A Laurenzano; Md E Haque; Muneeb Khan; Scott Mahlke; Lingjia Tang; Jason Mars
CPSA: Compute precisely store approximately Proceedings Article
In: Workshop on Approximate Computing Across the Stack, 2016.
@inproceedings{jain2016cpsa,
title = {CPSA: Compute precisely store approximately},
author = {Animesh Jain and Parker Hill and Michael A Laurenzano and Md E Haque and Muneeb Khan and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/jain.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Workshop on Approximate Computing Across the Stack},
abstract = {We propose a new approximate-computing paradigm, where computations are performed precisely while the data is stored approximately in the memory using data packing. This lets us reduce the memory traffic, improving application memory behavior. It achieves 85% memory savings for an accuracy target of 90%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Johann Hauswald; Michael A Laurenzano; Yunqi Zhang; Hailong Yang; Yiping Kang; Cheng Li; Austin Rovinski; Arjun Khurana; Ronald G Dreslinski; Trevor Mudge; others
Designing future warehouse-scale computers for Sirius, an end-to-end voice and vision personal assistant Journal Article
In: ACM Transactions on Computer Systems (TOCS), vol. 34, no. 1, pp. 1–32, 2016.
@article{hauswald2016designing,
title = {Designing future warehouse-scale computers for Sirius, an end-to-end voice and vision personal assistant},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Hailong Yang and Yiping Kang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2870631.pdf},
year = {2016},
date = {2016-01-01},
journal = {ACM Transactions on Computer Systems (TOCS)},
volume = {34},
number = {1},
pages = {1--32},
publisher = {ACM New York, NY, USA},
abstract = {As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Quan Chen; Hailong Yang; Jason Mars; Lingjia Tang
Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers Journal Article
In: ACM SIGPLAN Notices, vol. 51, no. 4, pp. 681–696, 2016.
@article{chen2016baymax,
title = {Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers},
author = {Quan Chen and Hailong Yang and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2872362.2872368.pdf},
year = {2016},
date = {2016-01-01},
journal = {ACM SIGPLAN Notices},
volume = {51},
number = {4},
pages = {681--696},
publisher = {ACM New York, NY, USA},
abstract = {Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different than contention on multi-core CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the main two factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on a Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Yunqi Zhang; David Meisner; Jason Mars; Lingjia Tang
Treadmill: Attributing the source of tail latency through precise load testing and statistical inference Proceedings Article
In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 456–468, IEEE 2016.
@inproceedings{zhang2016treadmill,
title = {Treadmill: Attributing the source of tail latency through precise load testing and statistical inference},
author = {Yunqi Zhang and David Meisner and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/ISCA.2016.47.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
pages = {456--468},
organization = {IEEE},
abstract = {Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions.
In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically-sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
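The sketch below shows the flavor of tail-latency factor attribution with quantile regression at the 99th percentile, using statsmodels. The synthetic latency data and the single binary "feature on/off" factor are stand-ins for the factorial measurements Treadmill performs on production workloads and hardware.

```python
# Sketch of tail-latency factor attribution with quantile regression,
# in the spirit of Treadmill's methodology. Data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
feature_on = rng.integers(0, 2, n)                       # e.g., a hardware feature toggle
base = rng.lognormal(mean=0.0, sigma=0.4, size=n)        # heavy-tailed service time (ms)
latency = base + np.where(feature_on, 0.0, 0.8 * rng.pareto(3.0, n))  # off => fatter tail

df = pd.DataFrame({"latency_ms": latency, "feature_on": feature_on})
fit = smf.quantreg("latency_ms ~ feature_on", df).fit(q=0.99)
print(fit.params)   # the feature_on coefficient estimates its effect on the 99th percentile
```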
Michael A Laurenzano; Yunqi Zhang; Jiang Chen; Lingjia Tang; Jason Mars
PowerChop: Identifying and managing non-critical units in hybrid processor architectures Proceedings Article
In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 140–152, IEEE 2016.
@inproceedings{laurenzano2016powerchop,
title = {PowerChop: Identifying and managing non-critical units in hybrid processor architectures},
author = {Michael A Laurenzano and Yunqi Zhang and Jiang Chen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3007787.3001152.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
pages = {140--152},
organization = {IEEE},
abstract = {On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics.
This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by powering gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Michael A Laurenzano; Parker Hill; Mehrzad Samadi; Scott Mahlke; Jason Mars; Lingjia Tang
Input responsiveness: using canary inputs to dynamically steer approximation Proceedings Article
In: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 161–176, 2016.
@inproceedings{laurenzano2016input,
title = {Input responsiveness: using canary inputs to dynamically steer approximation},
author = {Michael A Laurenzano and Parker Hill and Mehrzad Samadi and Scott Mahlke and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2908080.2908087.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation},
pages = {161--176},
abstract = {This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
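To illustrate the canary-driven selection loop described in the IRA abstract, here is a small sketch: build a tiny canary from the full input, try a spectrum of approximation settings on the canary, and apply the fastest setting that meets the accuracy target to the full input. The moving-average "workload", the stride-based approximation knob, and the subsampled canary are placeholders chosen for illustration, not the paper's canary-construction technique.

```python
# Sketch of Input Responsive Approximation's selection loop. The workload,
# approximation knob, and canary construction are illustrative placeholders.
import time
import numpy as np

def exact(signal):
    return np.convolve(signal, np.ones(64) / 64, mode="same")

def approximate(signal, stride):
    # Perforation-style knob: compute on every `stride`-th sample, then repeat.
    coarse = exact(signal[::stride])
    return np.repeat(coarse, stride)[: len(signal)]

def accuracy(ref, out):
    return 1.0 - np.linalg.norm(ref - out) / (np.linalg.norm(ref) + 1e-12)

def pick_setting(full_input, target=0.95, knobs=(8, 4, 2, 1)):
    canary = full_input[:: max(1, len(full_input) // 4096)]   # tiny stand-in input
    ref = exact(canary)
    for stride in knobs:                                      # fastest knob first
        if accuracy(ref, approximate(canary, stride)) >= target:
            return stride
    return 1                                                  # fall back to exact

if __name__ == "__main__":
    x = np.cumsum(np.random.default_rng(2).standard_normal(1_000_000))
    stride = pick_setting(x)
    t0 = time.perf_counter(); y = approximate(x, stride); dt = time.perf_counter() - t0
    print(f"chose stride {stride}; full-input run took {dt * 1000:.1f} ms")
```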
Animesh Jain; Michael A Laurenzano; Lingjia Tang; Jason Mars
Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting Proceedings Article
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016.
@inproceedings{jain2016continuous,
title = {Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting},
author = {Animesh Jain and Michael A Laurenzano and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195666.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--12},
organization = {IEEE},
abstract = {The class of optimizations characterized by manipulating a loop's iteration space for improved cache locality and reuse (i.e., cache tiling / blocking / strip mine and interchange) are static optimizations requiring a priori information about the microarchitectural and runtime environment of an application binary. However, particularly in datacenter environments, deployed applications face numerous dynamic environments over their lifetimes. As a result, this class of optimizations can result in sub-optimal performance due to the inability to flexibly adapt iteration spaces as cache conditions change at runtime.
This paper introduces continuous shape shifting, a compilation approach that removes the risks of cache tiling optimizations by dynamically rewriting (and reshaping) deployed, running application code. To realize continuous shape shifting, we present ShapeShifter, a framework for continuous monitoring of co-running applications and their runtime environments to reshape loop iteration spaces and pinpoint near-optimal loop tile configurations. Upon identifying a need for reshaping, a new tiling approach is quickly constructed for the application, new code is dynamically generated and is then seamlessly stitched into the running application with near-zero overhead. Our evaluation on a wide spectrum of runtime scenarios demonstrates that ShapeShifter achieves an average of 10--40% performance improvement (up to 2.4X) on real systems depending on the runtime environment compared to an oracle static loop tiling baseline.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
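The following sketch shows the decision ShapeShifter automates: pick a loop tile size whose working set fits in the cache capacity currently available to the application, and re-pick when a co-runner changes that capacity. The cache capacities and the three-operand working-set rule are illustrative assumptions; the real system rewrites the running native loop nest in place rather than re-running interpreted code.

```python
# Sketch of runtime tile selection in the spirit of ShapeShifter.
# Cache capacities and the working-set rule are illustrative assumptions.
import numpy as np

def pick_tile(effective_cache_bytes, elem_bytes=8, candidates=(512, 256, 128, 64, 32, 16)):
    for t in candidates:                       # largest tile that still fits
        working_set = 3 * t * t * elem_bytes   # A-tile, B-tile, C-tile
        if working_set <= effective_cache_bytes:
            return t
    return candidates[-1]

def tiled_matmul(A, B, tile):
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

if __name__ == "__main__":
    A = np.ones((256, 256)); B = np.ones((256, 256))
    for cache in (8 * 2**20, 1 * 2**20):       # a co-runner arrives, cache share shrinks
        tile = pick_tile(cache)
        C = tiled_matmul(A, B, tile)
        print(f"effective cache {cache >> 20} MiB -> tile {tile}, C[0,0]={C[0,0]:.0f}")
```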
Animesh Jain; Parker Hill; Shih-Chieh Lin; Muneeb Khan; Md E Haque; Michael A Laurenzano; Scott Mahlke; Lingjia Tang; Jason Mars
Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation Proceedings Article
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, IEEE 2016.
@inproceedings{jain2016concise,
title = {Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation},
author = {Animesh Jain and Parker Hill and Shih-Chieh Lin and Muneeb Khan and Md E Haque and Michael A Laurenzano and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195688.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--13},
organization = {IEEE},
abstract = {Cache capacity and memory bandwidth play critical roles in application performance, particularly for data-intensive applications from domains that include machine learning, numerical analysis, and data mining. Many of these applications are also tolerant to imprecise inputs and have loose constraints on the quality of output, making them ideal candidates for approximate computing. This paper introduces a novel approximate computing technique that decouples the format of data in the memory hierarchy from the format of data in the compute subsystem to significantly reduce the cost of storing and moving bits throughout the memory hierarchy and improve application performance. This asymmetric compute-memory extension to conventional architectures, ACME, adds two new instruction classes to the ISA - load-concise and store-concise - along with three small functional units to the micro-architecture to support these instructions. ACME does not affect exact execution of applications and comes into play only when concise memory operations are used. Through detailed experimentation we find that ACME is very effective at trading result accuracy for improved application performance. Our results show that ACME achieves a 1.3X speedup (up to 1.8X) while maintaining 99% accuracy, or a 1.1X speedup while maintaining 99.999% accuracy. Moreover, our approach incurs negligible area and power overheads, adding just 0.005% area and 0.1% power to a conventional modern architecture.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
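Below is a software emulation of the idea behind store-concise and load-concise: values are narrowed on their way to memory and widened back before use, while all arithmetic stays in full precision. Using float16 as the concise format is an assumption made for this illustration; ACME proposes ISA extensions and small functional units, not a data-type change in software.

```python
# Software emulation of compute-precisely / store-approximately via concise
# loads and stores. The float16 concise format is an illustrative assumption.
import numpy as np

def store_concise(values, concise_dtype=np.float16):
    return values.astype(concise_dtype)            # half the memory traffic of float32

def load_concise(packed):
    return packed.astype(np.float32)               # widened before precise compute

rng = np.random.default_rng(3)
a, b = rng.standard_normal((2, 1_000_000)).astype(np.float32)

a_mem, b_mem = store_concise(a), store_concise(b)  # what lives in the memory hierarchy
result = load_concise(a_mem) * load_concise(b_mem) + 1.0   # compute precisely

exact = a * b + 1.0
rel_err = np.abs(result - exact).mean() / np.abs(exact).mean()
print(f"memory footprint halved; mean relative error {rel_err:.2e}")
```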
Stephen Zekany; Daniel Rings; Nathan Harada; Michael A Laurenzano; Lingjia Tang; Jason Mars
CrystalBall: Statically analyzing runtime behavior via deep sequence learning Proceedings Article
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016.
@inproceedings{zekany2016crystalball,
title = {CrystalBall: Statically analyzing runtime behavior via deep sequence learning},
author = {Stephen Zekany and Daniel Rings and Nathan Harada and Michael A Laurenzano and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195667.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--12},
organization = {IEEE},
abstract = {Understanding dynamic program behavior is critical in many stages of the software development lifecycle, for purposes as diverse as optimization, debugging, testing, and security. This paper focuses on the problem of predicting dynamic program behavior statically. We introduce a novel technique to statically identify hot paths that leverages emerging deep learning techniques to take advantage of their ability to learn subtle, complex relationships between sequences of inputs. This approach maps well to the problem of identifying the behavior of sequences of basic blocks in program execution. Our technique is also designed to operate on the compiler's intermediate representation (IR), as opposed to the approaches taken by prior techniques that have focused primarily on source code, giving our approach language-independence. We describe the pitfalls of conventional metrics used for hot path prediction such as accuracy, and motivate the use of Area Under the Receiver Operating Characteristic curve (AUROC). Through a thorough evaluation of our technique on complex applications that include the SPEC CPU2006 benchmarks, we show that our approach achieves an AUROC of 0.85.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Johann Hauswald; Michael A Laurenzano; Yunqi Zhang; Cheng Li; Austin Rovinski; Arjun Khurana; Ronald G Dreslinski; Trevor Mudge; Vinicius Petrucci; Lingjia Tang; others
Sirius implications for future warehouse-scale computers Journal Article
In: IEEE Micro, vol. 36, no. 3, pp. 42–53, 2016.
@article{hauswald2016sirius,
title = {Sirius implications for future warehouse-scale computers},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and Vinicius Petrucci and Lingjia Tang and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/07478443.pdf},
year = {2016},
date = {2016-01-01},
journal = {IEEE Micro},
volume = {36},
number = {3},
pages = {42--53},
publisher = {IEEE},
abstract = {Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple's Siri. If the prediction of the trend is correct, these types of applications will likely consume most of the world's computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2015
Vinicius Petrucci; Michael A Laurenzano; John Doherty; Yunqi Zhang; Daniel Mosse; Jason Mars; Lingjia Tang
Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers Proceedings Article
In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 246–258, IEEE 2015.
@inproceedings{petrucci2015octopus,
title = {Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers},
author = {Vinicius Petrucci and Michael A Laurenzano and John Doherty and Yunqi Zhang and Daniel Mosse and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07056037.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)},
pages = {246--258},
organization = {IEEE},
abstract = {Heterogeneous multicore architectures have the potential to improve energy efficiency by integrating power-efficient wimpy cores with high-performing brawny cores. However, it is an open question as to how to deliver energy reduction while ensuring the quality of service (QoS) of latency-sensitive web-services running on such heterogeneous multicores in warehouse-scale computers (WSCs). In this work, we first investigate the implications of heterogeneous multicores in WSCs and show that directly adopting heterogeneous multicores without re-designing the software stack to provide QoS management leads to significant QoS violations. We then present Octopus-Man, a novel QoS-aware task management solution that dynamically maps latency-sensitive tasks to the least power-hungry processing resources that are sufficient to meet the QoS requirements. Using carefully-designed feedback-control mechanisms, Octopus-Man addresses critical challenges that emerge due to uncertainties in workload fluctuations and adaptation dynamics in a real system. Our evaluation using web-search and memcached running on a real-system Intel heterogeneous prototype demonstrates that Octopus-Man improves energy efficiency by up to 41% (CPU power) and up to 15% (system power) over an all-brawny WSC design while adhering to specified QoS targets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
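A minimal sketch of the feedback rule behind Octopus-Man's mapping policy: keep latency-sensitive work on the least power-hungry core configuration that still meets the QoS target, escalating toward brawny cores when the measured tail latency approaches the target and de-escalating only with comfortable headroom. The configuration names, thresholds, and sample latencies are assumptions for illustration.

```python
# Sketch of a QoS-driven wimpy/brawny mapping policy with hysteresis,
# in the spirit of Octopus-Man. Configurations and thresholds are hypothetical.

CORE_CONFIGS = ["wimpy-2", "wimpy-4", "brawny-1", "brawny-2", "brawny-4"]  # power order

def next_config(current_idx, p99_ms, qos_ms, upshift=0.95, downshift=0.60):
    """Escalate when the 99th percentile nears the target; de-escalate only
    when there is comfortable headroom (hysteresis avoids oscillation)."""
    if p99_ms > upshift * qos_ms and current_idx < len(CORE_CONFIGS) - 1:
        return current_idx + 1
    if p99_ms < downshift * qos_ms and current_idx > 0:
        return current_idx - 1
    return current_idx

if __name__ == "__main__":
    idx, qos = 0, 100.0
    for measured_p99 in [40, 70, 98, 120, 90, 55, 48]:     # hypothetical samples (ms)
        idx = next_config(idx, measured_p99, qos)
        print(f"p99={measured_p99:>5.1f} ms -> map to {CORE_CONFIGS[idx]}")
```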
Chang-Hong Hsu; Yunqi Zhang; Michael A Laurenzano; David Meisner; Thomas Wenisch; Jason Mars; Lingjia Tang; Ronald G Dreslinski
Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting Proceedings Article
In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 271–282, IEEE 2015.
@inproceedings{hsu2015adrenaline,
title = {Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting},
author = {Chang-Hong Hsu and Yunqi Zhang and Michael A Laurenzano and David Meisner and Thomas Wenisch and Jason Mars and Lingjia Tang and Ronald G Dreslinski},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07056039.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)},
pages = {271--282},
organization = {IEEE},
abstract = {Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor's voltage and frequency during a coarse-grain sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach to leverage finer granularity, 10's of nanoseconds, voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; and second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50x tail latency improvement for Memcached and up to a 3.03x for Web Search over coarse-grained DVFS given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81x improvement for Memcached and up to a 1.99x for Web Search over coarse-grained DVFS.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
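The sketch below illustrates Adrenaline's query-level boosting decision: a cheap per-query indicator predicts which queries are likely to land in the tail, and only those are boosted, subject to a boost budget. The request-size indicator, the threshold, the 30% speedup from boosting, and the token budget are all hypothetical stand-ins for the paper's indicators and fine-grained voltage/frequency mechanism.

```python
# Sketch of per-query boost triggering in the spirit of Adrenaline.
# Indicator, threshold, boost benefit, and budget are hypothetical.
import random

random.seed(4)

def service_time_ms(req_size_kb):
    # Hypothetical: latency grows with request size plus exponential noise.
    return 0.2 * req_size_kb + random.expovariate(1 / 2.0)

def handle_query(req_size_kb, boost_tokens, size_threshold_kb=40):
    likely_tail = req_size_kb > size_threshold_kb      # cheap per-query indicator
    boosted = likely_tail and boost_tokens > 0
    t = service_time_ms(req_size_kb)
    if boosted:
        t *= 0.7                                        # assumed benefit of boosting
        boost_tokens -= 1
    return t, boost_tokens

if __name__ == "__main__":
    tokens, latencies = 200, []
    for _ in range(1000):
        size = random.paretovariate(2.0) * 10           # heavy-tailed request sizes (KB)
        t, tokens = handle_query(size, tokens)
        latencies.append(t)
    latencies.sort()
    print(f"p99 = {latencies[int(0.99 * len(latencies))]:.1f} ms, boosts left: {tokens}")
```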
Johann Hauswald; Michael A Laurenzano; Yunqi Zhang; Cheng Li; Austin Rovinski; Arjun Khurana; Ronald G Dreslinski; Trevor Mudge; Vinicius Petrucci; Lingjia Tang; others
Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers Proceedings Article
In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 223–238, 2015.
@inproceedings{hauswald2015sirius,
title = {Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and Vinicius Petrucci and Lingjia Tang and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/2694344.2694347.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {223--238},
abstract = {As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs.
To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Matt Skach; Manish Arora; Chang-Hong Hsu; Qi Li; Dean Tullsen; Lingjia Tang; Jason Mars
Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers Proceedings Article
In: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 439–449, 2015.
@inproceedings{skach2015thermal,
title = {Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers},
author = {Matt Skach and Manish Arora and Chang-Hong Hsu and Qi Li and Dean Tullsen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07284085.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
pages = {439--449},
abstract = {Datacenters, or warehouse scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Johann Hauswald; Yiping Kang; Michael A Laurenzano; Quan Chen; Cheng Li; Trevor Mudge; Ronald G Dreslinski; Jason Mars; Lingjia Tang
DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers Proceedings Article
In: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 27–40, IEEE 2015.
@inproceedings{hauswald2015djinn,
title = {DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers},
author = {Johann Hauswald and Yiping Kang and Michael A Laurenzano and Quan Chen and Cheng Li and Trevor Mudge and Ronald G Dreslinski and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07284053.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)},
pages = {27--40},
organization = {IEEE},
abstract = {As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, webservice companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}