2017
Skach, Matt; Arora, Manish; Hsu, Chang-Hong; Li, Qi; Tullsen, Dean; Tang, Lingjia; Mars, Jason
Thermal time shifting: Decreasing datacenter cooling costs with phase change materials Journal Article
In: IEEE Internet Computing, 2017.
@article{skach2017thermal,
title = {Thermal time shifting: Decreasing datacenter cooling costs with phase change materials},
author = {Matt Skach and Manish Arora and Chang-Hong Hsu and Qi Li and Dean Tullsen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2749469.2749474.pdf},
year = {2017},
date = {2017-01-01},
journal = {IEEE Internet Computing},
publisher = {IEEE},
abstract = {Datacenters, or warehouse scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
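To make the mechanism in the abstract above concrete, the following toy energy-balance model shows how a PCM buffer lets cooling be provisioned below the peak heat load; every constant (loads, cooling capacity, latent heat) is hypothetical and far simpler than the validated methodology in the paper.

# Toy illustration of thermal time shifting with a phase change material (PCM).
# All numbers are hypothetical; the paper's validated model is far more detailed.
import math

COOLING_CAPACITY_KW = 105.0     # provisioned cooling, well below the ~130 kW peak load
PCM_LATENT_HEAT_KWH = 180.0     # heat the PCM can absorb while melting (kWh)

def server_heat_load_kw(hour):
    """Diurnal heat load: peaks mid-afternoon, dips overnight."""
    return 100.0 + 30.0 * math.sin(2 * math.pi * (hour - 9) / 24)

stored_kwh = 0.0
unmet_kwh = 0.0
for hour in range(24):
    load = server_heat_load_kw(hour)
    surplus = load - COOLING_CAPACITY_KW          # heat the cooling system cannot remove
    if surplus > 0:
        absorbed = min(surplus, PCM_LATENT_HEAT_KWH - stored_kwh)
        stored_kwh += absorbed
        unmet_kwh += surplus - absorbed           # would force throttling without bigger cooling
    else:
        released = min(-surplus, stored_kwh)      # spare cooling capacity re-freezes the PCM
        stored_kwh -= released
    print(f"hour {hour:2d}: load {load:6.1f} kW, PCM stores {stored_kwh:6.1f} kWh")

print(f"heat covered by neither cooling nor PCM: {unmet_kwh:.1f} kWh")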
Hill, Parker; Jain, Animesh; Hill, Mason; Zamirai, Babak; Hsu, Chang-Hong; Laurenzano, Michael A; Mahlke, Scott; Tang, Lingjia; Mars, Jason
DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission Inproceedings
In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 786–799, 2017.
@inproceedings{hill2017deftnn,
title = {DeftNN: Addressing bottlenecks for DNN execution on GPUs via synapse vector elimination and near-compute data fission},
author = {Parker Hill and Animesh Jain and Mason Hill and Babak Zamirai and Chang-Hong Hsu and Michael A Laurenzano and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3123939.3123970.pdf},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture},
pages = {786--799},
abstract = {Deep neural networks (DNNs) are key computational building blocks for emerging classes of web services that interact in real time with users via voice, images and video inputs. Although GPUs have gained popularity as a key accelerator platform for deep learning workloads, the increasing demand for DNN computation leaves a significant gap between the compute capabilities of GPU-enabled datacenters and the compute needed to service demand.
The state-of-the-art techniques to improve DNN performance have significant limitations in bridging the gap on real systems. Current network pruning techniques remove computation, but the resulting networks map poorly to GPU architectures, yielding no performance benefit or even slowdowns. Meanwhile, current bandwidth optimization techniques focus on reducing off-chip bandwidth while overlooking on-chip bandwidth, a key DNN bottleneck.
To address these limitations, this work introduces DeftNN, a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance. DeftNN is composed of two novel optimization techniques - (1) synapse vector elimination, a technique that identifies non-contributing synapses in the DNN and carefully transforms data and removes the computation and data movement of these synapses while fully utilizing the GPU to improve performance, and (2) near-compute data fission, a mechanism for scaling down the on-chip data movement requirements within DNN computations. Our evaluation of DeftNN spans 6 state-of-the-art DNNs. By applying both optimizations in concert, DeftNN is able to achieve an average speedup of 2.1X on real GPU hardware. We also introduce a small additional hardware unit per GPU core to facilitate efficient data fission operations, increasing the speedup achieved by DeftNN to 2.6X.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
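As a rough illustration of the synapse vector elimination idea described above, the sketch below drops low-magnitude weight rows and runs a smaller but still dense matrix multiply, which is why the transformation maps well to GPUs where unstructured pruning does not. The threshold, shapes, and random data are invented; DeftNN's actual selection and data transformation (and near-compute data fission) are not modeled here.

# Minimal sketch: eliminate weak "synapse vectors" (rows of the weight matrix) and run a
# smaller, still-dense GEMM with the same output shape.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((64, 256))                 # batch x input neurons
weights = rng.standard_normal((256, 128))                     # input neurons x output neurons
weights[rng.choice(256, size=64, replace=False), :] *= 0.01   # some near-dead synapse vectors

# Rank synapse vectors by magnitude and keep the strongest 75%.
norms = np.linalg.norm(weights, axis=1)
keep = norms >= np.quantile(norms, 0.25)

full = activations @ weights                                  # baseline GEMM
compact = activations[:, keep] @ weights[keep, :]             # smaller but still-dense GEMM

rel_err = np.linalg.norm(full - compact) / np.linalg.norm(full)
print(f"kept {int(keep.sum())}/256 synapse vectors, relative output error {rel_err:.3f}")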
2016
Jain, Animesh; Hill, Parker; Laurenzano, Michael A; Haque, Md E; Khan, Muneeb; Mahlke, Scott; Tang, Lingjia; Mars, Jason
CPSA: Compute precisely store approximately Inproceedings
In: Workshop on Approximate Computing Across the Stack, 2016.
@inproceedings{jain2016cpsa,
title = {CPSA: Compute precisely store approximately},
author = {Animesh Jain and Parker Hill and Michael A Laurenzano and Md E Haque and Muneeb Khan and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/jain.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Workshop on Approximate Computing Across the Stack},
abstract = {We propose a new approximate-computing paradigm, where computations are performed precisely while the data is stored approximately in the memory using data packing. This lets us reduce the memory traffic, improving application memory behavior. It achieves 85% memory savings for an accuracy target of 90%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
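The compute-precisely/store-approximately split can be sketched in a few lines: values are narrowed when written to memory and widened back before arithmetic, shrinking memory traffic while the datapath stays precise. Casting to float16 below is only a stand-in for the paper's packing scheme.

# Toy illustration of "compute precisely, store approximately".
import numpy as np

precise = np.random.default_rng(1).standard_normal(1_000_000)   # float64 working data

stored = precise.astype(np.float16)      # "store approximately": 4x fewer bytes in memory
loaded = stored.astype(np.float64)       # "load" widens back before precise arithmetic

exact = np.square(precise).mean()        # the computation itself stays full-precision
approx = np.square(loaded).mean()
print(f"memory footprint vs. precise store: {stored.nbytes / precise.nbytes:.2f}x")
print(f"relative error in the downstream result: {abs(exact - approx) / exact:.2e}")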
Hauswald, Johann; Laurenzano, Michael A; Zhang, Yunqi; Yang, Hailong; Kang, Yiping; Li, Cheng; Rovinski, Austin; Khurana, Arjun; Dreslinski, Ronald G; Mudge, Trevor; et al.
Designing future warehouse-scale computers for Sirius, an end-to-end voice and vision personal assistant Journal Article
In: ACM Transactions on Computer Systems (TOCS), vol. 34, no. 1, pp. 1–32, 2016.
@article{hauswald2016designing,
title = {Designing future warehouse-scale computers for Sirius, an end-to-end voice and vision personal assistant},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Hailong Yang and Yiping Kang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2870631.pdf},
year = {2016},
date = {2016-01-01},
journal = {ACM Transactions on Computer Systems (TOCS)},
volume = {34},
number = {1},
pages = {1--32},
publisher = {ACM New York, NY, USA},
abstract = {As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
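A back-of-the-envelope version of the TCO reasoning above: a server that answers queries N times faster shrinks the fleet needed for a fixed throughput, trading higher per-server cost and power against fewer machines. All prices, power draws, and speedups in this sketch are hypothetical placeholders, not the paper's measured numbers.

# Hypothetical TCO comparison of accelerated server designs for a fixed query throughput.
def tco(num_servers, server_cost, server_power_kw, years=3, dollars_per_kwh=0.07):
    capex = num_servers * server_cost
    opex = num_servers * server_power_kw * 24 * 365 * years * dollars_per_kwh
    return capex + opex

target_qps = 10_000
baseline_qps_per_server = 50

for name, speedup, cost, power_kw in [
    ("CPU-only", 1.0,  4_000, 0.30),
    ("GPU     ", 8.5,  7_000, 0.55),
    ("FPGA    ", 15.0, 6_000, 0.40),
]:
    servers = -(-target_qps // int(baseline_qps_per_server * speedup))   # ceiling division
    print(f"{name}: {servers:4d} servers, 3-year TCO ${tco(servers, cost, power_kw):,.0f}")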
Chen, Quan; Yang, Hailong; Mars, Jason; Tang, Lingjia
Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers Journal Article
In: ACM SIGPLAN Notices, vol. 51, no. 4, pp. 681–696, 2016.
@article{chen2016baymax,
title = {Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers},
author = {Quan Chen and Hailong Yang and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2872362.2872368.pdf},
year = {2016},
date = {2016-01-01},
journal = {ACM SIGPLAN Notices},
volume = {51},
number = {4},
pages = {681--696},
publisher = {ACM New York, NY, USA},
abstract = {Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different than contention on multi-core CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the main two factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on a Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
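The scheduling insight above, that work on a non-preemptive accelerator must be admitted with its predicted duration in mind, can be sketched as follows: a best-effort kernel is launched only if the next latency-critical task would still meet its deadline afterwards. Task names, predicted durations, and deadlines are hypothetical; Baymax's duration prediction and PCI-e bandwidth management are not modeled.

# Minimal sketch of duration-aware admission on a non-preemptive accelerator.
from collections import deque

gpu_busy_until_ms = 0.0
now_ms = 0.0

lc_queue = deque([{"name": "lc-query-1", "deadline_ms": 40.0, "pred_ms": 12.0}])
be_queue = deque([{"name": "batch-job-a", "pred_ms": 30.0},
                  {"name": "batch-job-b", "pred_ms": 6.0}])

def admit_best_effort(task):
    """Launch a best-effort kernel only if the waiting latency-critical task can still meet its deadline."""
    global gpu_busy_until_ms
    finish_be = max(now_ms, gpu_busy_until_ms) + task["pred_ms"]
    if lc_queue and finish_be + lc_queue[0]["pred_ms"] > lc_queue[0]["deadline_ms"]:
        return False                                  # would push the LC task past its QoS target
    gpu_busy_until_ms = finish_be
    return True

for task in be_queue:
    verdict = "launched" if admit_best_effort(task) else "deferred"
    print(f'{task["name"]}: {verdict}')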
Zhang, Yunqi; Meisner, David; Mars, Jason; Tang, Lingjia
Treadmill: Attributing the source of tail latency through precise load testing and statistical inference Inproceedings
In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 456–468, IEEE 2016.
@inproceedings{zhang2016treadmill,
title = {Treadmill: Attributing the source of tail latency through precise load testing and statistical inference},
author = {Yunqi Zhang and David Meisner and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/ISCA.2016.47.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
pages = {456--468},
organization = {IEEE},
abstract = {Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions.
In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically-sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
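In the spirit of the statistical rigor argued for above, the sketch below compares the 99th-percentile latency of two synthetic configurations with a bootstrap confidence interval rather than a single point estimate. It conveys only the flavor of the methodology; Treadmill's factor attribution additionally relies on quantile regression and a carefully designed load tester.

# Bootstrap comparison of p99 latency between two synthetic server configurations.
import random
import statistics

random.seed(7)

def sample_latencies(n, tail_weight):
    """Synthetic request latencies (ms): mostly fast, occasionally slow."""
    return [random.gauss(10, 2) + (random.expovariate(1 / 40) if random.random() < tail_weight else 0)
            for _ in range(n)]

def p99(xs):
    return statistics.quantiles(xs, n=100)[98]

config_a = sample_latencies(5000, tail_weight=0.02)
config_b = sample_latencies(5000, tail_weight=0.01)

diffs = []
for _ in range(1000):
    a = random.choices(config_a, k=len(config_a))     # resample with replacement
    b = random.choices(config_b, k=len(config_b))
    diffs.append(p99(a) - p99(b))
diffs.sort()
lo, hi = diffs[25], diffs[974]                        # ~95% bootstrap interval for the p99 difference
print(f"p99(A) - p99(B): point {p99(config_a) - p99(config_b):.2f} ms, 95% CI [{lo:.2f}, {hi:.2f}] ms")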
Laurenzano, Michael A; Zhang, Yunqi; Chen, Jiang; Tang, Lingjia; Mars, Jason
PowerChop: Identifying and managing non-critical units in hybrid processor architectures Inproceedings
In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 140–152, IEEE 2016.
@inproceedings{laurenzano2016powerchop,
title = {PowerChop: Identifying and managing non-critical units in hybrid processor architectures},
author = {Michael A Laurenzano and Yunqi Zhang and Jiang Chen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3007787.3001152.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
pages = {140--152},
organization = {IEEE},
abstract = {On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics.
This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by power gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
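A minimal sketch of the phase-level gating decision: estimate how much each on-core unit benefits the current phase and power-gate the ones below a threshold, re-evaluating at the next phase boundary. Unit names, leakage numbers, and benefit estimates are invented; PowerChop makes these decisions with HW/SW co-designed hybrid-processor support underneath the running binary.

# Hypothetical per-phase gating of on-core units that do not help the current phase.
LEAKAGE_W = {"vector-unit": 1.2, "l2-prefetcher": 0.6, "branch-predictor-tables": 0.8}

phase_benefit = {   # estimated speedup each unit provides in each phase (invented)
    "phase-A (scalar, branchy)": {"vector-unit": 0.00, "l2-prefetcher": 0.04, "branch-predictor-tables": 0.09},
    "phase-B (streaming SIMD)":  {"vector-unit": 0.35, "l2-prefetcher": 0.12, "branch-predictor-tables": 0.01},
}
BENEFIT_THRESHOLD = 0.02     # gate units whose estimated benefit is below this

for phase, benefits in phase_benefit.items():
    gated = [u for u, b in benefits.items() if b < BENEFIT_THRESHOLD]
    saved = sum(LEAKAGE_W[u] for u in gated)
    print(f"{phase}: gate {gated or 'nothing'}, saving {saved:.1f} W of leakage")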
Laurenzano, Michael A; Hill, Parker; Samadi, Mehrzad; Mahlke, Scott; Mars, Jason; Tang, Lingjia
Input responsiveness: using canary inputs to dynamically steer approximation Inproceedings
In: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 161–176, 2016.
@inproceedings{laurenzano2016input,
title = {Input responsiveness: using canary inputs to dynamically steer approximation},
author = {Michael A Laurenzano and Parker Hill and Mehrzad Samadi and Scott Mahlke and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/2908080.2908087.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation},
pages = {161--176},
abstract = {This paper introduces Input Responsive Approximation (IRA), an approach that uses a canary input — a small program input carefully constructed to capture the intrinsic properties of the original input — to automatically control how program approximation is applied on an input-by-input basis. Motivating this approach is the observation that many of the prior techniques focusing on choosing how to approximate arrive at conservative decisions by discounting substantial differences between inputs when applying approximation. The main challenges in overcoming this limitation lie in making the choice of how to approximate both effectively (e.g., the fastest approximation that meets a particular accuracy target) and rapidly for every input. With IRA, each time the approximate program is run, a canary input is constructed and used dynamically to quickly test a spectrum of approximation alternatives. Based on these runtime tests, the approximation that best fits the desired accuracy constraints is selected and applied to the full input to produce an approximate result. We use IRA to select and parameterize mixes of four approximation techniques from the literature for a range of 13 image processing, machine learning, and data mining applications. Our results demonstrate that IRA significantly outperforms prior approaches, delivering an average of 10.2× speedup over exact execution while minimizing accuracy losses in program outputs.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
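The canary-input control loop can be sketched directly from the abstract: build a small canary from the real input, test a spectrum of approximation settings on it, and apply the most aggressive setting that meets the accuracy target to the full input. The strided-mean "kernel", the 2% target, and the canary construction below are hypothetical stand-ins for the real approximation techniques IRA steers.

# Canary-driven selection of an approximation setting, per input.
import random

random.seed(3)
full_input = [random.gauss(100, 10) for _ in range(200_000)]
canary = full_input[::1000]                     # tiny input meant to mirror the original

def approx_mean(xs, stride):
    sampled = xs[::stride]
    return sum(sampled) / len(sampled)

exact_on_canary = approx_mean(canary, 1)
ACCURACY_TARGET = 0.02                          # at most 2% relative error

chosen = 1                                      # fall back to exact if nothing qualifies
for stride in (64, 32, 16, 8, 4, 2):            # most to least aggressive
    err = abs(approx_mean(canary, stride) - exact_on_canary) / abs(exact_on_canary)
    if err <= ACCURACY_TARGET:
        chosen = stride
        break

print(f"canary chose stride {chosen}; full-input approx mean = {approx_mean(full_input, chosen):.2f}")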
Jain, Animesh; Laurenzano, Michael A; Tang, Lingjia; Mars, Jason
Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting Inproceedings
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016.
@inproceedings{jain2016continuous,
title = {Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting},
author = {Animesh Jain and Michael A Laurenzano and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195666.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--12},
organization = {IEEE},
abstract = {The class of optimizations characterized by manipulating a loop's iteration space for improved cache locality and reuse (i.e., cache tiling / blocking / strip mine and interchange) are static optimizations requiring a priori information about the microarchitectural and runtime environment of an application binary. However, particularly in datacenter environments, deployed applications face numerous dynamic environments over their lifetimes. As a result, this class of optimizations can result in sub-optimal performance due to the inability to flexibly adapt iteration spaces as cache conditions change at runtime.
This paper introduces continuous shape shifting, a compilation approach that removes the risks of cache tiling optimizations by dynamically rewriting (and reshaping) deployed, running application code. To realize continuous shape shifting, we present ShapeShifter, a framework for continuous monitoring of co-running applications and their runtime environments to reshape loop iteration spaces and pinpoint near-optimal loop tile configurations. Upon identifying a need for reshaping, a new tiling approach is quickly constructed for the application, new code is dynamically generated and is then seamlessly stitched into the running application with near-zero overhead. Our evaluation on a wide spectrum of runtime scenarios demonstrates that ShapeShifter achieves an average of 10--40% performance improvement (up to 2.4X) on real systems depending on the runtime environment compared to an oracle static loop tiling baseline.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
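A minimal sketch of the selection step behind continuous shape shifting: time a few candidate tile sizes on the current machine and keep the fastest, re-exploring when conditions change. ShapeShifter does this by rewriting the loops of a deployed binary with near-zero overhead; the pure-Python timing loop below only conveys the runtime-selection idea.

# Runtime selection of a loop tile size by timing candidates (illustrative only).
import time
import random

N = 128
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

def tiled_matmul(A, B, tile):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a, row_c, row_b = A[i][k], C[i], B[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a * row_b[j]
    return C

best_tile, best_time = None, float("inf")
for tile in (16, 32, 64, 128):
    start = time.perf_counter()
    tiled_matmul(A, B, tile)        # a real system would time a small slice, not the whole job
    elapsed = time.perf_counter() - start
    if elapsed < best_time:
        best_tile, best_time = tile, elapsed
print(f"selected tile size {best_tile} ({best_time:.2f}s on this machine)")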
Jain, Animesh; Hill, Parker; Lin, Shih-Chieh; Khan, Muneeb; Haque, Md E; Laurenzano, Michael A; Mahlke, Scott; Tang, Lingjia; Mars, Jason
Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation Inproceedings
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, IEEE 2016.
@inproceedings{jain2016concise,
title = {Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation},
author = {Animesh Jain and Parker Hill and Shih-Chieh Lin and Muneeb Khan and Md E Haque and Michael A Laurenzano and Scott Mahlke and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195688.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--13},
organization = {IEEE},
abstract = {Cache capacity and memory bandwidth play critical roles in application performance, particularly for data-intensive applications from domains that include machine learning, numerical analysis, and data mining. Many of these applications are also tolerant to imprecise inputs and have loose constraints on the quality of output, making them ideal candidates for approximate computing. This paper introduces a novel approximate computing technique that decouples the format of data in the memory hierarchy from the format of data in the compute subsystem to significantly reduce the cost of storing and moving bits throughout the memory hierarchy and improve application performance. This asymmetric compute-memory extension to conventional architectures, ACME, adds two new instruction classes to the ISA - load-concise and store-concise - along with three small functional units to the micro-architecture to support these instructions. ACME does not affect exact execution of applications and comes into play only when concise memory operations are used. Through detailed experimentation we find that ACME is very effective at trading result accuracy for improved application performance. Our results show that ACME achieves a 1.3X speedup (up to 1.8X) while maintaining 99% accuracy, or a 1.1X speedup while maintaining 99.999% accuracy. Moreover, our approach incurs negligible area and power overheads, adding just 0.005% area and 0.1% power to a conventional modern architecture.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
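The concise load/store idea can be emulated in software by truncating a float64's low-order mantissa bits on a "store-concise" and reconstituting a full-width value on a "load-concise", so computation stays full-width while stored values carry fewer useful bits. This masking sketch is only illustrative; ACME realizes the mechanism as ISA extensions with small dedicated functional units.

# Software emulation of concise stores/loads via mantissa truncation.
import struct

def store_concise(x, kept_mantissa_bits=20):
    """Truncate a float64 mantissa (52 bits) down to `kept_mantissa_bits` bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = ~((1 << (52 - kept_mantissa_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return bits & mask

def load_concise(stored_bits):
    return struct.unpack("<d", struct.pack("<Q", stored_bits))[0]

values = [3.141592653589793, 2.718281828459045, 6.62607015e-34]
for v in values:
    approx = load_concise(store_concise(v))
    print(f"{v!r:>25} -> {approx!r:>25}  (rel. err {abs(v - approx) / abs(v):.1e})")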
Zekany, Stephen; Rings, Daniel; Harada, Nathan; Laurenzano, Michael A; Tang, Lingjia; Mars, Jason
CrystalBall: Statically analyzing runtime behavior via deep sequence learning Inproceedings
In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE 2016.
@inproceedings{zekany2016crystalball,
title = {CrystalBall: Statically analyzing runtime behavior via deep sequence learning},
author = {Stephen Zekany and Daniel Rings and Nathan Harada and Michael A Laurenzano and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/3195638.3195667.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {1--12},
organization = {IEEE},
abstract = {Understanding dynamic program behavior is critical in many stages of the software development lifecycle, for purposes as diverse as optimization, debugging, testing, and security. This paper focuses on the problem of predicting dynamic program behavior statically. We introduce a novel technique to statically identify hot paths that leverages emerging deep learning techniques to take advantage of their ability to learn subtle, complex relationships between sequences of inputs. This approach maps well to the problem of identifying the behavior of sequences of basic blocks in program execution. Our technique is also designed to operate on the compiler's intermediate representation (IR), as opposed to the approaches taken by prior techniques that have focused primarily on source code, giving our approach language-independence. We describe the pitfalls of conventional metrics used for hot path prediction such as accuracy, and motivate the use of Area Under the Receiver Operating Characteristic curve (AUROC). Through a thorough evaluation of our technique on complex applications that include the SPEC CPU2006 benchmarks, we show that our approach achieves an AUROC of 0.85.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
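The abstract's argument for AUROC over accuracy is easy to demonstrate: with few hot paths, a predictor that labels everything cold is highly "accurate" yet useless, while AUROC measures ranking quality. The labels and scores below are synthetic, not drawn from the paper's models.

# Accuracy vs. AUROC for a toy hot-path predictor under class imbalance.
import random

random.seed(11)
paths = [(1, random.uniform(0.4, 1.0)) if random.random() < 0.05      # ~5% hot paths
         else (0, random.uniform(0.0, 0.7)) for _ in range(2000)]
labels = [l for l, _ in paths]
scores = [s for _, s in paths]

def auroc(labels, scores):
    """Probability that a random hot path is scored above a random cold path."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

always_cold_accuracy = sum(l == 0 for l in labels) / len(labels)
print(f"'always cold' accuracy: {always_cold_accuracy:.2%}")
print(f"AUROC of the scoring predictor: {auroc(labels, scores):.3f}")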
Hauswald, Johann; Laurenzano, Michael A; Zhang, Yunqi; Li, Cheng; Rovinski, Austin; Khurana, Arjun; Dreslinski, Ronald G; Mudge, Trevor; Petrucci, Vinicius; Tang, Lingjia; et al.
Sirius implications for future warehouse-scale computers Journal Article
In: IEEE Micro, vol. 36, no. 3, pp. 42–53, 2016.
@article{hauswald2016sirius,
title = {Sirius implications for future warehouse-scale computers},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and Vinicius Petrucci and Lingjia Tang and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/04/07478443.pdf},
year = {2016},
date = {2016-01-01},
journal = {IEEE Micro},
volume = {36},
number = {3},
pages = {42--53},
publisher = {IEEE},
abstract = {Demand is expected to grow significantly for cloud services that deliver sophisticated artificial intelligence on the critical path of user queries, as is the case with intelligent personal assistants such as Apple's Siri. If the prediction of the trend is correct, these types of applications will likely consume most of the world's computing cycles. The Sirius project was motivated to investigate what this future might look like and how cloud architectures should evolve to achieve it.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2015
Petrucci, Vinicius; Laurenzano, Michael A; Doherty, John; Zhang, Yunqi; Mosse, Daniel; Mars, Jason; Tang, Lingjia
Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers Inproceedings
In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 246–258, IEEE 2015.
@inproceedings{petrucci2015octopus,
title = {Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers},
author = {Vinicius Petrucci and Michael A Laurenzano and John Doherty and Yunqi Zhang and Daniel Mosse and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07056037.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)},
pages = {246--258},
organization = {IEEE},
abstract = {Heterogeneous multicore architectures have the potential to improve energy efficiency by integrating power-efficient wimpy cores with high-performing brawny cores. However, it is an open question how to deliver energy reduction while ensuring the quality of service (QoS) of latency-sensitive web-services running on such heterogeneous multicores in warehouse-scale computers (WSCs). In this work, we first investigate the implications of heterogeneous multicores in WSCs and show that directly adopting heterogeneous multicores without re-designing the software stack to provide QoS management leads to significant QoS violations. We then present Octopus-Man, a novel QoS-aware task management solution that dynamically maps latency-sensitive tasks to the least power-hungry processing resources that are sufficient to meet the QoS requirements. Using carefully-designed feedback-control mechanisms, Octopus-Man addresses critical challenges that emerge due to uncertainties in workload fluctuations and adaptation dynamics in a real system. Our evaluation using web-search and memcached running on a real-system Intel heterogeneous prototype demonstrates that Octopus-Man improves energy efficiency by up to 41% (CPU power) and up to 15% (system power) over an all-brawny WSC design while adhering to specified QoS targets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
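The feedback loop described above can be sketched as a small controller that watches measured latency against the QoS target and migrates the latency-sensitive task between wimpy and brawny cores. The latency model and thresholds are invented; Octopus-Man's controller additionally handles stability, hysteresis, and real workload dynamics.

# Toy latency-driven controller mapping a latency-sensitive task to wimpy or brawny cores.
QOS_TARGET_MS = 100.0
UPGRADE_AT = 0.90 * QOS_TARGET_MS      # losing slack -> move toward brawny cores
DOWNGRADE_AT = 0.60 * QOS_TARGET_MS    # ample slack  -> move toward wimpy cores to save power

def measured_latency_ms(load, on_brawny):
    base = 40.0 if on_brawny else 70.0
    return base / max(1e-6, 1.0 - load)     # latency grows as utilization approaches 1

on_brawny = False
for load in (0.2, 0.4, 0.6, 0.4, 0.2, 0.1):
    latency = measured_latency_ms(load, on_brawny)
    if latency > UPGRADE_AT and not on_brawny:
        on_brawny = True
    elif latency < DOWNGRADE_AT and on_brawny:
        on_brawny = False
    print(f"load {load:.1f}: {latency:6.1f} ms -> running on {'brawny' if on_brawny else 'wimpy'} cores")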
Hsu, Chang-Hong; Zhang, Yunqi; Laurenzano, Michael A; Meisner, David; Wenisch, Thomas; Mars, Jason; Tang, Lingjia; Dreslinski, Ronald G
Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting Inproceedings
In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 271–282, IEEE 2015.
@inproceedings{hsu2015adrenaline,
title = {Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting},
author = {Chang-Hong Hsu and Yunqi Zhang and Michael A Laurenzano and David Meisner and Thomas Wenisch and Jason Mars and Lingjia Tang and Ronald G Dreslinski},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07056039.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)},
pages = {271--282},
organization = {IEEE},
abstract = {Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor's voltage and frequency during a coarse-grain sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach to leverage finer granularity, 10's of nanoseconds, voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; and second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50x tail latency improvement for Memcached and up to a 3.03x for Web Search over coarse-grained DVFS given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81x improvement for Memcached and up to a 1.99x for Web Search over coarse-grained DVFS.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
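A toy version of query-level boosting: a cheap per-query indicator flags likely-tail queries, and only those run at the boosted voltage/frequency so the power budget is spent where it shortens the tail. The indicator, service-time model, and 2x boost effect below are hypothetical.

# Boost only queries a cheap indicator predicts to be heavy, then compare tail latency.
import random
import statistics

random.seed(5)
BOOST_SPEEDUP = 2.0

def predicted_heavy(query):
    return query["terms"] >= 5                 # toy indicator: long queries tend to be slow

queries = [{"terms": random.randint(1, 6),
            "base_ms": random.expovariate(1 / 8)} for _ in range(20_000)]
for q in queries:
    q["base_ms"] *= q["terms"]                 # heavier queries take longer on average

def p99(xs):
    return statistics.quantiles(xs, n=100)[98]

no_boost = [q["base_ms"] for q in queries]
boosted = [q["base_ms"] / BOOST_SPEEDUP if predicted_heavy(q) else q["base_ms"] for q in queries]

boosted_share = sum(map(predicted_heavy, queries)) / len(queries)
print(f"p99 without boosting: {p99(no_boost):6.1f} ms")
print(f"p99 boosting predicted-heavy queries only: {p99(boosted):6.1f} ms ({boosted_share:.0%} of queries boosted)")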
Hauswald, Johann; Laurenzano, Michael A; Zhang, Yunqi; Li, Cheng; Rovinski, Austin; Khurana, Arjun; Dreslinski, Ronald G; Mudge, Trevor; Petrucci, Vinicius; Tang, Lingjia; et al.
Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers Inproceedings
In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 223–238, 2015.
@inproceedings{hauswald2015sirius,
title = {Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers},
author = {Johann Hauswald and Michael A Laurenzano and Yunqi Zhang and Cheng Li and Austin Rovinski and Arjun Khurana and Ronald G Dreslinski and Trevor Mudge and Vinicius Petrucci and Lingjia Tang and others},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/2694344.2694347.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {223--238},
abstract = {As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs.
To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Skach, Matt; Arora, Manish; Hsu, Chang-Hong; Li, Qi; Tullsen, Dean; Tang, Lingjia; Mars, Jason
Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers Inproceedings
In: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 439–449, 2015.
@inproceedings{skach2015thermal,
title = {Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers},
author = {Matt Skach and Manish Arora and Chang-Hong Hsu and Qi Li and Dean Tullsen and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07284085.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
pages = {439--449},
abstract = {Datacenters, or warehouse scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Hauswald, Johann; Kang, Yiping; Laurenzano, Michael A; Chen, Quan; Li, Cheng; Mudge, Trevor; Dreslinski, Ronald G; Mars, Jason; Tang, Lingjia
DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers Inproceedings
In: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 27–40, IEEE 2015.
@inproceedings{hauswald2015djinn,
title = {DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers},
author = {Johann Hauswald and Yiping Kang and Michael A Laurenzano and Quan Chen and Cheng Li and Trevor Mudge and Ronald G Dreslinski and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07284053.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)},
pages = {27--40},
organization = {IEEE},
abstract = {As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, webservice companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Khan, Muneeb; Laurenzano, Michael A; Mars, Jason; Hagersten, Erik; Black-Schaffer, David
AREP: Adaptive resource efficient prefetching for maximizing multicore performance Inproceedings
In: 2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 367–378, IEEE 2015.
@inproceedings{khan2015arep,
title = {AREP: Adaptive resource efficient prefetching for maximizing multicore performance},
author = {Muneeb Khan and Michael A Laurenzano and Jason Mars and Erik Hagersten and David Black-Schaffer},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07429320.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {2015 International Conference on Parallel Architecture and Compilation (PACT)},
pages = {367--378},
organization = {IEEE},
abstract = {Modern processors widely use hardware prefetching to hide memory latency. While aggressive hardware prefetchers can improve performance significantly for some applications, they can limit the overall performance in highly-utilized multicore processors by saturating the off-chip bandwidth and wasting last-level cache capacity. Co-executing applications can slow down due to contention over these shared resources. This work introduces Adaptive Resource Efficient Prefetching (AREP) -- a runtime framework that dynamically combines software prefetching and hardware prefetching to maximize throughput in highly utilized multicore processors. AREP achieves better performance by prefetching data in a resource efficient way -- conserving off-chip bandwidth and last-level cache capacity with accurate prefetching and by applying cache-bypassing when possible. AREP dynamically explores a mix of hardware/software prefetching policies, then selects and applies the best performing policy. AREP is phase-aware and re-explores (at runtime) for the best prefetching policy at phase boundaries. A multitude of experiments with workload mixes and parallel applications on a modern high performance multicore show that AREP can increase throughput by up to 49% (8.1% on average). This is complemented by improved fairness, resulting in average quality of service above 94%.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
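The explore-then-exploit loop in the abstract can be sketched as follows: at each phase boundary the runtime briefly trials each prefetching policy, keeps the best, and runs with it until the next phase. The policy names and simulated throughputs are made up; AREP toggles real hardware prefetchers and inserts software prefetches.

# Phase-boundary exploration of prefetching policies (simulated measurements).
import random

random.seed(2)
POLICIES = ["hw-aggressive", "hw-conservative", "sw-prefetch", "sw-prefetch+bypass"]

def measure_throughput(policy, phase):
    """Stand-in for a short timed trial of one policy in the current phase."""
    preference = {0: "hw-aggressive", 1: "sw-prefetch+bypass", 2: "hw-conservative"}[phase]
    return 1.0 + (0.3 if policy == preference else 0.0) + random.uniform(-0.05, 0.05)

for phase in range(3):
    trials = {p: measure_throughput(p, phase) for p in POLICIES}   # explore
    best = max(trials, key=trials.get)                             # exploit until the next phase
    print(f"phase {phase}: selected '{best}' ({trials[best]:.2f}x baseline)")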
2014
Zhai, Yan; Zhang, Xiao; Eranian, Stephane; Tang, Lingjia; Mars, Jason
HaPPy: Hyperthread-aware power profiling dynamically Inproceedings
In: 2014 USENIX Annual Technical Conference (USENIX ATC 2014), pp. 211–217, 2014.
@inproceedings{zhai2014happy,
title = {HaPPy: Hyperthread-aware power profiling dynamically},
author = {Yan Zhai and Xiao Zhang and Stephane Eranian and Lingjia Tang and Jason Mars},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/atc14-paper-zhai.pdf},
year = {2014},
date = {2014-01-01},
booktitle = {2014 USENIX Annual Technical Conference (USENIX ATC 2014)},
pages = {211--217},
abstract = {Quantifying the power consumption of individual applications co-running on a single server is a critical component for software-based power capping, scheduling, and provisioning techniques in modern datacenters. However, with the proliferation of hyperthreading in the last few generations of server-grade processor designs, the challenge of accurately and dynamically performing this power attribution to individual threads has been significantly exacerbated. Due to the sharing of core-level resources such as functional units, prior techniques are not suitable to attribute the power consumption between hyperthreads sharing a physical core.
In this paper, we present a runtime mechanism that quantifies and attributes power consumption to individual jobs at fine granularity. Specifically, we introduce a hyperthread-aware power model that differentiates between the states when both hardware threads of a core are in use, and when only one thread is in use. By capturing these two different states, we are able to accurately attribute power to each logical CPU in modern servers. We conducted experiments with several Google production workloads on an Intel Sandy Bridge server. Compared to a prior hyperthread-oblivious model, HaPPy is substantially more accurate, reducing the prediction error from 20.5% to 7.5% on average and from 31.5% to 9.4% in the worst case.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
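A minimal sketch of hyperthread-aware attribution: a core's power is charged to co-running jobs differently depending on whether one or both hardware threads were active during the interval, instead of splitting naively per busy cycle. The wattage coefficients and interval below are invented, not HaPPy's fitted model.

# Attribute one core's energy to two jobs, treating solo and co-run time differently.
SINGLE_THREAD_W = 6.0        # core power when only one hyperthread is active (hypothetical)
BOTH_THREADS_W = 8.5         # co-running threads share the core, so not simply 2 x 6.0

def attribute(job_a_solo_s, job_b_solo_s, both_s):
    """Attribute a core's energy (joules) to two jobs over one sample interval."""
    energy_a = job_a_solo_s * SINGLE_THREAD_W + both_s * BOTH_THREADS_W / 2
    energy_b = job_b_solo_s * SINGLE_THREAD_W + both_s * BOTH_THREADS_W / 2
    return energy_a, energy_b

# 10-second interval: job A alone for 3 s, job B alone for 2 s, both co-running for 4 s.
a_joules, b_joules = attribute(job_a_solo_s=3, job_b_solo_s=2, both_s=4)
print(f"job A: {a_joules:.1f} J, job B: {b_joules:.1f} J over the interval")
# A hyperthread-oblivious model would charge each co-run second at the solo rate,
# overestimating that time by roughly 2 * 6.0 - 8.5 = 3.5 W.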
Zhang, Yunqi; Laurenzano, Michael A; Mars, Jason; Tang, Lingjia
SMiTe: Precise QoS prediction on real-system SMT processors to improve utilization in warehouse scale computers Inproceedings
In: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 406–418, IEEE 2014.
@inproceedings{zhang2014smite,
title = {SMiTe: Precise QoS prediction on real-system SMT processors to improve utilization in warehouse scale computers},
author = {Yunqi Zhang and Michael A Laurenzano and Jason Mars and Lingjia Tang},
url = {https://www.jasonmars.org/wp-content/uploads/2020/05/07011405.pdf},
year = {2014},
date = {2014-01-01},
booktitle = {2014 47th Annual IEEE/ACM International Symposium on Microarchitecture},
pages = {406--418},
organization = {IEEE},
abstract = {One of the key challenges for improving efficiency in warehouse scale computers (WSCs) is to improve server utilization while guaranteeing the quality of service (QoS) of latency-sensitive applications. To this end, prior work has proposed techniques to precisely predict performance and QoS interference to identify 'safe' application co-locations. However, such techniques are only applicable to resources shared across cores. Achieving such precise interference prediction on real-system simultaneous multithreading (SMT) architectures has been a significantly challenging open problem due to the complexity introduced by sharing resources within a core. In this paper, we demonstrate through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way we model interference. For SMT servers, the interference on different shared resources, including private caches, memory ports, as well as integer and floating-point functional units, do not correlate with each other. This insight suggests the necessity of decoupling interference into multiple resource sharing dimensions. In this work, we propose SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors. With a set of Rulers, which are carefully designed software stressors that apply pressure to a multidimensional space of shared resources, we quantify application sensitivity and contentiousness in a decoupled manner. We then establish a regression model to combine the sensitivity and contentiousness in different dimensions to predict performance interference. Using this methodology, we are able to precisely predict the performance interference in SMT co-location with an average error of 2.80% on SPEC CPU2006 and 1.79% on Cloud Suite. Our evaluation shows that SMiTe allows us to improve the utilization of WSCs by up to 42.57% while enforcing an application's QoS requirements.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
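The decoupled prediction idea above can be sketched as profiling sensitivity and contentiousness per shared-resource dimension and combining them with a fitted model to predict SMT co-location slowdown. The dimensions, scores, and linear weights below are invented for illustration; SMiTe derives them with its Ruler stressors and a trained regression model.

# Predict co-location slowdown from per-dimension sensitivity x contentiousness (hypothetical values).
DIMENSIONS = ["private-cache", "memory-ports", "int-units", "fp-units"]

sensitivity_web_search = {"private-cache": 0.7, "memory-ports": 0.5, "int-units": 0.2, "fp-units": 0.1}
contentiousness_batch  = {"private-cache": 0.6, "memory-ports": 0.8, "int-units": 0.4, "fp-units": 0.9}

# Per-dimension weights of a (hypothetical) fitted regression model.
weights = {"private-cache": 0.25, "memory-ports": 0.30, "int-units": 0.10, "fp-units": 0.05}

predicted_slowdown = 1.0 + sum(
    weights[d] * sensitivity_web_search[d] * contentiousness_batch[d] for d in DIMENSIONS
)
qos_target = 1.20     # tolerate at most 20% degradation
print(f"predicted slowdown if co-located on SMT siblings: {predicted_slowdown:.2f}x")
print("co-location allowed" if predicted_slowdown <= qos_target else "co-location rejected")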