Published Papers

Page 1 of 2

    BibTeX for ANGEL: A Hierarchical Approach to Multi-Objective Online Auto-Tuning:

    @inproceedings{chen2015angel,
      author    = {Ray Chen and Jeffrey K. Hollingsworth},
      title     = {{ANGEL}: A Hierarchical Approach to Multi-Objective Online Auto-Tuning},
      booktitle = {Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers},
      series    = {ROSS '15},
      month     = jun,
      year      = {2015},
      isbn      = {978-1-4503-3606-2},
    }

  • Ray Chen and Jeffrey K. Hollingsworth, ANGEL: A Hierarchical Approach to Multi-Objective Online Auto-Tuning, Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, June 2015, PDF link, BibTeX
  • Jeffrey K. Hollingsworth and Sukhyun Song, Designing and Auto-Tuning Parallel 3-D FFT for Computation-Communication Overlap, Feb. 2014 PPoPP '14, PDF link
  • Jeffrey K. Hollingsworth and Ray Chen, Towards Fully Automatic Auto-tuning: Leveraging language features of Chapel, Nov. 2013 International Journal of High Performance Computing Applications, 27(4), PDF link
  • Ananta Tiwari, Vamsi Sripathi, Sarat Sreepathi, Shreyas Ramalingam, Gabriel Marin, Kumar Mahinthakumar, Jeffrey K. Hollingsworth, Mary Hall, Chun Chen and Jacqueline Chame, PERI Autotuning of PFLOTRAN, July 2011 Journal of Physics: Conference Series, Proceedings of SciDAC 2011
  • Jeffrey K. Hollingsworth and Ananta Tiwari, Online Adaptive Code Generation and Tuning, May 2011 IPDPS, PDF link
  • Jacqueline Chame, Dan Quinlan, C. Liao, Mary Hall, Chun Chen, Jeffrey K. Hollingsworth and Ananta Tiwari, Auto-tuning Full Applications: A Case Study, July 2011 International Journal of High Performance Computing Applications, PDF link
  • Jeffrey K. Hollingsworth and Ananta Tiwari, End-to-end Auto-tuning with Active Harmony, 2010, in Performance Tuning in Scientific Computing, D. Bailey and S. Williams, eds.
  • Jeffrey K. Hollingsworth, Vahid Tabatabaee and Ananta Tiwari, Tuning Parallel Applications in Parallel, Aug 2009 Parallel Computing,
  • Jeffrey K. Hollingsworth, Mary Hall, Jacqueline Chame, Chun Chen and Ananta Tiwari, A Scalable Autotuning Framework for Compiler Optimization, May 2009 IPDPS 2009, Rome, PDF link
  • Haihang You, Sam Williams, Ananta Tiwari, Jaewook Shin, Keith Seymour, Shirley Moore, Paul Hovland, Jeffrey K. Hollingsworth, Mary Hall, Jack Dongarra, Chun Chen, Jacqueline Chame and David Bailey, PERI Auto-Tuning, Nov. 2008 Journal of Physics: Conference Series 125 (2008), PDF link
  • Jeffrey K. Hollingsworth and Vahid Tabatabaee, Automatic Software Interference Detection in Parallel Applications, Nov. 2007 SC'07, Reno, NV, PDF link
  • Jeffrey K. Hollingsworth and I-Hsin Chung, A Case Study Using Automatic Performance Tuning for Large-Scale Scientific Programs, June 2006 International Symposium on High Performance Distributed Computing (HPDC), Paris, PDF link
  • Jeffrey K. Hollingsworth, Ananta Tiwari and Vahid Tabatabaee, Parallel Parameter Tuning for Applications with Performance Variability, November 2005 SC'05, Seattle WA, PDF link
  • Jeffrey K. Hollingsworth and I-Hsin Chung, Automated Cluster-Based Web Service Performance Tuning, June 2004 Proceedings of IEEE Conference on High Performance Distributed Computing (HPDC), PDF link
  • Jeffrey K. Hollingsworth and I-Hsin Chung, Using Information from Prior Runs to Improve Automated Tuning Systems, November 2004 Proceedings of SuperComputing, PDF link
  • I-Hsin Chung, Towards Automatic Performance Tuning, December 2004 Ph.D. Dissertation, University of Maryland, PDF link
  • Jeffrey K. Hollingsworth and I-Hsin Chung, Runtime Selection Among Different API Implementations, June 2003 Parallel Processing Letters Vol. 13, No. 2,
  • Jeffrey K. Hollingsworth, I-Hsin Chung and Cristian Tapus, Active Harmony: Towards Automated Performance Tuning, November 2002 Proceedings of SuperComputing, PDF link
  • Kyung Dong Ryu, Exploiting Idle Cycles In Networks Of Workstations, August 2001 Ph.D. Dissertation, University of Maryland, PDF link
  • Peter Keleher and Jeffrey K. Hollingsworth, Prediction and Adaptation in Active Harmony, July 1999 Cluster Computing, 2 (1999),
  • Dejan Perkovic, Jeffrey K. Hollingsworth and Peter Keleher, Exposing Application Alternatives, June 1999 International Conference on Distributed Computing Systems (ICDCS), PDF link
  • Peter Keleher, Jeffrey K. Hollingsworth and Kyung Dong Ryu, Mechanisms and Policies for Supporting Fine-Grained Cycle Stealing, June 1999 International Conference on Supercomputing (ICS), PDF link
  • Peter Keleher and Jeffrey K. Hollingsworth, Prediction and Adaptation in Active Harmony, July 1998 IEEE International Symposium on High Performance Distributed Computing (HPDC), PDF link
    Abstract for ANGEL: A Hierarchical Approach to Multi-Objective Online Auto-Tuning:

    The majority of auto-tuning research in HPC to date has focused on improving the performance of a single objective function. More recently, interest has grown in optimizing multiple objective functions simultaneously, for example balancing the trade-off between execution time and energy consumption. Evolutionary algorithms for multi-objective optimization attempt to quickly and accurately discover the set of Pareto-optimal solutions, or the Pareto frontier. However, a single solution (as opposed to a set) is desirable for online end-to-end auto-tuning. As an alternative, we use a hierarchical method based on prioritizing the objective functions and iteratively optimizing them in isolation to efficiently produce a single Pareto-optimal result. This approach has several advantages over evolutionary or even scalarization techniques for auto-tuning HPC problems. In this paper, we introduce the Automated Navigation Given Enumerated Leeways (ANGEL) auto-tuning method. We demonstrate the quality of ANGEL using a multi-objective benchmark test suite. We also demonstrate its utility by approximating various sections of a Pareto front from a real-world proxy application. ANGEL successfully navigates trade-offs that range from a 15% reduction in kernel run-time to a 3% savings in energy use.
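
    The hierarchical strategy the abstract describes can be sketched in a few lines. The parameter space, cost models, and leeway values below are hypothetical stand-ins for illustration, not ANGEL's actual models:

```python
# Lexicographic (hierarchical) optimization sketch: objectives are visited in
# priority order; each pass keeps only the configurations whose value is
# within a tolerance ("leeway") of that objective's best, and the next
# objective is optimized over the survivors. Everything below is synthetic.

def lexicographic_optimize(configs, objectives, leeways):
    """configs: candidate configurations; objectives: (name, fn) pairs in
    priority order; leeways: relative slack allowed per objective."""
    survivors = list(configs)
    for (name, fn), leeway in zip(objectives, leeways):
        best = min(fn(c) for c in survivors)
        survivors = [c for c in survivors if fn(c) <= best * (1 + leeway)]
    return survivors[0]  # a single Pareto-optimal point, as online tuning needs

# Toy search space: (unroll factor, tile size) with synthetic cost models.
configs = [(u, t) for u in (1, 2, 4, 8) for t in (16, 32, 64)]
runtime = lambda c: 100 / c[0] + 0.1 * abs(c[1] - 32)  # favors u=8, t=32
energy = lambda c: 5 * c[0] + c[1] / 8                 # favors small u and t

# Allow 30% runtime slack, then pick the most energy-efficient survivor.
choice = lexicographic_optimize(
    configs,
    objectives=[("runtime", runtime), ("energy", energy)],
    leeways=[0.3, 0.0],
)
print(choice)  # → (8, 16): trades a little runtime for lower energy
```

    With zero leeway on the last objective, the search collapses to a single configuration, which is the property that makes a hierarchical method attractive for online end-to-end tuning.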

    Abstract for Designing and Auto-Tuning Parallel 3-D FFT for Computation-Communication Overlap:

    This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. To cope with the complex tradeoff regarding our optimization techniques, we parameterize our code and auto-tune the parameters efficiently in a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76× over the FFTW library.
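
    The loop-tiling transformation mentioned above can be illustrated on a matrix transpose (a sketch only: the cache benefit of tiling shows up in compiled code, not in Python, and the tile size is exactly the kind of parameter an auto-tuner would search over):

```python
# Illustration of loop tiling: the i/j loops are split into tile-sized
# blocks so each block of the matrix is touched while it is still
# cache-resident. This shows the restructuring itself, not its speedup.

def transpose_naive(a):
    n = len(a)
    return [[a[j][i] for j in range(n)] for i in range(n)]

def transpose_tiled(a, tile):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):                      # loop over tile origins
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):    # loops within one tile
                for j in range(jj, min(jj + tile, n)):
                    out[i][j] = a[j][i]
    return out

a = [[i * 8 + j for j in range(8)] for i in range(8)]
assert transpose_tiled(a, 4) == transpose_naive(a)    # same result, any tile
```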

    Abstract for Towards Fully Automatic Auto-tuning: Leveraging language features of Chapel:

    Application auto-tuning has produced excellent results in a wide range of computing domains. Yet adapting an application to use this technology remains a predominately manual and labor intensive process. This paper explores first steps towards reducing adoption cost by focusing on two tasks: parameter identification and range selection. We show how these traditionally manual tasks can be automated in the context of Chapel, a parallel programming language developed by Cray Inc. Potential auto-tuning parameters may be inferred from existing Chapel applications by leveraging features unique to this language. After verification, these parameters may then be passed to an auto-tuner for an automatic search of the induced parameter space. To further automate adoption, we also present Tuna: an auto-tuning shell designed to tune applications by manipulating their command-line arguments. Finally, we demonstrate the immediate utility of this system by tuning two Chapel applications with little or no internal knowledge of the program source.

    Abstract for Online Adaptive Code Generation and Tuning:

    In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.
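
    The idea of generating and compiling a new code variant per parameter value can be sketched with a toy Python stand-in (the generated sum-reduction and its names are made up; this is not Active Harmony's actual code generator):

```python
# Toy illustration of on-the-fly code generation for a tunable parameter:
# each candidate unroll factor yields new source text that is compiled at
# runtime and timed, mimicking the generate-compile-evaluate loop.
import timeit

def make_sum_variant(unroll):
    """Generate and compile a sum reduction whose loop body is unrolled
    `unroll` times (assumes len(data) is a multiple of `unroll`)."""
    body = "\n".join(f"        acc += data[i + {k}]" for k in range(unroll))
    src = (
        "def variant(data):\n"
        "    acc = 0\n"
        f"    for i in range(0, len(data), {unroll}):\n"
        f"{body}\n"
        "    return acc\n"
    )
    namespace = {}
    exec(compile(src, f"<unroll{unroll}>", "exec"), namespace)
    return namespace["variant"]

data = list(range(1024))
for unroll in (1, 2, 4, 8):
    fn = make_sum_variant(unroll)
    assert fn(data) == sum(data)  # every variant computes the same result
    secs = timeit.timeit(lambda: fn(data), number=200)
    print(f"unroll={unroll}: {secs:.4f}s")
```

    In the real system the variants are compiled native code and the timing runs are farmed out to different nodes in parallel, but the generate-compile-measure cycle is the same.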

    Abstract for Auto-tuning Full Applications: A Case Study:

    In this paper, we take a concrete step towards materializing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. The system first extracts computationally intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. We show that our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the application's performance improves by 27%.

    Abstract for Tuning Parallel Applications in Parallel:

    In this paper, we present and evaluate a parallel algorithm for parameter tuning of parallel applications. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithm and propose a strategy to minimize the impact of this variability. We evaluate our algorithm within the Active Harmony system, an automated online/offline tuning framework. We study its performance on three benchmark codes: PSTSWM, HPL and POP. Compared to the Nelder-Mead algorithm, our algorithm finds better configurations up to 7 times faster. For POP, we were able to improve the performance of a production sized run by 59%.

    Abstract for A Scalable Autotuning Framework for Compiler Optimization:

    We describe a scalable and general-purpose framework for autotuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the best-performing one. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in a few tens of search steps.

    Abstract for PERI Auto-Tuning:

    The enormous and growing complexity of today's high-end systems has increased the already significant challenges of obtaining high performance on equally complex scientific applications. Application scientists are faced with a daunting challenge in tuning their codes to exploit performance-enhancing architectural features. The Performance Engineering Research Institute (PERI) is working toward the goal of automating portions of the performance tuning process. This paper describes PERI's overall strategy for auto-tuning tools and recent progress in both building auto-tuning tools and demonstrating their success on kernels, some taken from large-scale applications.

    Abstract for Automatic Software Interference Detection in Parallel Applications:

    We present an automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and from unexpected processes. If not detected and corrected, such interference may result in performance degradation. Our goal is to provide a reliable metric for software interference that can be used in soft-failure protection and recovery systems. A unique feature of our algorithm is that we measure the relative timing of application events (i.e., time between MPI calls) rather than system-level events such as CPU utilization. This approach lets our system automatically accommodate natural variations in an application's utilization of resources. We use performance irregularities and degradation as signs of software interference. However, instead of relying on temporal changes in performance, our system detects spatial performance degradation across multiple processors. We also include a case study that demonstrates our technique's effectiveness, resilience and robustness.

    Abstract for A Case Study Using Automatic Performance Tuning for Large-Scale Scientific Programs:

    Active Harmony is an automated runtime performance tuning system. In this paper we describe several case studies of using Active Harmony to improve the performance of scientific libraries and applications. We improved the tuning mechanism so it can work iteratively with benchmarking runs. By tuning the computation and data distribution, Active Harmony helps applications that utilize the PETSc library to achieve better load balance and to reduce the execution time by up to 18%. For the climate simulation application POP running on 480 processors, the tuning results show that by changing the block size and parameter values, the execution time is reduced by up to 16.7%. Active Harmony is able to make GS2, a plasma physics code, up to 5.1 times faster. These results show that the Active Harmony system is a feasible and useful tool for the automated performance tuning of scientific libraries and applications.

    Abstract for Parallel Parameter Tuning for Applications with Performance Variability:

    In this paper, we present parallel on-line optimization algorithms for parameter tuning of parallel programs. We employ direct search algorithms that update parameters based on real-time performance measurements. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithms and propose modified versions of the direct search algorithms to cope with it. The modified versions use multiple samples instead of a single sample to estimate the performance more accurately. We present preliminary results showing that the performance variability of applications on clusters is heavy-tailed. Finally, we study and demonstrate the performance of the proposed algorithms on a real scientific application.
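
    The multiple-sample modification can be illustrated with a synthetic heavy-tailed noise model (hypothetical costs, not the paper's benchmark data): the more measurements taken per configuration, the more often the truly faster configuration wins the comparison.

```python
# Sketch of the multiple-sample strategy: under heavy-tailed (lognormal)
# measurement noise, ranking two configurations by one sample each often
# picks the slower one; ranking by the median of several samples is far
# more reliable. Costs and the noise model are synthetic.
import random

random.seed(42)

def measure(true_cost):
    """One noisy timing measurement with multiplicative heavy-tailed noise."""
    return true_cost * random.lognormvariate(0, 0.5)

def pick_best(costs, samples):
    """Return the index of the configuration with the lowest median time."""
    def median_time(cost):
        xs = sorted(measure(cost) for _ in range(samples))
        return xs[len(xs) // 2]
    return min(range(len(costs)), key=lambda i: median_time(costs[i]))

true_costs = [1.0, 1.2]  # configuration 0 is truly 20% faster
trials = 1000
accuracy = {}
for samples in (1, 5, 15):
    correct = sum(pick_best(true_costs, samples) == 0 for _ in range(trials))
    accuracy[samples] = correct / trials
    print(f"{samples:>2} samples/config: faster config chosen "
          f"{100 * accuracy[samples]:.0f}% of the time")
```

    More samples cost more tuning time, which is exactly the accuracy/efficiency trade-off the abstract refers to.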

    Abstract for Automated Cluster-Based Web Service Performance Tuning:

    Active Harmony provides a way to automate performance tuning. In this paper, we apply the Active Harmony system to improve the performance of a cluster-based web service system, an improvement that cannot easily be achieved by tuning individual components. The experimental results show that there is no single configuration for the system that performs well for all kinds of workloads. By tuning the parameters, Active Harmony helps the system adapt to different workloads and improves performance by up to 16%. For scalability, we demonstrate how to reduce the tuning time for a large system with many tunable parameters. Finally, an algorithm is proposed to automatically adjust the structure of cluster-based web systems; using this technique, system throughput is improved by up to 70%.

    Abstract for Using Information from Prior Runs to Improve Automated Tuning Systems:

    Active Harmony is an automated runtime performance tuning system. In this paper we describe a parameter prioritizing tool that helps focus on those parameters that are performance critical. Historical data is also utilized to further speed up the tuning process. We first verify our proposed approaches with synthetic data, and finally we verify all the improvements on a real cluster-based web service system. Taken together, these changes allow the Active Harmony system to reduce the time spent tuning by 35% to 50% and, at the same time, reduce the variation in performance while tuning.

    Abstract for Towards Automatic Performance Tuning:

    When the computing environment becomes heterogeneous and applications become modular with reusable components, automatic performance tuning is needed for these applications to run well in different environments. We present the Active Harmony automated runtime tuning system and describe the interface used by programs to make applications tunable. We present the optimization algorithm used to adjust application parameters and the Library Specification Layer which helps program library developers expose multiple variations of the same API using different algorithms. By comparing the experience stored in a database, the tuning server is able to find appropriate configurations more rapidly. Utilizing historical data together with a mechanism that estimates performance speeds up the tuning process. To avoid performance oscillations during the initial phase of the tuning process, we use improved search refinement techniques that use configurations equally spaced throughout the performance search space to make the tuning process smoother. We also introduce a parameter prioritizing tool to focus on those performance critical parameters. We demonstrate how to reduce the time when tuning a large system with many tunable parameters. The search space can be reduced by checking the relations among parameters to avoid unnecessary search. In addition, for homogeneous processing nodes, we demonstrate how to use one set of the parameters and replicate the values to the remaining processing nodes. For environments where parameters can be divided into independent groups, an individual tuning server is used for each group. An algorithm is given to automatically adjust the structure of cluster-based web systems and it improves the system throughput up to 70%. We successfully apply the Active Harmony system to a cluster-based web service system and scientific programs. 
By tuning the parameters, Active Harmony helps the system adapt to different workloads and improve the performance up to 16%. The performance improvement cannot easily be achieved by tuning individual components for such a system and there is no single configuration that performs well for all kinds of workloads. All the design and experimental results show that Active Harmony is a feasible and useful tool in performance tuning.

    Abstract for Active Harmony: Towards Automated Performance Tuning:

    In this paper, we present the Active Harmony automated runtime tuning system. We describe the interface used by programs to make applications tunable. We present the Library Specification Layer which helps program library developers expose multiple variations of the same API using different algorithms. The Library Specification Language helps to select the most appropriate program library to tune the overall performance. We also present the optimization algorithm used to adjust parameters in the application and the libraries. Finally, we present results that show how the system is able to tune several real applications. The automated tuning system is able to tune the application parameters to within a few percent of the best value after evaluating only 11 out of over 1,700 possible configurations.

    Abstract for Exploiting Idle Cycles In Networks Of Workstations:

    Studies have shown that workstations are idle a significant fraction of the time. Traditional idle resource harvesting systems define a social contract that permits guest jobs to run only when a workstation is idle. To enforce this contract, guest jobs are stopped and migrated as soon as the owner resumes use of their machines. However, such systems waste many opportunities to exploit idle cycles because of overly conservative estimates of resource contention. In this thesis, we present a new policy, called Linger-Longer, that refines the social contract to permit fine-grain cycle stealing. Linger-Longer allows guest jobs to linger on a machine at low priority even when local tasks are active. Also, we developed a new adaptive job migration scheme based on runtime cost/benefit analysis. Our simulation study shows that the Linger-Longer policy can improve the throughput of guest jobs on a cluster by up to 60% with only a few percent slowdown of local jobs. The simulation also demonstrates that guest parallel jobs can perform better with our new approach than with the traditional run-time reconfiguration approach. To limit the impact of guest jobs' resource use, new local resource scheduling policies and mechanisms are required. We present a suite of mechanisms to support prioritized use of CPU, memory, I/O and network bandwidth. An overall prototype of the Linger-Longer system has been implemented in Linux. The prototype integrates the operating system extensions for resource throttling and a new adaptive migration policy module while leveraging general job scheduling and checkpointing mechanisms of an existing system. Using this prototype, we conduct a head-to-head performance comparison between our fine-grain cycle stealing policies and the traditional coarse-grain cycle stealing policies. 
Experiments on a Linux cluster with a set of benchmark applications show that, overall, Linger-Longer can improve guest job throughput by 50% to 70%, with only a 3% host job slowdown.

    Abstract for Prediction and Adaptation in Active Harmony:

    We describe the design and early functionality of the Active Harmony global resource management system. Harmony is an infrastructure designed to efficiently execute parallel applications in large-scale, dynamic environments. Harmony differs from other projects with similar goals in that the system automatically adapts ongoing computations to changing conditions through online reconfiguration. This reconfiguration can consist of system-directed migration of work at several different levels, or automatic application adaptation through the use of tuning options exported by Harmony-aware applications. We describe early experience with work migration at the level of procedures, processes and lightweight threads.

    Abstract for Exposing Application Alternatives:

    We present the design of an interface to allow applications to export tuning alternatives to a higher-level system. By exposing different parameters that can be changed at runtime, applications can be made to adapt to changes in their execution environment due to other programs, or the addition or deletion of nodes, communication links etc. An integral part of this interface is that an application not only expose its options, but also the resource utilization of each option and the effect that the option will have on the application's performance. We discuss how these options can be evaluated to tune the overall performance of a collection of applications in the system. Finally, we show preliminary results from a database application that is automatically reconfigured by the system from query shipping to data shipping based on the number of active clients.

    Abstract for Mechanisms and Policies for Supporting Fine-Grained Cycle Stealing:

    This paper presents an investigation into local mechanisms and scheduling policies that allow guest processes to efficiently exploit otherwise-idle workstation resources. Unlike traditional policies that harvest cycles only from unused machines, we target policies that exploit resources even from machines that have active users. We present a set of kernel extensions that allow these policies to operate without significantly impacting host processes: 1) a new guest process priority that prevents processes from stealing any processor time from host processes, 2) a new page replacement policy that imposes hard bounds on the number of physical pages that can be obtained by guest processes when host processes are active, and 3) a new page-out strategy that adaptively increases the pageout rate of guest processes when new host processes are started. We evaluate both the individual impacts of each mechanism, and their utility in supporting Linger-Longer, an aggressive cycle-harvesting policy.

    Abstract for Prediction and Adaptation in Active Harmony:

    Active Harmony is a new infrastructure for managing resources in large, dynamic environments. One of Harmony's major strengths is its ability to gather and exploit many types of application-specific information. This paper concentrates on two. First, the LBF metric predicts the performance impact of moving procedures and processors. Second, active correlation-tracking is used to gather data sharing information and drive thread migration decision heuristics. However, Harmony's strength is not in the individual mechanisms and metrics used, but in the interfaces that allow global policy decisions and many types of application-specific information to be tied together. Harmony is a work in progress, and we are continuing to evaluate new sources of information and new mechanisms for possible inclusion in the system.
