Show simple item record

dc.identifier.uri: http://hdl.handle.net/11401/78358
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree [en_US]
dc.format: Monograph
dc.format.medium: Electronic Resource [en_US]
dc.format.mimetype: Application/PDF [en_US]
dc.language.iso: en_US
dc.type: Dissertation
dcterms.abstract: As we approach the post-Moore's-law era, the burden of achieving good performance is shifting from hardware architects to software engineers. Programmers can no longer expect a doubling of performance with each hardware generation. The trend has moved from faster CPUs to increased parallelism through multiple cores and specialized instructions (e.g., SIMD intrinsics). Taking advantage of the parallelism inherent in modern hardware, however, is a non-trivial task. The programmer must either spend significant time optimizing code for a parallel architecture or rely on automatic methods of code generation (e.g., compiler optimizations). This dissertation addresses each of these approaches and presents novel methods for easing the burden of optimization on the programmer. In the first part of this dissertation, we explore three case studies presenting successful optimizations for different architectures. In the first two case studies, optimizations are presented for GPGPU (General-Purpose GPU) accelerators using the CUDA programming language. The first case study presents an optimized version of the FDK back-projection algorithm, which is commonly used in CT (Computed Tomography) reconstruction. The second case study details how we map an inherently sequential clustering algorithm to a parallel architecture. In the third case study, we detail our optimizations for the LQCD (Lattice Quantum Chromodynamics) physics simulation algorithm, describing how we achieve good performance by combining AVX (Advanced Vector Extensions) intrinsics, OpenMP, and MPI. In the second part of this dissertation, we describe an iterative optimization framework based on the ant colony optimization algorithm. Optimization decisions are formulated as a directed acyclic graph, and the ant colony optimization algorithm is used to identify a path through this graph that corresponds to an optimal GPU implementation. This framework is then extended by tightly integrating it with a compiler based on the polyhedral model, allowing it to identify optimal skewing and permutation transformations as well as loop and GPU kernel fusion. Several optimizations not incorporated into previous GPU auto-tuners are also evaluated, including texture memory and parallel reduction. The framework also extends the traditional ant colony optimization algorithm to include performance metrics, as well as a regression tree analysis that segments the search space into regions with promising performance. Results show significant speed-up over a GPU code generator based on the polyhedral model alone. In the third part of this dissertation, we present a visualization interface that allows users to quickly explore a number of optimizations. This interface is built on top of R-Stream, an optimizing compiler based on the polyhedral model. It visualizes performance through heuristics and run-time performance data, allowing users to identify optimization opportunities and apply common loop transformations in a visually intuitive way. User studies show that users were able to identify optimizations that outperformed R-Stream alone. The interface also greatly improved users' understanding of the transformations made by R-Stream and the reasons behind them.
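The search described in the abstract — ants walking a directed acyclic graph of optimization decisions, with pheromone reinforcing edges on the best-performing path — can be illustrated with a minimal sketch. The graph, node names, and cost function below are purely hypothetical stand-ins; the dissertation's actual framework searches over GPU code transformations and scores paths by measured run time, not a lookup table.

```python
import random

random.seed(0)  # deterministic for illustration

# Hypothetical DAG of optimization decisions: each node lists the choices
# reachable from it. A start-to-end path represents one candidate implementation.
GRAPH = {
    "start": ["tile", "no_tile"],
    "tile": ["unroll", "no_unroll"],
    "no_tile": ["unroll", "no_unroll"],
    "unroll": ["end"],
    "no_unroll": ["end"],
}

# Hypothetical cost per decision (lower is better); the real framework
# would instead measure the run time of the generated GPU code.
COST = {"tile": 1.0, "no_tile": 3.0, "unroll": 0.5, "no_unroll": 1.5, "end": 0.0}

# One pheromone value per edge, initially uniform.
pheromone = {(u, v): 1.0 for u, succs in GRAPH.items() for v in succs}

def walk():
    """One ant builds a path, picking edges with pheromone-weighted probability."""
    node, path = "start", ["start"]
    while node != "end":
        succs = GRAPH[node]
        weights = [pheromone[(node, v)] for v in succs]
        node = random.choices(succs, weights=weights)[0]
        path.append(node)
    return path

def path_cost(path):
    return sum(COST[n] for n in path[1:])

best_path, best_cost = None, float("inf")
for _ in range(50):            # colony iterations
    for p in (walk() for _ in range(10)):  # 10 ants per iteration
        c = path_cost(p)
        if c < best_cost:
            best_path, best_cost = p, c
    # Evaporate all pheromone, then reinforce edges on the best path so far.
    for e in pheromone:
        pheromone[e] *= 0.9
    for edge in zip(best_path, best_path[1:]):
        pheromone[edge] += 1.0 / (1.0 + best_cost)

print(best_path, best_cost)
```

In this toy space the search converges on the tile-then-unroll path; the dissertation's extensions (polyhedral integration, performance metrics, regression-tree segmentation of the search space) build on this same path-search skeleton.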
dcterms.available: 2018-07-09T14:14:21Z
dcterms.contributor: Mueller, Klaus. [en_US]
dcterms.contributor: Chapman, Barbara [en_US]
dcterms.contributor: Liu, Annie [en_US]
dcterms.contributor: Meister, Benoit. [en_US]
dcterms.creator: Papenhausen, Eric
dcterms.dateAccepted: 2018-07-09T14:14:21Z
dcterms.dateSubmitted: 2018-07-09T14:14:21Z
dcterms.description: Department of Computer Science. [en_US]
dcterms.extent: 214 pg. [en_US]
dcterms.format: Monograph
dcterms.identifier: http://hdl.handle.net/11401/78358
dcterms.identifier: Papenhausen_grad.sunysb_0771E_13395.pdf [en_US]
dcterms.issued: 2017-08-01
dcterms.language: en_US
dcterms.provenance: Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2018-07-09T14:14:21Z. No. of bitstreams: 1. Papenhausen_grad.sunysb_0771E_13395.pdf: 7109738 bytes, checksum: 755cf975e90f67cd899a47b057433b5e (MD5) [en]
dcterms.provenance: Made available in DSpace on 2018-07-09T14:14:21Z (GMT). No. of bitstreams: 1. Papenhausen_grad.sunysb_0771E_13395.pdf: 7109738 bytes, checksum: 755cf975e90f67cd899a47b057433b5e (MD5). Previous issue date: 2017-08-01 [en]
dcterms.subject: Computer science
dcterms.subject: GPU
dcterms.subject: Iterative Optimization
dcterms.subject: Performance Optimization
dcterms.subject: Polyhedral Model
dcterms.title: Manual, Automatic, and Semi-Automatic Performance Optimization
dcterms.type: Dissertation


