
dc.identifier.uri: http://hdl.handle.net/11401/77471
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.
dc.format: Monograph
dc.format.medium: Electronic Resource
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Dissertation
dcterms.abstract: High-speed data replication is vital to data-intensive scientific computing, which often requires transferring large volumes of observation and simulation data efficiently among geographically dispersed facilities. Enterprise cloud services also depend on effective data replication to scale out of their own data centers and to optimize the work completed per dollar invested. In addition, end consumers are highly sensitive to the latency and responsiveness of sharing data among their networks of friends, web-based services, and personal mobile devices. Recent hardware advances offer the opportunity to enable ultra-high-speed data replication by aggressively adding more CPU cores, higher network bandwidth, and faster storage to a single commodity server. However, existing monolithic software was designed for the architectures of past decades and cannot mitigate the resulting system I/O bottlenecks. New, effective resource scheduling algorithms and software designs are indispensable for matching applications' actual performance to the bare-metal I/O capability. Designing an efficient end-to-end solution for high-speed data replication is non-trivial because of several interconnected factors: 1) optimal resource scheduling on multicore systems is proven NP-hard, and even heuristic algorithms incur prohibitive computation cost; 2) data replication involves many components along the end-to-end I/O paths, including PCI buses, CPU interconnect links, and various I/O controllers and chipsets, each of which can become a performance bottleneck; 3) heterogeneous I/O devices may exhibit different system performance under various workloads and access patterns, so data replication applications must choose appropriate optimization techniques and adjust their access patterns to achieve maximum system performance; and 4) scaling up to many requests and users necessitates maximal exploitation of the parallelism and concurrency available in state-of-the-art computer architectures. Together, these challenges call for a significant effort to rethink, redesign, and re-implement the entire software suite. We first analyze state-of-the-art I/O devices and multicore systems using benchmark tools and by monitoring various performance event counters. In addition, we propose a new metric (the NUMA scheduling factor) and a performance modeling method to accurately describe I/O performance patterns and to guide the downstream resource scheduling and mapping. Based on our findings, we propose a variety of mathematical and empirical resource scheduling methods to improve overall system performance. We model the resource mapping for end-to-end data replication as a min-sum-max resource allocation problem (MSMRAP), prove its prohibitive computational complexity, and give possible solutions. Finally, we integrate the proposed optimization strategies into a complete data replication solution that employs the asynchronous programming paradigm and supports resource-aware task scheduling and data preprocessing to maximize the capacity of state-of-the-art hardware systems.
Evaluation results obtained from a fully featured WAN testbed confirm the effectiveness and remarkable performance advantages of our proposed software system for a comprehensive set of workloads: 28%-160% higher bandwidth for transferring large files, a 1.7x-66x speed-up for small files, and up to 108% more throughput for mixed workloads, compared to the widely adopted tools GridFTP, BBCP, and Aspera. This dissertation leverages the large body of multicore research already accomplished within the HPC community and implements multicore-aware schedulers to improve processor, memory, and I/O affinities for the individual tasks (file caching, compression, encryption, and network transport) involved in end-to-end data replication. Traditional synchronous processing, storage I/O, and network send/receive, though easy to implement, become bottlenecks in harnessing multi-/many-core architectures. Asynchronous operations, commonly found in RDMA, advanced storage I/O, and exascale computing, demonstrate superior performance and great flexibility over their synchronous counterparts. We designed an asynchronous, high-throughput data replication system for multicore/many-core platforms that allows users to plug in comprehensive libraries for data compression, encryption, transformation, and checksumming for different processing environments. This dissertation paves the way for advancing large-scale data transfers in excess of 100 Gbps and for bridging the gap between bare-metal network performance and effective end-to-end data transfer capability. The research outcomes are expected to have high visibility in high-speed networks, data management middleware, cloud computing, and exascale supercomputing.
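The abstract names a min-sum-max resource allocation problem (MSMRAP) without stating its formulation. The following is a minimal illustrative sketch, not the dissertation's exact model; the symbols n, P_i, w_{ij}, x_{ij}, and C_j are assumptions introduced here. Suppose n concurrent transfers, where transfer i traverses a set P_i of resources (cores, PCI buses, I/O controllers, NICs) along its end-to-end path, w_{ij} is the work transfer i places on resource j, x_{ij} is the share of resource j granted to transfer i, and C_j is the capacity of resource j. One plausible min-sum-max objective is

    \min_{x} \; \sum_{i=1}^{n} \; \max_{j \in P_i} \frac{w_{ij}}{x_{ij}}
    \quad \text{subject to} \quad \sum_{i=1}^{n} x_{ij} \le C_j \;\; \forall j, \qquad x_{ij} \ge 0.

Under this reading, each transfer's completion time is governed by its slowest (bottleneck) resource along its path, and the scheduler minimizes the sum of these bottleneck times across all transfers: the inner max captures the per-path bottleneck, the outer sum the aggregate cost.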
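The abstract also describes multicore-aware scheduling of pipeline tasks (caching, compression, encryption, checksum, network transport) with explicit processor affinity. The sketch below is a hedged illustration only, not code from the dissertation: the stage layout, the CPU assignments (it assumes a Linux host with at least four cores), and the zlib/sha256 plug-ins are assumptions chosen to show the idea of pinning each stage's workers to a dedicated CPU set.

    import hashlib
    import os
    import zlib
    from concurrent.futures import ProcessPoolExecutor

    def pin_to_cpus(cpus):
        # Bind this worker process to a fixed CPU set (Linux-only call),
        # approximating the processor/memory affinity the abstract describes.
        os.sched_setaffinity(0, cpus)

    def compress_chunk(chunk):
        # Stand-in preprocessing plug-in; a compression, encryption, or
        # transformation library could be swapped in here.
        return zlib.compress(chunk)

    def checksum_chunk(chunk):
        # Stand-in integrity plug-in.
        return hashlib.sha256(chunk).hexdigest()

    if __name__ == "__main__":
        chunks = [os.urandom(1 << 20) for _ in range(8)]  # eight 1 MiB chunks

        # Hypothetical placement: compression workers pinned to CPUs {0, 1},
        # checksum workers to CPUs {2, 3}, keeping each stage's working set local.
        with ProcessPoolExecutor(max_workers=2, initializer=pin_to_cpus,
                                 initargs=({0, 1},)) as compress_pool, \
             ProcessPoolExecutor(max_workers=2, initializer=pin_to_cpus,
                                 initargs=({2, 3},)) as checksum_pool:
            compressed = list(compress_pool.map(compress_chunk, chunks))
            digests = list(checksum_pool.map(checksum_chunk, compressed))

        print("processed", len(digests), "chunks; first digest", digests[0][:16])

In an asynchronous design such as the one the abstract describes, these stages would overlap through futures or queues rather than the blocking map() calls used here for brevity.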
dcterms.available: 2017-09-20T16:52:45Z
dcterms.contributor: Yu, Dantong
dcterms.contributor: Tang, Wendy
dcterms.contributor: Jin, Shudong
dcterms.contributor: Ferdman, Mike
dcterms.contributor: Harrison, Robert
dcterms.creator: Li, Tan
dcterms.dateAccepted: 2017-09-20T16:52:45Z
dcterms.dateSubmitted: 2017-09-20T16:52:45Z
dcterms.description: Department of Electrical Engineering.
dcterms.extent: 153 pg.
dcterms.format: Monograph
dcterms.format: Application/PDF
dcterms.identifier: http://hdl.handle.net/11401/77471
dcterms.issued: 2015-12-01
dcterms.language: en_US
dcterms.provenance: Made available in DSpace on 2017-09-20T16:52:45Z (GMT). No. of bitstreams: 1 Li_grad.sunysb_0771E_12634.pdf: 1870519 bytes, checksum: e9c4fbbd7e530782ddcd78860ec95150 (MD5) Previous issue date: 1
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Data Replication, High Performance, Input/Output (I/O), Multicore, Non-Uniform Memory Access (NUMA)
dcterms.subject: Electrical engineering
dcterms.title: Harnessing Multicore Parallelism for High Performance Data Replication
dcterms.type: Dissertation

