
dc.identifier.uri: http://hdl.handle.net/11401/77471
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.
dc.format: Monograph
dc.format.medium: Electronic Resource
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Dissertation
dcterms.abstract: High-speed data replication is vital to data-intensive scientific computing, which often requires transferring large volumes of observation and simulation data efficiently among geographically dispersed facilities. Enterprise cloud services also depend on effective data replication to scale out of their own data centers and to optimize the work completed per dollar invested. In addition, end consumers are highly sensitive to the latency and responsiveness of sharing data among their networks of friends, web-based services, and personal mobile devices. Recent hardware advances offer the opportunity to enable ultra-high-speed data replication by aggressively adding more CPU cores, higher network bandwidth, and faster storage to a single commodity server. However, existing monolithic software was designed for the architectures of past decades and cannot mitigate the resulting system I/O bottlenecks. New, effective resource scheduling algorithms and software designs are indispensable for matching applications' actual performance to the bare-metal I/O capability. Designing an efficient end-to-end solution for high-speed data replication is non-trivial because of several interconnected factors: 1) optimal resource scheduling on multicore systems is proven NP-hard, and even heuristic algorithms incur prohibitive computation cost; 2) data replication involves many components along the end-to-end I/O paths, including PCI buses, CPU interconnect links, and various I/O controllers and chipsets, each of which can become a performance bottleneck; 3) heterogeneous I/O devices may exhibit different system performance under various workloads and access patterns, so data replication applications must choose appropriate optimization techniques and adjust their access patterns to achieve maximum system performance; and 4) scaling up to many requests and users necessitates maximal exploitation of the parallelism and concurrency available in state-of-the-art computer architectures. Together, these challenges call for a significant effort to rethink, redesign, and re-implement the entire software suite. We first analyze state-of-the-art I/O devices and multicore systems using benchmark tools and by monitoring various performance event counters. In addition, we propose a new metric (the NUMA scheduling factor) and a performance modeling method to accurately describe I/O performance patterns and to guide the downstream resource scheduling and mapping. Based on our findings, we propose a variety of mathematical and empirical resource scheduling methods to improve overall system performance. We model the resource mapping for end-to-end data replication as a min-sum-max resource allocation problem (MSMRAP), prove its prohibitive computational complexity, and give possible solutions. Finally, we integrate the proposed optimization strategies into a complete data replication solution that employs the asynchronous programming paradigm and supports resource-aware task scheduling and data preprocessing to maximize the capacity of state-of-the-art hardware systems.
Evaluation results obtained from a fully featured WAN testbed confirm the effectiveness and remarkable performance advantages of our proposed software system for a comprehensive set of workloads: 28%-160% higher bandwidth for transferring large files, a 1.7x-66x speed-up for small files, and up to 108% more throughput for mixed workloads, compared to the widely adopted tools GridFTP, BBCP, and Aspera. This dissertation leverages the large body of multicore research already accomplished within the HPC community and implements multicore-aware schedulers to improve processor, memory, and I/O affinities for the individual tasks (file caching, compression, encryption, and network transport) involved in end-to-end data replication. Traditional synchronous processing, storage I/O, and network send/receive, though easy to implement, become bottlenecks in harnessing multi-/many-core architectures. Asynchronous operations, commonly found in RDMA, advanced storage I/O, and exascale computing, demonstrate superior performance and great flexibility over their synchronous counterparts. We designed an asynchronous, high-throughput data replication system for multicore/many-core platforms that allows users to plug in comprehensive libraries for data compression, encryption, transformation, and checksumming for different processing environments. This dissertation paves the way for advancing large-scale data transfers in excess of 100 Gbps and for bridging the gap between bare-metal network performance and effective end-to-end data transfer capability. The research outcomes are expected to have high visibility in high-speed networks, data management middleware, cloud computing, and exascale supercomputing.
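The abstract names a min-sum-max resource allocation problem (MSMRAP) without stating its formulation. The following is a minimal illustrative sketch, not the dissertation's exact model; the symbols n, P_i, w_{ij}, x_{ij}, and C_j are assumptions introduced here. Suppose n concurrent transfers, where transfer i traverses a set P_i of resources (cores, PCI buses, I/O controllers, NICs) along its end-to-end path, w_{ij} is the work transfer i places on resource j, x_{ij} is the share of resource j granted to transfer i, and C_j is the capacity of resource j. One plausible min-sum-max objective is

    \min_{x} \; \sum_{i=1}^{n} \; \max_{j \in P_i} \frac{w_{ij}}{x_{ij}}
    \quad \text{subject to} \quad \sum_{i=1}^{n} x_{ij} \le C_j \;\; \forall j, \qquad x_{ij} \ge 0.

Under this reading, each transfer's completion time is governed by its slowest (bottleneck) resource along its path, and the scheduler minimizes the sum of these bottleneck times across all transfers: the inner max captures the per-path bottleneck, the outer sum the aggregate cost.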
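The abstract also describes multicore-aware scheduling of pipeline tasks (caching, compression, encryption, checksum, network transport) with explicit processor affinity. The sketch below is a hedged illustration only, not code from the dissertation: the stage layout, the CPU assignments (it assumes a Linux host with at least four cores), and the zlib/sha256 plug-ins are assumptions chosen to show the idea of pinning each stage's workers to a dedicated CPU set.

    import hashlib
    import os
    import zlib
    from concurrent.futures import ProcessPoolExecutor

    def pin_to_cpus(cpus):
        # Bind this worker process to a fixed CPU set (Linux-only call),
        # approximating the processor/memory affinity the abstract describes.
        os.sched_setaffinity(0, cpus)

    def compress_chunk(chunk):
        # Stand-in preprocessing plug-in; a compression, encryption, or
        # transformation library could be swapped in here.
        return zlib.compress(chunk)

    def checksum_chunk(chunk):
        # Stand-in integrity plug-in.
        return hashlib.sha256(chunk).hexdigest()

    if __name__ == "__main__":
        chunks = [os.urandom(1 << 20) for _ in range(8)]  # eight 1 MiB chunks

        # Hypothetical placement: compression workers pinned to CPUs {0, 1},
        # checksum workers to CPUs {2, 3}, keeping each stage's working set local.
        with ProcessPoolExecutor(max_workers=2, initializer=pin_to_cpus,
                                 initargs=({0, 1},)) as compress_pool, \
             ProcessPoolExecutor(max_workers=2, initializer=pin_to_cpus,
                                 initargs=({2, 3},)) as checksum_pool:
            compressed = list(compress_pool.map(compress_chunk, chunks))
            digests = list(checksum_pool.map(checksum_chunk, compressed))

        print("processed", len(digests), "chunks; first digest", digests[0][:16])

In an asynchronous design such as the one the abstract describes, these stages would overlap through futures or queues rather than the blocking map() calls used here for brevity.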
dcterms.available: 2017-09-20T16:52:45Z
dcterms.contributor: Yu, Dantong
dcterms.contributor: Tang, Wendy
dcterms.contributor: Jin, Shudong
dcterms.contributor: Ferdman, Mike
dcterms.contributor: Harrison, Robert
dcterms.creator: Li, Tan
dcterms.dateAccepted: 2017-09-20T16:52:45Z
dcterms.dateSubmitted: 2017-09-20T16:52:45Z
dcterms.description: Department of Electrical Engineering.
dcterms.extent: 153 pg.
dcterms.format: Monograph
dcterms.format: Application/PDF
dcterms.identifier: http://hdl.handle.net/11401/77471
dcterms.issued: 2015-12-01
dcterms.language: en_US
dcterms.provenance: Made available in DSpace on 2017-09-20T16:52:45Z (GMT). No. of bitstreams: 1 Li_grad.sunysb_0771E_12634.pdf: 1870519 bytes, checksum: e9c4fbbd7e530782ddcd78860ec95150 (MD5) Previous issue date: 1
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Data Replication, High Performance, Input/Output (I/O), Multicore, Non-Uniform Memory Access (NUMA)
dcterms.subject: Electrical engineering
dcterms.title: Harnessing Multicore Parallelism for High Performance Data Replication
dcterms.type: Dissertation

