Data availability analysis in P2P networks

Sanaullah, Nazir
P2P network architectures have gained popularity as applications for sharing files between users. A P2P network provides a scalable, robust, and economical storage architecture. These features have extended the use of P2P network applications from file sharing to data sharing in the video and telecommunication domains. The shift in storage from high-cost, reliable servers to user-centred storage devices has led to reliability and availability problems for P2P networks. Peers are users' machines, which can go offline at any time, and the data stored on them are unavailable while they are offline. Data replication, in which multiple copies of files are placed on different peers in the network, is a common approach for handling data unavailability. In data replication, peers transfer complete or partial data to other nodes, so replication provides higher data availability under churn. I present data replication algorithms in this thesis to improve the availability of data in the network. Because replication increases overhead as well as availability, the basic challenges faced when developing a data replication algorithm are: (i) How many replicas of a data object should be created? (ii) On which peer(s) should the replicated data objects be stored? (iii) Which files should be replicated? Initial work in data replication considered the static replication of data based on the overall availability of nodes in the network. These approaches overestimated the number of replicas, which led to high maintenance costs. Dynamic approaches for estimating replica numbers were developed to handle this issue. From the analysis of the current approaches, I found two issues. The first was that the proposed dynamic replication mechanisms did not provide a balanced replication of data: data were replicated only to highly available nodes, which became overloaded.
The second issue was the inability to adapt to the changing behaviour of peers. In this thesis, I present an approach that selects a node set comprising both highly available and less available nodes, in order to provide load balancing in the network. I provide a feedback-based approach in which previous behaviours are incorporated into the next behavioural analysis. Compared to the existing approaches to replica calculation, this approach is able to determine the appropriate number of replicas and placement locations as the dynamics of the system change. The replication system relies on node behaviour prediction algorithms that use Monte Carlo simulation and time series analysis. Each node analyses the historical traces of its online and offline times in the network. Each node shares its availability log with the replication initiator node, and future behaviour is predicted from the logs received. The data-owning peer uses this information to run the replica placement algorithm, which selects nodes that are present for a particular duration and that support each other's presence in the network. The system supports partial data replication by applying a Zipf distribution to identify the most popular files. I evaluated my replication approach against dynamic replica placement algorithms, based on the following parameters: replica count, reliability of data, average availability of nodes in the replica set, and failure analysis for querying data. The replica count analysis shows that the number of replicas required was almost half that of the previous dynamic approaches. The reliability analysis shows that the overall reliability of the data was better with this approach than with the other dynamic replica placement algorithms.
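The prediction step described above can be illustrated with a minimal Monte Carlo sketch. It is not the algorithm from the thesis: it assumes that a node's availability log can be reduced to two lists of past session lengths (online and offline), that future sessions are drawn independently from those lists, and that the node starts the prediction window online. The function name and parameters are hypothetical.

```python
import random

def simulate_online_fraction(online_durations, offline_durations,
                             horizon, trials=5000, seed=0):
    """Estimate the fraction of the next `horizon` time units a node is online.

    Assumptions (illustrative only): session lengths are resampled
    independently from the node's historical log, and the node is
    online at the start of the window.
    """
    if min(online_durations) <= 0 or min(offline_durations) <= 0:
        raise ValueError("session durations must be positive")
    rng = random.Random(seed)
    total_online = 0.0
    for _ in range(trials):
        t = 0.0
        online = True  # assumed starting state
        while t < horizon:
            history = online_durations if online else offline_durations
            # Draw one past session length and clip it to the window.
            d = min(rng.choice(history), horizon - t)
            if online:
                total_online += d
            t += d
            online = not online
    return total_online / (trials * horizon)

# Hypothetical log: past online and offline session lengths in hours.
estimate = simulate_online_fraction([2, 5, 8], [1, 3, 12], horizon=24)
```

A replication initiator could run such an estimate on each received log and rank candidate nodes by the result; the thesis additionally uses time series analysis, which this sketch does not cover.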
My replication algorithm produced replica sets with a lower average availability than the replica sets of the other approaches, but the reliability analysis suggests that my approach distributes data more evenly between nodes, resulting in better overall data availability. The availability of data in the network was higher than with the other approaches. The analysis of failed data requests shows that my replication algorithm has a better node selection mechanism than the other approaches, with better data availability.
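One way to see how a replica set with lower average node availability can still give higher data availability is a small worked example. It is a sketch under stated assumptions, not the evaluation from the thesis: each node's predicted behaviour is modelled as the set of hourly slots in which it is online, and data count as available in a slot if at least one replica holder is online. The schedules and function names are hypothetical.

```python
def average_availability(schedules, slots=24):
    """Mean fraction of slots in which each node of the replica set is online."""
    return sum(len(s) for s in schedules) / (len(schedules) * slots)

def data_availability(schedules, slots=24):
    """Fraction of slots in which at least one replica holder is online."""
    covered = set().union(*schedules)
    return len(covered) / slots

# Two highly available nodes that are online during the same hours.
overlapping = [set(range(0, 18)), set(range(0, 18))]
# One highly available node plus a less available node covering the gap.
complementary = [set(range(0, 18)), set(range(18, 24))]

for name, replica_set in [("overlapping", overlapping),
                          ("complementary", complementary)]:
    print(name,
          average_availability(replica_set),
          data_availability(replica_set))
```

In this constructed case the complementary set has the lower average availability (0.5 against 0.75) but covers every slot, whereas the overlapping set leaves six hours uncovered.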
Attribution-NonCommercial-NoDerivs 3.0 Ireland