
Models of Data Storage Tables Connection Processes by MapReduce/Spark Technology

Authors: Proletarskaya V.A., Grigoriev Yu.A. Published: 13.10.2019
Published in issue: #5(128)/2019  
DOI: 10.18698/0236-3933-2019-5-79-94

Category: Informatics, Computer Engineering and Control | Chapter: Mathematical Modelling, Numerical Methods, and Program Complexes  
Keywords: Big Data, MapReduce, Apache Spark, Spark SQL, Bloom filter, TPC-H, process models, data storage

A model is developed, and an estimate obtained, for the amount of data transmitted over the network when a table is replicated across nodes and a Bloom filter is used in the MapReduce/Spark environment. Models are also developed for executing database table join queries with a cascading Bloom filter in the same environment. Two join cases are considered: 1) several bushes with one dimension in each; 2) one bush with several dimensions (a star-schema data warehouse). An estimate is obtained for the volume of Bloom filter data transmitted over the network when the tables are joined. Using query Q3 of the TPC-H benchmark as an example, the adequacy of the estimated gain in the amount of data transmitted over the network with the cascading Bloom filter is analyzed. The prediction error was 2 %.
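To illustrate why a Bloom filter reduces network traffic in a join, here is a minimal single-process sketch in Python. It is not the authors' model or their MapReduce/Spark implementation. The names `BloomFilter` and `bloom_join`, the filter size, the number of hash functions, and the assumption that dimension keys are unique are all hypothetical choices made for illustration. The idea shown is only this: a compact filter built from the (already filtered) dimension keys is applied to the fact table, so only the fact rows that pass the filter would need to be sent over the network for the actual join.

```python
import hashlib


class BloomFilter:
    """Simple Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def bloom_join(dimension, fact, num_bits=1024, num_hashes=3):
    """Join fact rows (key, value) with dimension rows (key, attr).

    Returns the joined rows and the number of fact rows that passed the
    filter, i.e. the rows that would be shuffled in a distributed setting.
    Assumes dimension keys are unique.
    """
    bf = BloomFilter(num_bits, num_hashes)
    for key, _ in dimension:
        bf.add(key)

    # Fact rows rejected by the filter never leave their node.
    candidates = [row for row in fact if row[0] in bf]

    # The exact join removes any false positives the filter let through.
    lookup = dict(dimension)
    joined = [(k, v, lookup[k]) for k, v in candidates if k in lookup]
    return joined, len(candidates)
```

A Bloom filter has no false negatives, so the join result is exact; false positives only increase the number of rows shuffled. In a cascading arrangement, one such filter per dimension would be applied to the fact table in sequence.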


