
Models of Data Storage Tables Connection Processes by MapReduce/Spark Technology

Authors: Proletarskaya V.A., Grigoriev Yu.A. | Published: 13.10.2019
Published in issue: #5(128)/2019
DOI: 10.18698/0236-3933-2019-5-79-94
Category: Informatics, Computer Engineering and Control | Chapter: Mathematical Modelling, Numerical Methods, and Program Complexes
Keywords: Big Data, MapReduce, Apache Spark, Spark SQL, Bloom filter, TPC-H, process models, data storage

A model is developed for joining tables in the MapReduce/Spark environment when a table is duplicated across nodes and a Bloom filter is used, and an estimate of the amount of data transmitted over the network is obtained. Models are also developed for executing table-join queries with a cascading Bloom filter in the same environment. Two cases of joining tables are considered: 1) several bushes with one dimension in each of them; 2) one bush with several dimensions, that is, star-type storage. An estimate is obtained of the Bloom filter volume transmitted over the network when the tables are joined. Using query Q3 of the TPC-H benchmark as an example, the adequacy of the estimated gain in the amount of data transmitted over the network with the cascading Bloom filter is analyzed. The prediction error was 2%.
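The idea of cutting network traffic with a Bloom filter can be illustrated with a minimal, single-machine sketch. It is not the authors' model or their Spark implementation: the class `BloomFilter`, the function `bloom_join`, the 1% false-positive target, and the sample data are all hypothetical, and only one dimension table is covered. A filter is built from the dimension keys, and fact rows that fail the filter are dropped before they would be sent across the network.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float) -> None:
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_join(fact_rows, dim_rows):
    """Join fact rows (key, payload) with dimension rows (key, attribute)."""
    dim = dict(dim_rows)
    bloom = BloomFilter(expected_items=max(1, len(dim)), fp_rate=0.01)
    for key in dim:
        bloom.add(key)
    # Only fact rows that pass the filter would need to cross the network.
    shuffled = [row for row in fact_rows if row[0] in bloom]
    # The exact join discards the filter's false positives.
    joined = [(k, payload, dim[k]) for k, payload in shuffled if k in dim]
    return shuffled, joined


# Hypothetical data: 10,000 fact rows over 1,000 keys; the dimension keeps 50 keys.
fact_rows = [(i % 1000, f"fact-{i}") for i in range(10_000)]
dim_rows = [(k, f"dim-{k}") for k in range(0, 1000, 20)]
shuffled, joined = bloom_join(fact_rows, dim_rows)
print(len(fact_rows), len(shuffled), len(joined))
```

The three printed counts are the fact rows before filtering, the rows that survive the filter, and the rows in the final join. The gap between the first two is the traffic saved, while the filter itself (`num_bits` bits) is the extra volume that has to be distributed to the nodes.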
