Every second, a huge amount of data is generated and stored worldwide, and the volume keeps growing. Companies and institutions such as banks, airlines, telephone operators, retail chains and government agencies produce millions of records daily. When this information is stored, analyzed and organized to produce results, a relatively recent but increasingly popular term comes into play: big data.
To deal with this huge amount of data, new concepts, perspectives and paradigms have emerged. Apache Hadoop was (and still is) a strong framework based on the MapReduce paradigm developed by Google. Over time, new tools have been created to solve new problems (big data, big problems).
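To make the paradigm concrete, here is a minimal sketch of the MapReduce idea in plain Python (no Hadoop involved): a map phase emits (word, 1) pairs, a shuffle step groups them by key, and a reduce phase sums the counts per word. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not part of any framework.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for each word.
    return key, reduce(lambda a, b: a + b, values)

documents = ["big data", "big problems"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # → {'big': 2, 'data': 1, 'problems': 1}
```

In a real cluster, the map and reduce calls run on many machines in parallel and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.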
One of these tools is Apache Spark. Spark is a big data framework designed to process large data sets in a parallel and distributed way. It extends the MapReduce programming model popularized by Apache Hadoop, making it much easier to develop large-scale data processing applications. Beyond the extended programming model, Spark also delivers far better performance than Hadoop MapReduce, in some cases (notably iterative, in-memory workloads) running up to 100x faster.