White Paper

Prepared for: Vertica

Optimize Analytical Processing

Balance Costs and Performance Between MPP Databases and Apache Spark

Different Designs for Different Functions

Apache Spark and massively parallel processing (MPP) analytical databases are designed for different purposes. The first generation of “big data” architectures relied on Hadoop and its distributed MapReduce framework for analytical processing. This framework was a breakthrough because it increased the amount of data that could be processed, but it operated in batch mode, which limited its applicability for interactive analyses. Spark removed the batch processing limitation of MapReduce, making interactive analyses on big data practical. It also added capabilities for streaming analyses and machine learning, but it does not include its own persistent storage layer.

Distributed MPP systems are designed for scalable, high-performance analytical database operations. These database systems spread processing across multiple compute resources to provide scalability and enhance performance while maintaining transactional consistency, with support for data updates and deletes. Many applications, such as customer billing or financial systems, require the transactional consistency or repeatability that the relational database technology underlying MPP systems provides. These systems also use a variety of optimization techniques to deliver very high performance across a wide range of analyses, whether they involve small numbers of records or very large numbers of records. And while the best implementations of MPP systems are not limited to SQL processing, the wide availability of SQL skills and tools makes it easier to deploy them and integrate them into an organization’s information architecture.

 
 
