Performance Testing of Hive, esProc, and Impala | Part 2

Comparison of Hive, Impala and esProc in terms of computing performance

In the previous article, we tested grouping computation. In this article, we test associating (join) computation on the same three solutions and compare the results.

Associating computing test on narrow tables

Data sample:

Associated table p_narrow.

Col. count: 11

Row count: 500 million

Space occupied if saved as text: 120.6 GB.

Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 string,col2 int,col3 int,col4 int,col5 string

Dimension table d_narrow

Col. count: 9

Row count: 10 million rows

Space occupied if saved as text: 563 MB.

Data structure: id int, parentid int, col1 int, col2 int, col3 int, col4 int, col5 int, col6 int, col7 int

Description:

Associated table: It plays the role of the left-hand table in a SQL join and has a large number of rows, for example, an order table.

Dimension table: It plays the role of the right-hand table in a SQL join and has far fewer rows than the associated table (10 million here), for example, a table of client IDs and client names.

Test case:

Hive:

select sum(p_narrow.col3), count(p_narrow.col5), sum(d_narrow.col7), d_narrow.id%10000 from p_narrow join d_narrow on d_narrow.id=p_narrow.col7 group by d_narrow.id%10000

esProc: The code consists of three parts: the program for the summary machine, the main program for the node machine, and the subprogram for the node machine.

Impala:

select sum(p_narrow.col3), count(p_narrow.col5), sum(d_narrow.col7), d_narrow.id%10000 from p_narrow join d_narrow on d_narrow.id=p_narrow.col7 group by d_narrow.id%10000
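To make the logic of the query concrete, here is a minimal Python sketch of what it computes: a hash join of the large table against the dimension table, followed by grouping on id % 10000. The tiny in-memory rows and the function name join_and_group are made up for illustration, the join key is taken from the query as written, and the sketch assumes that id is unique in the dimension table. It does not describe how Hive, Impala, or esProc execute the query internally.

```python
from collections import defaultdict

def join_and_group(p_rows, d_rows, modulus=10000):
    """Hash-join p_rows to d_rows on p.col7 == d.id, then group by d.id % modulus.

    Returns {group_key: (sum_p_col3, count_p_col5, sum_d_col7)}.
    """
    # Build a hash table on the smaller (dimension) table; assumes unique ids.
    dim_by_id = {d["id"]: d for d in d_rows}

    groups = defaultdict(lambda: [0, 0, 0])
    for p in p_rows:
        d = dim_by_id.get(p["col7"])
        if d is None:                      # inner join: drop unmatched rows
            continue
        acc = groups[d["id"] % modulus]
        acc[0] += p["col3"]                # sum(p.col3)
        if p["col5"] is not None:          # count(col) skips NULL values
            acc[1] += 1
        acc[2] += d["col7"]                # sum(d.col7)
    return {key: tuple(acc) for key, acc in groups.items()}

# Hypothetical sample rows, only to exercise the sketch.
d_rows = [{"id": 1, "col7": 10}, {"id": 10001, "col7": 20}, {"id": 2, "col7": 30}]
p_rows = [
    {"col3": 5, "col5": "a", "col7": 1},
    {"col3": 7, "col5": None, "col7": 10001},
    {"col3": 1, "col5": "b", "col7": 2},
    {"col3": 9, "col5": "c", "col7": 99},  # no match in d_rows
]
result = join_and_group(p_rows, d_rows)
print(result)
```

Building the hash table on the dimension table keeps memory use proportional to the smaller input, while the large table is only streamed once.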

Test results:

| Hive | Impala | esProc |
|------|--------|--------|
| 773s | 262s | 279s |

Result description:

1. esProc and Impala clearly outperform Hive, running almost 3 times as fast.

2. Impala is slightly faster than esProc, but the difference is small.

Associating computation test on wide tables

Data sample:

Associated table p

Col. count: 106

Row count: 60 million rows

Space occupied if saved as text: 127.9 GB.

Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,col24 int,col25 int,col26 int,col27 int,col28 int,col29 int,col30 int,col31 int,col32 int,col33 int,col34 int,col35 int,col36 int,col37 int,col38 int,col39 int,col40 int,col41 int,col42 int,col43 int,col44 int,col45 int,col46 int,col47 int,col48 int,col49 int,col50 int,col51 int,col52 int,col53 int,col54 int,col55 int,col56 int,col57 int,col58 int,col59 int,col60 int,col61 int,col62 int,col63 int,col64 int,col65 int,col66 int,col67 int,col68 int,col69 int,col70 int,col71 int,col72 int,col73 int,col74 int,col75 int,col76 int,col77 int,col78 int,col79 int,col80 int,col81 int,col82 int,col83 int,col84 string,col85 string,col86 string,col87 string,col88 string,col89 string,col90 string,col91 string,col92 string,col93 string,col94 string,col95 string,col96 string,col97 string,col98 string,col99 string,col100 string

Dimension table d

Col. count: 102

Row count: 10 million rows

Space occupied if saved as text: 6.8 GB.

Data structure: id int, parentid int,col1 int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,col24 int,col25 int,col26 int,col27 int,col28 int,col29 int,col30 int,col31 int,col32 int,col33 int,col34 int,col35 int,col36 int,col37 int,col38 int,col39 int,col40 int,col41 int,col42 int,col43 int,col44 int,col45 int,col46 int,col47 int,col48 int,col49 int,col50 int,col51 int,col52 int,col53 int,col54 int,col55 int,col56 int,col57 int,col58 int,col59 int,col60 int,col61 int,col62 int,col63 int,col64 int,col65 int,col66 int,col67 int,col68 int,col69 int,col70 int,col71 int,col72 int,col73 int,col74 int,col75 int,col76 int,col77 int,col78 int,col79 int,col80 int,col81 int,col82 int,col83 int,col84 int,col85 int,col86 int,col87 int,col88 int,col89 int,col90 int,col91 int,col92 int,col93 int,col94 int,col95 int,col96 int,col97 int,col98 int,col99 int,col100 int

Description:

Associated table: It plays the role of the left-hand table in a SQL join and has a large number of rows, for example, an order table.

Dimension table: It plays the role of the right-hand table in a SQL join and has fewer rows than the associated table (10 million here), for example, a table of client IDs and client names.

Test case:

Hive:

select sum(p.col3), count(p.col5), sum(d.col7), d.id%10000 from p join d on d.id=p.col7 group by d.id%10000

esProc: The code consists of three parts: the program for the summary machine, the main program for the node machine, and the subprogram for the node machine.

Impala:

select sum(p.col3), count(p.col5), sum(d.col7), d.id%10000 from p join d on d.id=p.col7 group by d.id%10000

Test results:

| Hive | Impala | esProc |
|------|--------|--------|
| 525s | 269s | 268s |

Result description:

Let's summarize the results of the four tests and explain them one by one.

Grouping and summarizing for narrow table

| Test case | Hive | Impala | esProc |
|-----------|------|--------|--------|
| 1 col. for grouping and 1 col. for summarizing | 501s | 256s | 233s |
| 1 col. for grouping and 4 col. for summarizing | 508s | 254s | 237s |
| 4 col. for grouping and 1 col. for summarizing | 509s | 253s | 237s |
| 4 col. for grouping and 4 col. for summarizing | 536s | 255s | 237s |

1. esProc and Impala clearly outperform Hive, running about twice as fast or better.

2. esProc is slightly faster than Impala, but the difference is small.

3. The number of columns used for grouping and summarizing has little impact on the performance of the three solutions.

Grouping and summarizing for wide table

| Test case | Hive | Impala | esProc |
|-----------|------|--------|--------|
| 1 col. for grouping and 1 col. for summarizing | 457s | 272s | 218s |
| 1 col. for grouping and 4 col. for summarizing | 458s | 265s | 218s |
| 4 col. for grouping and 1 col. for summarizing | 475s | 266s | 219s |
| 4 col. for grouping and 4 col. for summarizing | 488s | 271s | 218s |

1. esProc and Impala clearly outperform Hive, running roughly 1.7 to 2.2 times as fast.

2. esProc is somewhat faster than Impala, but the difference is not great.

3. The number of columns used for grouping and summarizing has little impact on the performance of the three solutions.

4. Comparing these figures with those for the narrow table, you may find that the number of table columns makes no difference to performance, while the volume of the whole table has a direct impact. In addition, on the wide table Impala is slightly slower, while Hive and esProc are a bit faster.
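The comparison between the narrow-table and wide-table grouping results can be checked with a few lines of arithmetic. The sketch below copies the elapsed times from the two grouping tables and computes, per solution, the mean time and the spread between the slowest and fastest test case; the variable and function names are chosen only for this illustration.

```python
from statistics import mean

# Elapsed seconds for the four grouping tests, copied from the tables above
# (order: 1x1, 1x4, 4x1, 4x4 grouping x summarizing columns).
narrow = {
    "Hive":   [501, 508, 509, 536],
    "Impala": [256, 254, 253, 255],
    "esProc": [233, 237, 237, 237],
}
wide = {
    "Hive":   [457, 458, 475, 488],
    "Impala": [272, 265, 266, 271],
    "esProc": [218, 218, 219, 218],
}

def summarize(times):
    """Return (mean, spread), where spread = slowest run - fastest run."""
    return mean(times), max(times) - min(times)

report = {}
for engine in narrow:
    n_mean, n_spread = summarize(narrow[engine])
    w_mean, w_spread = summarize(wide[engine])
    report[engine] = {
        "narrow_mean": n_mean, "narrow_spread": n_spread,
        "wide_mean": w_mean, "wide_spread": w_spread,
        "wide_minus_narrow": w_mean - n_mean,   # negative = faster on wide table
    }
    print(engine, report[engine])
```

A negative wide_minus_narrow value means the solution was faster on the wide table; the small spreads reflect how little the column counts matter.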

Associating computation on narrow tables

| Hive | Impala | esProc |
|------|--------|--------|
| 773s | 262s | 279s |

1. esProc and Impala clearly outperform Hive, running almost 3 times as fast.

2. Impala is slightly faster than esProc, but the difference is small.

Associating computation on wide table

| Hive | Impala | esProc |
|------|--------|--------|
| 525s | 269s | 268s |

1. esProc and Impala greatly outperform Hive, running almost twice as fast.

2. Impala is 1 second slower than esProc. The difference is so slight that the two can be regarded as performing equally well.

Interpretation and Analysis:

The performance of Hive is rather poor, which is easy to understand: MapReduce, the infrastructure underlying Hive, exchanges data between computational nodes via files in external storage, so a great deal of time is spent on hard disk I/O. Impala and esProc perform better because they exchange intermediate results directly through memory. However, Impala is not dozens of times faster than Hive, as is widely believed.
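How far the measured gap is from "dozens of times" can be seen by dividing Hive's elapsed time by that of the other two solutions for every test reported in this series. The sketch below uses only the timings from the result tables; the dictionary keys are labels invented for this illustration.

```python
# Elapsed seconds (Hive, Impala, esProc) for every test in the result tables.
results = {
    "group narrow 1x1": (501, 256, 233),
    "group narrow 1x4": (508, 254, 237),
    "group narrow 4x1": (509, 253, 237),
    "group narrow 4x4": (536, 255, 237),
    "group wide 1x1":   (457, 272, 218),
    "group wide 1x4":   (458, 265, 218),
    "group wide 4x1":   (475, 266, 219),
    "group wide 4x4":   (488, 271, 218),
    "join narrow":      (773, 262, 279),
    "join wide":        (525, 269, 268),
}

# Speedup over Hive = Hive's elapsed time divided by the other solution's time.
impala_speedups = {name: hive / impala for name, (hive, impala, _) in results.items()}
esproc_speedups = {name: hive / esproc for name, (hive, _, esproc) in results.items()}

best_case = max(impala_speedups, key=impala_speedups.get)
worst_case = min(impala_speedups, key=impala_speedups.get)
print(f"Impala vs Hive: best {impala_speedups[best_case]:.2f}x ({best_case}), "
      f"worst {impala_speedups[worst_case]:.2f}x ({worst_case})")
print(f"esProc vs Hive: best {max(esproc_speedups.values()):.2f}x, "
      f"worst {min(esproc_speedups.values()):.2f}x")
```

In these tests neither solution reaches a threefold advantage over Hive.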

Exchanging data through files does bring a benefit: it ensures the reliability of intermediate results in the unstable environment of a large cluster. esProc supports both ways of exchanging data (the programmer chooses), Impala supports only direct exchange, and Hive supports only file exchange.

For grouping and summarizing, esProc performs slightly better than Impala. This is mainly because esProc accesses the local disk directly, whereas Impala must go through HDFS to access the hard disk. The extra layer of control naturally slows the process down.

However, in associating computation the relative performance of esProc and Impala is the reverse of that in grouping and summarizing: Impala is equal to or slightly faster than esProc. This is probably because Impala generates native code at run time, so in CPU-bound computation it is slightly faster than esProc, which executes code by interpretation. So, although Impala relies on HDFS to access the hard disk, its CPU efficiency makes up for the lost time. As you can imagine, in grouping and summarizing the time spent on hard disk access is much greater than the time spent on CPU computing, whereas in associating computation the share of CPU computing grows, so that Impala overtakes esProc.

In addition, it follows from this analysis that the ratio of CPU computation to hard disk access is higher for narrow-table operations than for wide-table operations. The test data also show that Impala's advantage is more obvious when handling narrow tables, which supports the above assumption from another perspective.
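The reversal between grouping and associating computation can be read directly from the timings. As a rough check, the sketch below divides esProc's mean elapsed time by Impala's for each family of tests, using only the figures from the result tables; the family labels are invented for this illustration.

```python
from statistics import mean

# Elapsed seconds (Impala runs, esProc runs) per test family, from the tables.
families = {
    "grouping, narrow": ([256, 254, 253, 255], [233, 237, 237, 237]),
    "grouping, wide":   ([272, 265, 266, 271], [218, 218, 219, 218]),
    "join, narrow":     ([262], [279]),
    "join, wide":       ([269], [268]),
}

# ratio > 1 means Impala finished sooner than esProc; ratio < 1 means the opposite.
ratios = {
    name: mean(esproc) / mean(impala)
    for name, (impala, esproc) in families.items()
}

for name, ratio in ratios.items():
    print(f"{name}: esProc time / Impala time = {ratio:.2f}")
```

Only the narrow-table join gives a ratio above 1; the wide-table join is practically a tie, and both grouping families favor esProc.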

The number of columns used for grouping and summarizing has no great impact on performance. This is because the computation in this case is quite simple, and most of the time is spent on hard disk access rather than on computing. However, Hive and Impala, unlike esProc, are not procedural languages; they cannot express complex computations, so the CPU is commonly left idle in this way.

In addition, we limited the computations in the above tests to a relatively small result set. This is because Impala relies heavily on memory, and a big result set would cause memory overflow. Hive supports only external storage computation and so has no memory limitation. With a modified algorithm, esProc can also perform external storage computation, but its performance will degrade.
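Why the result set stays small follows from the queries themselves: they group by id%10000, so, assuming non-negative integer ids, there can be at most 10,000 groups no matter how many rows are scanned. The sketch below illustrates this with a streaming aggregation over hypothetical generated rows; the function name and the input are assumptions made only for this example.

```python
from collections import defaultdict

def grouped_sums(rows, modulus=10000):
    """Stream (id, value) pairs and keep one running sum per id % modulus."""
    sums = defaultdict(int)
    for row_id, value in rows:
        sums[row_id % modulus] += value
    return sums

# Hypothetical input: one million rows with non-negative ids, produced lazily
# so that only the per-group accumulators are held in memory.
rows = ((i, 1) for i in range(1_000_000))
sums = grouped_sums(rows)
print(len(sums))   # number of groups, bounded by the modulus
```

Memory use here depends on the number of groups rather than on the number of input rows, which is what keeps an in-memory aggregation feasible for such queries.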

Web: http://www.raqsoft.com/product-esproc

Personal Blog: http://www.datakeyword.blogspot.com/

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.
