Welcome!

Article

Performance Comparison Testing of Hive, esProc, and Impala Part 1

Three data computing languages

Performance comparison within Hive, Impala and esProc in grouping

summarizing, and join computing.

Hardware environment

PC count: 4
CPU: Intel Core i5 2500 (4 cores)
RAM: 16G
HDD: 2T/7200 rpm
Ethernet adapter: 1000 M

Software environment

OS: CentOS6. 4
JDK: 1. 6
Hadoop/hdfs 2. 2. 0

Test Result

Hive  0. 11. 0
esProc 3. 1
Impala 1. 2. 0

Data sampling

1. Restart PC before every test
2. Print the start time in the log before executing task
3. Print the end time in the log after executing task
4. Subtract the starting time from the ending time as the reference result
5. Repeat the step 1-4 for three times, and get the average value of the reference result as the final result of the test of this round

Test scenario

In order to ensure the test data is typical and comparable, the three products must go through the same computing. The Hive or Impala is designed for the data warehouse, providing the SQL-like syntax as the only available syntax. By comparison, esProc is designed as the complex procedural computing script, but not the data warehouse. In other words, esProc does not provide the SQL -style syntax directly, and esProc script can achieve the result of SQL computing by simulating in a more convenient style. So, the test computation this time is the SQL-style grouping, summarizing, and join operations.

In this test report, we use the HDFS and Hive incorporated in CDH5.0beta, while not the Hadoop that issued separately. This is because the Hadoop deployment and setup is rather complex, and the testing environment can frequently go wrong. But it is comparatively easy for CDH. esProc is easy to setup with an installation package of dozens MBs.

esProc supports both HDFS and the much faster operations on local disks, while Hive or Impala only supports HDFS. In order to test the extreme performances of these three solutions, esProc use the local disk for test, and split the data into several files and distribute them on several machines in advance, while Hive or Impala uses HDFS.

Grouping and Summarizing Test for Narrow Table

Data sample:
Table name: p_narrow
Col. count: 11
Row count: 500 million rows
Space occupied if saving as text: 120. 6G.
Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 string,col2 int,col3 int,col4 int,col5 string
Test case:
1.1 col. to group & 1 col. to summarize
Hive: select personid%10000, sum(col3) from p_narrow group by personid%10000
esProc: The codes fall into 3 parts. They are respectively: Program of summary machine, main program for node machine, and subprogram for node machine.

 

 


Impala: select personid%10000, sum(col3) from p_narrow group by personid%10000

2. 1 col. to group & 4 col. to summarize

Hive: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid;count(cul1count): cul1count,max(cul2count): cul2count,sum(cul3sum): cul3sum,count(cul5): cul5count)
The main program for node machine in cell A5 is:
=A4. groups@o(personid: groups@o(personid: cu1count,max(col2count): cul2count,sum(col3sum): cul3sum,count(col5): cul5count)
The main program for node machine in cell A1 is:
=cursor. groups(personid%10000: personid; count(col1count): co1count, max(col2count): col2count, sum(col3sum): col3sum,count(col5): col5count)
Impala: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10

3. 4 col. to group & 1 col. to summarize

Hive: select personid%10, cityid%10, birthdayid%10, col4%10 from p_narrow group by personid%10,cityid%10,birthdayid%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; sum(cul3sum): cul3sum)
The main program for node machine in cell A5 is changed to:
=A4. groups@o(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; sum(col3sum): cul3sum)
The main program for node machine in cell A1 is changed to:
=cursor. groups(personid%10: personid, cityid%10: cityid, birthdayid%10: birthdayid, col4%10: col4; sum(col3sum): col3sum)
Impala: select personid%10, cityid%10, birthdayid%10, col4%10 from p_narrow group by personid%10,cityid%10,birthdayid%10,col4%10

4.4 col. to group & 4 col. to summarize

Hive: select personid%10, cityid%10, birthdayid%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10,cityid%10,birthdayid%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; count(cul1count): cul1count,max(cul2count): cul2count,sum(cul3sum): cul3sum,count(cul5): cul5count)
The main program for node machine in cell A5 is changed to:
=A4. groups@o(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; count(col1count): cu1count,max(col2count): cul2count,sum(col3sum): cul3sum,count(col5): cul5count)
The main program for node machine in cell A1 is changed to:
=cursor. groups(personid%10: personid, cityid%10: cityid, birthdayid%10: birthdayid, col4%10: col4; count(col1count): co1count, max(col2count): col2count, sum(col3sum): col3sum, count(col5): col5count)
Impala: select personid%10, cityid%10, birthdayid%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10,cityid%10,birthdayid%10,col4%10
Test results:

Test results:


Grouping and summarizing test for wide table

Data sample:
Table name: p
Col. count: 106
Row count: 60 million
Space occupied if saving as text: 127. 9G.
Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,col24 int,col25 int,col26 int,col27 int,col28 int,col29 int,col30 int,col31 int,col32 int,col33 int,col34 int,col35 int,col36 int,col37 int,col38 int,col39 int,col40 int,col41 int,col42 int,col43 int,col44 int,col45 int,col46 int,col47 int,col48 int,col49 int,col50 int,col51 int,col52 int,col53 int,col54 int,col55 int,col56 int,col57 int,col58 int,col59 int,col60 int,col61 int,col62 int,col63 int,col64 int,col65 int,col66 int,col67 int,col68 int,col69 int,col70 int,col71 int,col72 int,col73 int,col74 int,col75 int,col76 int,col77 int,col78 int,col79 int,col80 int,col81 int,col82 int,col83 int,col84 string,col85 string,col86 string,col87 string,col88 string,col89 string,col90 string,col91 string,col92 string,col93 string,col94 string,col95 string,col96 string,col97 string,col98 string,col99 string,col100 string

Test case:
1.1 col. to group & 1 col. to summarize
Hive: select personid%10000, sum(col3) from p group by personid%10000
esProc: The codes can be divided into 3 parts. They are respectively: Program for summary machine, main program for node machine, and subprogram for node machine.

 

 


Impala: select personid%10000, sum(col3) from p group by personid%10000

2.1 col. to group & 4 col. to summarize

Hive: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid;count(cul1count): cul1count,max(cul2count): cul2count,sum(cul3sum): cul3sum,count(cul5): cul5count)
The main program for node machine in cell A5 is changed to:
=A4. groups@o(personid: personid;count(col1count): cu1count,max(col2count): cul2count,sum(col3sum): cul3sum,count(col5): cul5count)
The main program for node machine in cell A1 is changed to:
=cursor. groups(personid%10000: personid; count(col1count): co1count, max(col2count): col2count, sum(col3sum): col3sum,count(col5): col5count)
Impala: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10

3.4 col. to group & 1 col. to summarize

Hive: select personid%10, cityid%10, birthdayid%10, col4%10 from p group by personid%10,cityid%10,birthdayid%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; sum(cul3sum): cul3sum)
The main program for node machine in cell A5 is changed to:
=A4. groups@o(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; sum(col3sum): cul3sum)
The main program for node machine in cell A1 is changed to:
=cursor. groups(personid%10: personid, cityid%10: cityid, birthdayid%10: birthdayid, col4%10: col4; sum(col3sum): col3sum)
Impala: select personid%10, cityid%10, birthdayid%10, col4%10 from p group by personid%10,cityid%10,birthdayid%10,col4%10

4.4 col. to group & 4 col. to summarize

Hive: select personid%10, cityid%10, birthdayid%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10,cityid%10,birthdayid%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3. groups(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; count(cul1count): cul1count,max(cul2count): cul2count,sum(cul3sum): cul3sum,count(cul5): cul5count)
The main program for node machine in cell A5 is changed to:
=A4. groups@o(personid: personid, cityid: cityid, birthdayid: birthdayid, col4: col4; count(col1count): cu1count,max(col2count): cul2count,sum(col3sum): cul3sum,count(col5): cul5count)
The main program for node machine in cell A1 is changed to:
=cursor. groups(personid%10: personid, cityid%10: cityid, birthdayid%10: birthdayid, col4%10: col4; count(col1count): co1count, max(col2count): col2count, sum(col3sum): col3sum, count(col5): col5count)
Impala: select personid%10, cityid%10, birthdayid%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10,cityid%10,birthdayid%10,col4%10
Test results:


The performance testing and result comparison regarding the join computing will be discussed in the next article: Performance Comparison Testing of Hive, esProc, and Impala Part 2.

Personal blog: http://www.datakeyword.blogspot.com/
Web: http://www.raqsoft.com/product-esproc

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.

Latest Stories
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo | @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
HyperConvergence came to market with the objective of being simple, flexible and to help drive down operating expenses. It reduced the footprint by bundling the compute/storage/network into one box. This brought a new set of challenges as the HyperConverged vendors are very focused on their own proprietary building blocks. If you want to scale in a certain way, let’s say you identified a need for more storage and want to add a device that is not sold by the HyperConverged vendor, forget about it...
Most companies are adopting or evaluating container technology - Docker in particular - to speed up application deployment, drive down cost, ease management and make application delivery more flexible overall. As with most new architectures, this dream takes a lot of work to become a reality. Even when you do get your application componentized enough and packaged properly, there are still challenges for DevOps teams to making the shift to continuous delivery and achieving that reduction in cost ...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
FinTech is the sum of financial and technology, and it’s one of the fastest growing tech industries. Total global investments in FinTech almost reached $50 billion last year, but there is still a great deal of confusion over what it is and what it means – especially as it applies to retirement. Building financial startups is not simple, but with the right team, technology and an innovative approach it can be an extremely interesting domain to disrupt. FinTech heralds a financial revolution that...
What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
Imagine having the ability to leverage all of your current technology and to be able to compose it into one resource pool. Now imagine, as your business grows, not having to deploy a complete new appliance to scale your infrastructure. Also imagine a true multi-cloud capability that allows live migration without any modification between cloud environments regardless of whether that cloud is your private cloud or your public AWS, Azure or Google instance. Now think of a world that is not locked i...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
Historically, some banking activities such as trading have been relying heavily on analytics and cutting edge algorithmic tools. The coming of age of powerful data analytics solutions combined with the development of intelligent algorithms have created new opportunities for financial institutions. In his session at 20th Cloud Expo, Sebastien Meunier, Head of Digital for North America at Chappuis Halder & Co., will discuss how these tools can be leveraged to develop a lasting competitive advanta...
MongoDB Atlas leverages VPC peering for AWS, a service that allows multiple VPC networks to interact. This includes VPCs that belong to other AWS account holders. By performing cross account VPC peering, users ensure networks that host and communicate their data are secure. In his session at 20th Cloud Expo, Jay Gordon, a Developer Advocate at MongoDB, will explain how to properly architect your VPC using existing AWS tools and then peer with your MongoDB Atlas cluster. He'll discuss the secur...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
SYS-CON Events announced today that SD Times | BZ Media has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. BZ Media LLC is a high-tech media company that produces technical conferences and expositions, and publishes a magazine, newsletters and websites in the software development, SharePoint, mobile development and commercial UAV markets.
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...