Performance Comparison Testing of Hive, esProc, and Impala Part 1

Three data computing languages: a performance comparison of Hive, Impala, and esProc in grouping, summarizing, and join computing.

Hardware environment

PC count: 4
CPU: Intel Core i5 2500 (4 cores)
RAM: 16 GB
HDD: 2 TB, 7,200 rpm
Ethernet adapter: 1000 Mbps

Software environment

OS: CentOS 6.4
JDK: 1.6
Hadoop/HDFS: 2.2.0

Versions tested

Hive 0.11.0
esProc 3.1
Impala 1.2.0

Data sampling

1. Restart the PC before every test
2. Print the start time to the log before executing the task
3. Print the end time to the log after executing the task
4. Take the end time minus the start time as one reference result
5. Repeat steps 1-4 three times and take the average of the reference results as the final result for that round (a sketch of steps 2-5 follows below)
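
As a rough illustration, here is a minimal Python sketch of steps 2-5; run_task is a hypothetical stand-in for submitting the tested query, and the restart in step 1 happens outside the script:

import time

def measure(run_task, rounds=3):
    results = []
    for _ in range(rounds):
        start = time.time()             # step 2: record the start time
        run_task()                      # execute the tested task
        end = time.time()               # step 3: record the end time
        results.append(end - start)     # step 4: one reference result
    return sum(results) / len(results)  # step 5: average as the final result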

Test scenario

To make the test representative and comparable, the three products must perform the same computations. Hive and Impala are designed for data warehousing and offer SQL-like syntax as the only available syntax. esProc, by contrast, is designed for complex procedural computing scripts rather than data warehousing: it does not provide SQL-style syntax directly, but an esProc script can reproduce the result of a SQL computation in a fairly convenient style. The computations tested here are therefore SQL-style grouping, summarizing, and join operations.

In this test report, we use the HDFS and Hive bundled in CDH 5.0 beta rather than a separately released Hadoop, because deploying and configuring standalone Hadoop is rather complex and the test environment frequently went wrong, while CDH is comparatively easy. esProc is easy to set up, with an installation package of a few dozen MB.

esProc supports both HDFS and the much faster access to local disks, while Hive and Impala support only HDFS. To test the best performance of each solution, esProc reads from local disks, with the data split into several files and distributed across the machines in advance, while Hive and Impala use HDFS.
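
As a rough illustration (not the actual tool used in the test), pre-splitting a text file into one segment per node machine could look like the following Python sketch; the file names and the node count of 4 are assumptions:

# Split a large text file into 4 roughly equal segments, one per node machine.
with open("p_narrow.txt") as src:
    parts = [open("p_narrow_%d.txt" % i, "w") for i in range(4)]
    for n, line in enumerate(src):
        parts[n % 4].write(line)   # distribute rows round-robin
    for part in parts:
        part.close()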

Grouping and Summarizing Test for Narrow Table

Data sample:
Table name: p_narrow
Col. count: 11
Row count: 500 million
Space occupied when saved as text: 120.6 GB
Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 string,col2 int,col3 int,col4 int,col5 string
Test case:
1. 1 col. to group & 1 col. to summarize
Hive: select personid%10000, sum(col3) from p_narrow group by personid%10000
esProc: The code falls into three parts: the program for the summary machine, the main program for the node machine, and the subprogram for the node machine. (The three esProc scripts appeared here as screenshots.)

Impala: select personid%10000, sum(col3) from p_narrow group by personid%10000
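
The three-part esProc structure amounts to a two-level aggregation: each node machine groups its own data segment, and the summary machine merges the per-node partial results. A single-machine Python sketch of that flow for this test case (the toy data and field layout are assumptions):

from collections import defaultdict

def node_group(rows):
    # node machine: partial sum of col3 per personid%10000 over one segment
    acc = defaultdict(int)
    for personid, col3 in rows:
        acc[personid % 10000] += col3
    return acc

def summarize(partials):
    # summary machine: merge the per-node partial sums
    total = defaultdict(int)
    for part in partials:
        for key, s in part.items():
            total[key] += s
    return total

segments = [[(1, 10), (10001, 5)], [(1, 7)]]             # (personid, col3) rows per node
print(dict(summarize(node_group(s) for s in segments)))  # {1: 22}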

2. 1 col. to group & 4 col. to summarize

Hive: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid; count(col1):col1count, max(col2):col2max, sum(col3):col3sum, count(col5):col5count)
Impala: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10
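
Note how each aggregate is merged at the node and summary levels: partial counts are combined with sum, partial maxima with max, and partial sums with sum. A toy Python check of these merge rules (the numbers are made up):

# Two nodes' partial aggregates for the same group: (col1count, col2max, col3sum, col5count)
node1 = (100, 42, 5500, 98)
node2 = (200, 57, 9100, 195)

merged = (node1[0] + node2[0],      # count of col1: sum the partial counts
          max(node1[1], node2[1]),  # max of col2: max of the partial maxima
          node1[2] + node2[2],      # sum of col3: sum the partial sums
          node1[3] + node2[3])      # count of col5: sum the partial counts
print(merged)                       # (300, 57, 14600, 293)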

3. 4 col. to group & 1 col. to summarize

Hive: select personid%10, cityid%10, birthday%10, col4%10, sum(col3) from p_narrow group by personid%10,cityid%10,birthday%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col3sum):col3sum)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col3sum):col3sum)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid, cityid%10:cityid, birthday%10:birthday, col4%10:col4; sum(col3):col3sum)
Impala: select personid%10, cityid%10, birthday%10, col4%10, sum(col3) from p_narrow group by personid%10,cityid%10,birthday%10,col4%10

4. 4 col. to group & 4 col. to summarize

Hive: select personid%10, cityid%10, birthday%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10,cityid%10,birthday%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid, cityid%10:cityid, birthday%10:birthday, col4%10:col4; count(col1):col1count, max(col2):col2max, sum(col3):col3sum, count(col5):col5count)
Impala: select personid%10, cityid%10, birthday%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p_narrow group by personid%10,cityid%10,birthday%10,col4%10
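
With four grouping columns, the grouping key simply becomes a composite of the four expressions. In a Python sketch the key is a tuple (the single toy row is an assumption):

from collections import defaultdict

acc = defaultdict(lambda: [0, None, 0, 0])  # col1count, col2max, col3sum, col5count
rows = [(3, 14, 1987, 6, 1, 9, 20, "x")]    # personid, cityid, birthday, col4, col1, col2, col3, col5
for personid, cityid, birthday, col4, col1, col2, col3, col5 in rows:
    key = (personid % 10, cityid % 10, birthday % 10, col4 % 10)
    a = acc[key]
    a[0] += 1                               # count(col1)
    a[1] = col2 if a[1] is None else max(a[1], col2)
    a[2] += col3                            # sum(col3)
    a[3] += 1                               # count(col5)
print(dict(acc))                            # {(3, 4, 7, 6): [1, 9, 20, 1]}
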
Test results:
(The results table appeared here as an image.)

Grouping and Summarizing Test for Wide Table

Data sample:
Table name: p
Col. count: 106
Row count: 60 million
Space occupied when saved as text: 127.9 GB
Data structure: personid int,name string,sex int,cityid int,birthday int,degree int,col1 int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,col24 int,col25 int,col26 int,col27 int,col28 int,col29 int,col30 int,col31 int,col32 int,col33 int,col34 int,col35 int,col36 int,col37 int,col38 int,col39 int,col40 int,col41 int,col42 int,col43 int,col44 int,col45 int,col46 int,col47 int,col48 int,col49 int,col50 int,col51 int,col52 int,col53 int,col54 int,col55 int,col56 int,col57 int,col58 int,col59 int,col60 int,col61 int,col62 int,col63 int,col64 int,col65 int,col66 int,col67 int,col68 int,col69 int,col70 int,col71 int,col72 int,col73 int,col74 int,col75 int,col76 int,col77 int,col78 int,col79 int,col80 int,col81 int,col82 int,col83 int,col84 string,col85 string,col86 string,col87 string,col88 string,col89 string,col90 string,col91 string,col92 string,col93 string,col94 string,col95 string,col96 string,col97 string,col98 string,col99 string,col100 string
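
A structure this wide is tedious to write out; a throwaway Python line can generate the col1..col100 part of the definition (following the types listed above: col1-col83 int, col84-col100 string):

cols = ["col%d int" % i for i in range(1, 84)] + ["col%d string" % i for i in range(84, 101)]
print("personid int,name string,sex int,cityid int,birthday int,degree int," + ",".join(cols))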

Test case:
1. 1 col. to group & 1 col. to summarize
Hive: select personid%10000, sum(col3) from p group by personid%10000
esProc: The code can be divided into three parts: the program for the summary machine, the main program for the node machine, and the subprogram for the node machine. (The three esProc scripts appeared here as screenshots.)

Impala: select personid%10000, sum(col3) from p group by personid%10000

2. 1 col. to group & 4 col. to summarize

Hive: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid; count(col1):col1count, max(col2):col2max, sum(col3):col3sum, count(col5):col5count)
Impala: select personid%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10

3. 4 col. to group & 1 col. to summarize

Hive: select personid%10, cityid%10, birthday%10, col4%10, sum(col3) from p group by personid%10,cityid%10,birthday%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col3sum):col3sum)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col3sum):col3sum)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid, cityid%10:cityid, birthday%10:birthday, col4%10:col4; sum(col3):col3sum)
Impala: select personid%10, cityid%10, birthday%10, col4%10, sum(col3) from p group by personid%10,cityid%10,birthday%10,col4%10

4. 4 col. to group & 4 col. to summarize

Hive: select personid%10, cityid%10, birthday%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10,cityid%10,birthday%10,col4%10
esProc: The program for summary machine in cell A4 is changed to:
=A3.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The main program for node machine in cell A5 is changed to:
=A4.groups(personid:personid, cityid:cityid, birthday:birthday, col4:col4; sum(col1count):col1count, max(col2max):col2max, sum(col3sum):col3sum, sum(col5count):col5count)
The subprogram for node machine in cell A1 is changed to:
=cursor.groups(personid%10:personid, cityid%10:cityid, birthday%10:birthday, col4%10:col4; count(col1):col1count, max(col2):col2max, sum(col3):col3sum, count(col5):col5count)
Impala: select personid%10, cityid%10, birthday%10, col4%10, count(col1), max(col2), sum(col3), count(col5) from p group by personid%10,cityid%10,birthday%10,col4%10
Test results:
(The results table appeared here as an image.)


Performance testing and comparison of the join computing will be discussed in the next article: Performance Comparison Testing of Hive, esProc, and Impala Part 2.

Personal blog: http://www.datakeyword.blogspot.com/
Web: http://www.raqsoft.com/product-esproc

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.
