
Big Data and Analytics -- Apache Pig


1    Apache Pig

In a previous post, we started talking about big data and analytics, beginning with the Hadoop installation. In this post we will cover Apache Pig: how to install and configure it, and then illustrate two use cases (the Airline and Movies data sets).

1.1    Overview

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
In this section, we will walk through the Pig framework, its installation and configuration, and then apply some use cases.

1.2    Tools and Versions

-          Apache Pig 0.13.0
-          Ubuntu 14.04 LTS
-          Java 1.7.0_65 (java-7-openjdk-amd64)
-          Hadoop 2.5.1

1.3    Installation and Configurations

1-      Download Pig tar file using:
              wget http://apache.mirrors.pair.com/pig/latest/pig-0.13.0.tar.gz

2-      Untar the file and move its contents to /usr/local/pig
               tar -xzvf pig-0.13.0.tar.gz
               mv pig-0.13.0/ /usr/local/pig
3-      Edit the .bashrc file and add Pig to the system path (see the sketch below)
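A minimal sketch of the lines to append to ~/.bashrc, assuming Pig was moved to /usr/local/pig as in step 2 (adjust the path if yours differs):

               export PIG_HOME=/usr/local/pig
               export PATH=$PATH:$PIG_HOME/bin

After saving the file, run source ~/.bashrc so the change takes effect in the current shell.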

4-      Now, we can run the Pig help command to get a list of available commands:
               pig -help
The available options and commands will then be displayed.

5-      And to start the Pig Grunt shell, we run the pig command:
               $ pig
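Once the shell starts, Pig Latin statements and shell-like commands can be entered at the grunt> prompt. A minimal sketch (fs runs file system commands, and quit exits the shell):

               grunt> fs -ls /
               grunt> quit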

1.4    Pig Running Modes

Pig has two run modes or exectypes:
-          Local Mode: runs Pig on a single machine, using the local host and file system. It can be started using the following command:

              $ pig -x local
              e.g. $ pig -x local test.pig

-          Mapreduce Mode: to run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. It can be started using any of the following commands:
              $ pig
              $ pig -x mapreduce
              e.g. $ pig test.pig
              $ pig -x mapreduce test.pig

Using either mode, we can run the Grunt shell, Pig scripts, or embedded programs.

1.5    Use Case 1 (Airline DataSet)

In this use case, we used the airline dataset available at this location: http://stat-computing.org/dataexpo/2009/the-data.html, in particular the flight data for the year 1987.
Here is the Pig script used to load the dataset file and get the total miles travelled:

-- the AS clause lists the dataset's 29 documented columns; Distance is the one used below
records = LOAD '1987.csv' USING PigStorage(',') AS
    (Year:int, Month:int, DayofMonth:int, DayOfWeek:int, DepTime:int,
     CRSDepTime:int, ArrTime:int, CRSArrTime:int, UniqueCarrier:chararray,
     FlightNum:int, TailNum:chararray, ActualElapsedTime:int,
     CRSElapsedTime:int, AirTime:int, ArrDelay:int, DepDelay:int,
     Origin:chararray, Dest:chararray, Distance:int, TaxiIn:int,
     TaxiOut:int, Cancelled:int, CancellationCode:chararray, Diverted:int,
     CarrierDelay:int, WeatherDelay:int, NASDelay:int, SecurityDelay:int,
     LateAircraftDelay:int);

milage_recs = GROUP records ALL;

tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);

STORE tot_miles INTO '/home/mohamed/totalmiles';
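To run this script in local mode, save the statements above to a file; assuming the illustrative file name totalmiles.pig:

              $ pig -x local totalmiles.pig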
Pig local mode was executed for the year 1987 using the dataset 1987.csv, and the total miles output was written after the Pig script execution to the file part-r-00000 under the totalmiles directory.
The result: 775009272
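The result can also be read back from the command line (the path is the one given in the STORE statement):

              $ cat /home/mohamed/totalmiles/part-r-00000
              775009272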

The same example can also be run using MapReduce mode.

But first, we need to copy the airline dataset to the HDFS directory for Pig to process:

hdfs dfs -copyFromLocal /home/mohamed/1987.csv /user/mohamed/1987.csv
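With the file in HDFS, the script can be launched in mapreduce mode and the output read back from HDFS. A sketch, reusing the illustrative file name from above (note that in this mode the STORE path resolves inside HDFS rather than on the local file system):

              $ pig totalmiles.pig
              $ hdfs dfs -cat /home/mohamed/totalmiles/part-r-00000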

1.6    Use Case 2 (Movies DataSet)

In this case I used the movies dataset downloaded from this location:
In the Grunt shell we enter the following commands:
        grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
        grunt> DUMP movies;
The output will be a listing of all the tuples in the movies dataset.

       grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

       The DUMP of movies_greater_than_four will then show only the movies with a rating greater than 4.0.

And we can also try the following:
             grunt> sample_10_percent = sample movies 0.1;
             grunt> dump sample_10_percent;
The SAMPLE keyword is used to get a sample set from the data (in this case, 10%).

We can also run various other commands over this dataset, such as ORDER BY, DISTINCT, and GROUP; a few examples follow.
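A minimal sketch of such commands in the Grunt shell, using the movies relation loaded above (field names as declared in the LOAD statement):

             grunt> movies_by_year = GROUP movies BY year;
             grunt> count_by_year = FOREACH movies_by_year GENERATE group, COUNT(movies);
             grunt> movies_sorted = ORDER movies BY name ASC;
             grunt> distinct_years = DISTINCT (FOREACH movies GENERATE year);
             grunt> DUMP count_by_year;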

One last command that I tried is ILLUSTRATE. This command is used to view the step-by-step execution of a sequence of statements; here is an example:

          grunt> ILLUSTRATE movies_greater_than_four;

1.7    Issues and Problems

-          The path to the Pig script output file after the STORE element should be quoted as a string, not left unquoted as mentioned in the book, as the Pig Latin compiler complains otherwise.

-          Pig Grunt was not working even though I ran the export path command, but it worked well when I added the path to the .bashrc file and sourced it.

      We have reached the end of our second post on big data and analytics. We hope you enjoyed reading it and experimenting with the Pig installation and configuration. The next post will be about Apache HBase.


