Big Data and Analytics -- Apache Pig

Big Data Analytics Introduction to Pig

1    Apache Pig

In a previous post, we started talking about big data and analytics, beginning with Hadoop installation. In this post we cover Apache Pig: how to install and configure it, followed by two use cases (the Airline and Movies data sets).

1.1    Overview

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
In this section, we will walk through the Pig framework, its installation and configuration, and then apply it to some use cases.

1.2    Tools and Versions

-          Apache Pig 0.13.0
-          Ubuntu 14.04 LTS
-          Java 1.7.0_65 (java-7-openjdk-amd64)
-          Hadoop 2.5.1

1.3    Installation and Configurations

1-      Download Pig tar file using:
              wget http://apache.mirrors.pair.com/pig/latest/pig-0.13.0.tar.gz

2-      Untar the file and move its contents to /usr/local/pig
               tar -xzvf pig-0.13.0.tar.gz
               mv pig-0.13.0/ /usr/local/pig
3-      Edit the bashrc file and add pig to the system path


4-      Now we can run the Pig help command to list the available options:
               pig -help
This prints Pig's usage information and the supported command-line options.




5-      To start the Pig Grunt shell, run the pig command:
             pig
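
For step 3 above, the lines added to ~/.bashrc might look like the following sketch. It assumes Pig was moved to /usr/local/pig as in step 2; the PIG_HOME variable name is a common convention and an assumption here, not something the original post shows.

```shell
# Lines to append to ~/.bashrc.
# PIG_HOME is an assumed variable name pointing at the directory from step 2.
export PIG_HOME=/usr/local/pig
# Add Pig's bin directory to the search path so the pig command can be found.
export PATH="$PATH:$PIG_HOME/bin"
```

After saving the file, run source ~/.bashrc (or open a new terminal) so that the change takes effect in the current shell.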





1.4    Pig Running Modes:

Pig has two run modes, or exectypes:
-  Local Mode: Pig runs on a single machine against the local file system. Start it with the following command:

              $ pig -x local
              e.g. $ pig -x local test.pig

-  Mapreduce Mode: To run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. Mapreduce is the default mode, so it can be started with either of the following:

              $ pig
              or
              $ pig -x mapreduce
              e.g. $ pig test.pig
              or
              $ pig -x mapreduce test.pig

Using either mode, we can run the Grunt shell, Pig scripts, or embedded programs.

1.5    Use Case 1 (Airline DataSet)

In this use case, we use the airline dataset available at this location: http://stat-computing.org/dataexpo/2009/the-data.html
In particular, we use the flight data for the year 1987.
Here is the Pig script used to load the dataset file and compute the total miles travelled:

-- Load the 1987 flight records; only Distance is given an explicit type.
records = LOAD '1987.csv' USING PigStorage(',') AS
                (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);

-- Collect all records into a single group so they can be aggregated together.
mileage_recs = GROUP records ALL;

-- Sum the Distance column over the whole group.
tot_miles = FOREACH mileage_recs GENERATE SUM(records.Distance);

-- The output path must be a quoted string.
STORE tot_miles INTO '/home/mohamed/totalmiles';

Running this script in local mode against the 1987.csv dataset writes the total miles to the file part-r-00000 in the output directory named in the STORE statement.
The result: 775009272
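
As a sanity check, the same total can be computed outside Pig. The sketch below is not part of the original post; it assumes 1987.csv is in the current directory and that Distance is the 19th comma-separated field, as in the schema of the script above. The total_miles function name is hypothetical.

```shell
# Sum the Distance column (field 19 in the schema above) of an airline CSV file.
# Non-numeric values such as the header text or "NA" are skipped, much as Pig's
# SUM ignores values that fail the cast to int.
total_miles() {
    awk -F, '$19 ~ /^[0-9]+$/ { sum += $19 } END { printf "%.0f\n", sum }' "$1"
}

# Print the total for 1987.csv if it is in the current directory.
if [ -f 1987.csv ]; then
    total_miles 1987.csv
fi
```

If the input file is the same one the Pig script read, the printed number should agree with the value Pig stored in part-r-00000.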


To run the same example in MapReduce mode, we first need to copy the airline dataset to an HDFS directory for Pig to process:

hdfs dfs -copyFromLocal /home/mohamed/1987.csv /user/mohamed/1987.csv
















1.6    Use Case 2 (Movies DataSet)

In this case I used the movies dataset downloaded from this location:
In the Grunt shell we enter the following commands:
        grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
        grunt> DUMP movies;
DUMP prints each record of the movies relation to the console as a tuple.



       grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

       Running DUMP on movies_greater_than_four displays only the movies with a rating above 4.0.



And we can also try the following:
             grunt> sample_10_percent = sample movies 0.1;
             grunt> dump sample_10_percent;
The SAMPLE operator returns a random sample of the data (in this case roughly 10%).



We can also run various other operators over this dataset, such as ORDER BY, DISTINCT, and GROUP.
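
A minimal sketch of those three operators follows, written as a script file and run in local mode. The file name movies_ops.pig, the relation names, and the field types are assumptions for illustration; the original LOAD statement leaves the fields untyped.

```shell
# Write a small Pig script (movies_ops.pig is a hypothetical file name).
cat > movies_ops.pig <<'EOF'
-- Field types are assumptions; the original LOAD statement leaves them untyped.
movies = LOAD 'movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:float, duration:int);

-- ORDER BY: sort movies from newest to oldest.
movies_by_year = ORDER movies BY year DESC;

-- DISTINCT: the set of release years that appear in the data.
years = FOREACH movies GENERATE year;
distinct_years = DISTINCT years;

-- GROUP: count the movies released in each year.
grouped_by_year = GROUP movies BY year;
count_by_year = FOREACH grouped_by_year GENERATE group AS year, COUNT(movies) AS num_movies;

DUMP count_by_year;
EOF

# Run in local mode only if Pig and the dataset are both available.
if command -v pig >/dev/null 2>&1 && [ -f movies_data.csv ]; then
    pig -x local movies_ops.pig
fi
```

The same statements can also be typed one at a time at the grunt> prompt; DUMP movies_by_year or DUMP distinct_years would show the other two results.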

One last command that I tried is ILLUSTRATE, which shows the step-by-step execution of a sequence of statements; here is an example:

          grunt> ILLUSTRATE movies_greater_than_four;





1.7    Issues and problems:


-  The output path given to the STORE statement must be a quoted string, unlike the unquoted form shown in the book; otherwise the Pig Latin compiler complains.

-  The Pig Grunt shell did not work even after I ran the export PATH command, but it worked once I added the path to the bashrc file and sourced it.

      We have reached the end of our second post on big data and analytics. I hope you enjoyed reading it and experimenting with Pig installation and configuration. The next post will be about Apache HBase.





More Stories By Mohamed El-Refaey

I work as head of research and development at EDC (Egypt Development Center), a member of NTG. I previously worked for Qlayer (acquired by Sun Microsystems), where my passion for the cloud computing domain started. I have more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, banking applications, financial markets, Java and J2EE, HIPAA, SOX, BPEL and SOA, and for the last two years I have focused on virtualization technology and cloud computing through studies, technical papers and research, and participation in international groups and events. I've been awarded in recognition of innovation and thought leadership while working as an IT Specialist at EDS (an HP Company). I am also a member of the Cloud Computing Interoperability Forum (CCIF) and of the UCI (Unified Cloud Interface) open source project, where I contributed to the project architecture.
