codeCrunch

  • 6 Steps to Setup and Run Hadoop Job

    1. Download/Install Hadoop

    Mac OS X

    brew install hadoop
    

    Linux

    $ cd /usr/local
    $ wget http://ftp.wayne.edu/apache/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz
    $ sudo tar xzf hadoop-2.7.0.tar.gz
    $ sudo mv hadoop-2.7.0 hadoop
    

    2. Configure Hadoop


    Edit hadoop-env.sh

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hadoop-env.sh (on OS X with Hadoop 2.7.0)
    

    Open the file and change the extra Java runtime options from

    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
    

    to

    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
    

    Edit core-site.xml

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/core-site.xml (on OS X with Hadoop 2.7.0)
    

    Add the following site-specific property overrides in this file:

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    

    Edit mapred-site.xml

    The file can be located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/mapred-site.xml
    

    Open it and add the following property:

    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9010</value>
        </property>
    </configuration>
    

    Edit hdfs-site.xml

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hdfs-site.xml
    

    Open it and add the following property:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration> 
    

    The Hadoop installation now provides the following two shell scripts to start Hadoop:

    /usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh
    /usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh
    

    And stop it with

    /usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh
    /usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh
    

    To make life easier, add the following to your bash profile (or ~/.zshrc if you are using zsh):

    alias hstart="/usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh"
    alias hstop="/usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh"
    

    Then source the file (e.g. source ~/.bash_profile) to update your current terminal session. You will now be able to start Hadoop with hstart and stop it with hstop.

    Before running Hadoop for the first time, format the Hadoop file system (HDFS) with the following command:

    hdfs namenode -format
    

    SSH Localhost

    Make sure you have already generated SSH keys. If not, you can generate them with the command ssh-keygen -t rsa
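
    For a passwordless setup, the key pair can also be generated non-interactively (-P "" sets an empty passphrase; -f picks the output path — both are standard ssh-keygen options):

    ```shell
    # Create the .ssh directory if needed, then generate an RSA
    # key pair with an empty passphrase (no interactive prompts)
    mkdir -p ~/.ssh && chmod 700 ~/.ssh
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    ```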

    Enable Remote Login

    System Preferences ~> Sharing. Check Remote Login

    Authorize SSH Keys

    To allow your system to accept logins, it has to know which keys will be used. This makes SSH passwordless, which matters because Hadoop's master node communicates with the slave nodes over SSH and that communication needs to be seamless.

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    

    You should then be able to run ssh localhost without any password prompt.

    3. Running Hadoop

    Start Hadoop with the hstart alias created above.
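
    To check that the daemons actually came up, jps (shipped with the JDK) should list the HDFS and YARN processes:

    ```shell
    # After starting Hadoop, list the running Java processes
    jps
    # A healthy single-node setup shows NameNode, DataNode,
    # SecondaryNameNode, ResourceManager and NodeManager
    ```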

    Now you can start a MapReduce job like this:

    hadoop jar alsobought-prediction-hybrid/target/scala-2.10/AlsoBought-assembly-1.0.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
    

    4. Write Map Reduce Program

    Write your map reduce program.

    5. Compilation/Creating jar

    Once the program is written, compile and create the jar file.
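
    The jar name used later in this post (AlsoBought-assembly-1.0.jar under target/scala-2.10/) suggests a Scala project built with the sbt-assembly plugin; assuming that setup, building the fat jar is a single command:

    ```shell
    # From the project root: compile the sources and bundle
    # all dependencies into a single runnable "fat" jar
    sbt assembly
    ```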

    Now it’s time to start the job. If you have a separate user for the Hadoop installation (which some people prefer), you can use secure copy (scp) to send the jar to that user’s directory and run the Hadoop job there.

    Or you can simply run it as the current user.
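
    If you go the separate-user route, the copy might look like this (the hadoop@localhost user and the ~/jobs target directory are assumptions; adjust them to your setup):

    ```shell
    # Copy the assembled jar into the hadoop user's jobs directory
    scp target/scala-2.10/AlsoBought-assembly-1.0.jar hadoop@localhost:~/jobs/
    ```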

    6. Running hadoop job

    After that, you can start the job with the following command: hadoop jar [yourjarfile] [baseclass] [inputpath] [outputpath]

    But before that, make sure the input file is in the HDFS cluster:

    hdfs dfs -put file02 /user/kxhitiz/input/
    

    and that the output directory does not already exist:

    hdfs dfs -rm -r /user/kxhitiz/output
    

    And then run the MapReduce job:

    hadoop jar AlsoBought-Hybrid-assembly.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
    

    And the output can be viewed with:

    hdfs dfs -cat /user/kxhitiz/output/part-r-00000
    

    Sample Code Repository

    The code is hosted on GitHub. Get it here

    • 10 years ago
    • #hadoop
    • #big data
  • codeCrunch turned 2 today!

    • 12 years ago
    • 1 notes
    • #tumblr birthday
    • #tumblr milestone
  • Mongodb dump and restore specific collections

    So you might run into a situation where you just want to dump specific collections from MongoDB and restore them.

    Mongo provides the mongodump and mongorestore commands, but the limitation is that you either have to dump the whole database or a single collection at a time.

    I ran into just such a situation, and here is what I did in my case.

    I created a simple method to restore the collections and ran it in the Rails console.

    collections = %w(workers profiles accounts)
    path = "path_to_dumped_data"
    
    def restore_collections(collections, path)
      collections.each do |collection|
        # Restore each collection's BSON dump into my_database_name
        command = "mongorestore --collection #{collection} --db my_database_name #{path}/#{collection}.bson"
        system(command)
      end
    end
    
    restore_collections(collections, path)
    

    I did the same thing to dump the data before restoring it. :)
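
    That dump step can be sketched as a plain shell loop (mirroring the collection list and database name from the restore example above; note that mongodump writes each collection to &lt;out&gt;/&lt;db&gt;/&lt;collection&gt;.bson):

    ```shell
    # Dump only the listed collections from my_database_name
    for collection in workers profiles accounts; do
      mongodump --db my_database_name --collection "$collection" --out dump_dir
    done
    # Files end up at dump_dir/my_database_name/<collection>.bson
    ```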

    • 12 years ago
    • 1 notes
    • #Mongodb
    • #db_backup
    • #ruby
    • #rails
  • PostgreSQL - Backup & Restore database

    Creating backup dump:

    pg_dump faves_development -U faves -h localhost -F c > db_backup.dump
    
    • Here:
      • faves_development: db name
      • faves: username
      • localhost: hostname
      • -F c : this option specifies the format (custom)

    The db dump will be created in the db_backup.dump file

    Restore db from dump file:

    pg_restore -h localhost -U faves -d faves_development db_backup.dump
    

    The backup dump from db_backup.dump will be restored into the faves_development database. Note that a custom-format dump (created with -F c) has to be restored with pg_restore; plain psql < db_backup.dump only works for plain-text SQL dumps.

    A situation might arise where you don’t want a password prompt. For example, you might want to use crontab for regular backups and hence need to supply the password non-interactively.

    One of many ways to accomplish this is to create a .pgpass file in your home directory and provide the password there. The file should contain lines in the following format:

    hostname:port:database:username:password
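
    For example, with the values used earlier in this post ("secret" is a placeholder password). Note that the file is ignored unless its permissions are restricted to 0600:

    ```shell
    # Add one credential line for the faves database and lock down the file
    echo 'localhost:5432:faves_development:faves:secret' >> ~/.pgpass
    chmod 600 ~/.pgpass
    ```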
    
    • 12 years ago
    • #postgresql
    • #sql
    • #database
© 2011–2026 codeCrunch