codeCrunch

  • 6 Steps to Setup and Run Hadoop Job

    1. Download/Install Hadoop

    Mac OS X

    brew install hadoop
    

    Linux

    $ cd /usr/local
    $ wget http://ftp.wayne.edu/apache/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz
    $ sudo tar xzf hadoop-2.7.0.tar.gz
    $ sudo mv hadoop-2.7.0 hadoop
    

    2. Configure Hadoop


    Edit hadoop-env.sh

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hadoop-env.sh (on OS X with Hadoop 2.7.0)
    

    Open the file and change the extra Java runtime options from

    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
    

    to

    export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
    

    Edit core-site.xml

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/core-site.xml (on OS X with Hadoop 2.7.0)
    

    Add the following site-specific property overrides in this file:

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    

    Edit mapred-site.xml

    The file can be located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/mapred-site.xml
    

    Open it and add the following property:

    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9010</value>
        </property>
    </configuration>
    

    Edit hdfs-site.xml

    The file is located at

    /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hdfs-site.xml
    

    Open it and add the following property:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration> 
    

    The Hadoop installation now provides the following two shell scripts to start Hadoop:

    /usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh
    /usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh
    

    And stop it with

    /usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh
    /usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh
    

    To make life easier, add the following to your bash profile (or ~/.zshrc if you are using zsh):

    alias hstart="/usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh"
    alias hstop="/usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh"
    

    Then source the file (e.g. source ~/.bash_profile) to update your current terminal session. You will now be able to start Hadoop with hstart and stop it with hstop.

    Before running Hadoop for the first time, format the Hadoop file system (HDFS) with the following command:

    hdfs namenode -format
    

    SSH Localhost

    Make sure you have already generated SSH keys. If not, you can generate them with the command ssh-keygen -t rsa
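
    For a passwordless setup, the key pair can also be generated non-interactively (-P "" sets an empty passphrase; -f picks the output path — both are standard ssh-keygen options):

    ```shell
    # Create the .ssh directory if needed, then generate an RSA
    # key pair with an empty passphrase (no interactive prompts)
    mkdir -p ~/.ssh && chmod 700 ~/.ssh
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    ```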

    Enable Remote Login

    System Preferences ~> Sharing. Check Remote Login

    Authorize SSH Keys

    To allow your system to accept logins, it has to know which keys will be used. This makes SSH passwordless, which matters because Hadoop's master node communicates with the slave nodes over SSH and that communication needs to be seamless.

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    

    You should then be able to run ssh localhost without any password prompt.

    3. Running Hadoop

    Start Hadoop with the hstart alias created above.
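
    To check that the daemons actually came up, jps (shipped with the JDK) should list the HDFS and YARN processes:

    ```shell
    # After starting Hadoop, list the running Java processes
    jps
    # A healthy single-node setup shows NameNode, DataNode,
    # SecondaryNameNode, ResourceManager and NodeManager
    ```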

    Now you can start a MapReduce job like this:

    hadoop jar alsobought-prediction-hybrid/target/scala-2.10/AlsoBought-assembly-1.0.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
    

    4. Write Map Reduce Program

    Write your map reduce program.

    5. Compilation/Creating jar

    Once the program is written, compile and create the jar file.
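
    The jar name used later in this post (AlsoBought-assembly-1.0.jar under target/scala-2.10/) suggests a Scala project built with the sbt-assembly plugin; assuming that setup, building the fat jar is a single command:

    ```shell
    # From the project root: compile the sources and bundle
    # all dependencies into a single runnable "fat" jar
    sbt assembly
    ```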

    Now it’s time to start the job. If you have a separate user for the Hadoop installation (which some people prefer), you can use secure copy (scp) to send the jar to that user’s directory and run the Hadoop job there.

    Or you can simply run it as the current user.
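
    If you go the separate-user route, the copy might look like this (the hadoop@localhost user and the ~/jobs target directory are assumptions; adjust them to your setup):

    ```shell
    # Copy the assembled jar into the hadoop user's jobs directory
    scp target/scala-2.10/AlsoBought-assembly-1.0.jar hadoop@localhost:~/jobs/
    ```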

    6. Running hadoop job

    After that, you can start the job with the following command: hadoop jar [yourjarfile] [baseclass] [inputpath] [outputpath]

    But before that, make sure the input file is in the HDFS cluster:

    hdfs dfs -put file02 /user/kxhitiz/input/
    

    and that the output directory does not already exist:

    hdfs dfs -rm -r /user/kxhitiz/output
    

    And then run the MapReduce job:

    hadoop jar AlsoBought-Hybrid-assembly.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
    

    And the output can be viewed with:

    hdfs dfs -cat /user/kxhitiz/output/part-r-00000
    

    Sample Code Repository

    The code is hosted on GitHub. Get it here

    • 10 years ago
    • #hadoop
    • #big data
  • codeCrunch turned 2 today!

    • 12 years ago
    • 1 notes
    • #tumblr birthday
    • #tumblr milestone
  • Mongodb dump and restore specific collections

    So you might run into a situation where you just want to dump specific collections from MongoDB and restore them.

    Mongo provides the mongodump and mongorestore commands, but the limitation is that you either have to dump the whole database or a single collection at a time.

    I ran into just such a situation, and here is what I did in my case.

    I created a simple method to restore the collections and ran it in the Rails console.

    collections = %w(workers profiles accounts)
    path = "path_to_dumped_data"
    
    def restore_collections(collections, path)
      collections.each do |collection|
        # Restore each collection's BSON dump into my_database_name
        command = "mongorestore --collection #{collection} --db my_database_name #{path}/#{collection}.bson"
        system(command)
      end
    end
    
    restore_collections(collections, path)
    

    I did the same thing to dump the data before restoring it. :)
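
    That dump step can be sketched as a plain shell loop (mirroring the collection list and database name from the restore example above; note that mongodump writes each collection to &lt;out&gt;/&lt;db&gt;/&lt;collection&gt;.bson):

    ```shell
    # Dump only the listed collections from my_database_name
    for collection in workers profiles accounts; do
      mongodump --db my_database_name --collection "$collection" --out dump_dir
    done
    # Files end up at dump_dir/my_database_name/<collection>.bson
    ```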

    • 12 years ago
    • 1 notes
    • #Mongodb
    • #db_backup
    • #ruby
    • #rails
  • PostgreSQL - Backup & Restore database

    Creating backup dump:

    pg_dump faves_development -U faves -h localhost -F c > db_backup.dump
    
    • Here:
      • faves_development: db name
      • faves: username
      • localhost: hostname
      • -F c : this option specifies the format (custom)

    The db dump will be created in the db_backup.dump file

    Restore db from dump file:

    pg_restore -h localhost -U faves -d faves_development db_backup.dump
    

    The backup dump from db_backup.dump will be restored into the faves_development database. Note that a custom-format dump (created with -F c) has to be restored with pg_restore; plain psql < db_backup.dump only works for plain-text SQL dumps.

    A situation might arise where you don’t want a password prompt. For example, you might want to use crontab for regular backups and hence need to supply the password non-interactively.

    One of many ways to accomplish this is to create a .pgpass file in your home directory and provide the password there. The file should contain lines in the following format:

    hostname:port:database:username:password
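
    For example, with the values used earlier in this post ("secret" is a placeholder password). Note that the file is ignored unless its permissions are restricted to 0600:

    ```shell
    # Add one credential line for the faves database and lock down the file
    echo 'localhost:5432:faves_development:faves:secret' >> ~/.pgpass
    chmod 600 ~/.pgpass
    ```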
    
    • 12 years ago
    • #postgresql
    • #sql
    • #database
© 2011–2026 codeCrunch