Mac OS X
brew install hadoop
Linux
$ cd /usr/local
$ wget http://ftp.wayne.edu/apache/hadoop/core/hadoop-2.7.0/hadoop-2.7.0.tar.gz
$ sudo tar xzf hadoop-2.7.0.tar.gz
$ sudo mv hadoop-2.7.0 hadoop
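Hadoop also needs a JDK; if JAVA_HOME is not already set, export it in your shell profile. The path below is only an example, use whatever your JDK install actually is:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64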
Edit hadoop-env.sh
The file can be found at
/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hadoop-env.sh (on OS X with Hadoop 2.7.0)
Open the file and change the extra Java runtime options from
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
to
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Edit core-site.xml
The file can be found at
/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/core-site.xml (in case of OS X and hadoop version 2.7.0)
Add the following site-specific property overrides in this file:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
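If the hadoop.tmp.dir directory does not exist yet, you may need to create it first (adjust the path if you used a different value above):
mkdir -p /usr/local/Cellar/hadoop/hdfs/tmp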
Edit mapred-site.xml
The file can be located at
/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/mapred-site.xml
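Depending on the install, this file may not exist yet and only mapred-site.xml.template is shipped; if that is the case, create it from the template first:
cp /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/mapred-site.xml.template /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/mapred-site.xml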
Open it and add the following property:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>
Edit hdfs-site.xml
The file is located at
/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/hdfs-site.xml
Open it and add the following property:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
The Hadoop installation now provides the following two shell scripts to start Hadoop with
/usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh
/usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh
And stop it with
/usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh
/usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh
To make life easier, add the following to your bash profile (or ~/.zshrc if you are using zsh)
alias hstart="/usr/local/Cellar/hadoop/2.7.0/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.7.0/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.0/sbin/stop-dfs.sh"
Then source the file to update your current terminal session, and you will be able to start Hadoop with hstart and stop it with hstop.
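For example, assuming the aliases went into ~/.bash_profile:
source ~/.bash_profile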
Before running Hadoop for the first time, format the Hadoop file system (HDFS) with the following command.
hdfs namenode -format
Make sure you have already generated SSH keys. If not, you can generate them with:
ssh-keygen -t rsa
Then allow remote login: System Preferences ~> Sharing, check Remote Login.
To allow your system to accept the login, make it aware of the key that will be used. This makes SSH passwordless, which Hadoop needs because the master node and the slave nodes communicate constantly and that communication has to be seamless.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
You should now be able to run ssh localhost without being asked for a password.
Start Hadoop with the hstart alias created earlier.
Now you can start a map reduce job like the one below:
hadoop jar alsobought-prediction-hybrid/target/scala-2.10/AlsoBought-assembly-1.0.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
Write your map reduce program.
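The actual AlsoBoughtPrediction program is not shown in this post, so as a stand-in here is a minimal word-count sketch in Scala against Hadoop's mapreduce API (the class and object names here are mine, not from the post). The shape is the same: a mapper, a reducer, and a main entry point whose class name is what you pass to hadoop jar.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Emits (word, 1) for every whitespace-separated token in the input line
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Sums the counts emitted for each word
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // e.g. /user/kxhitiz/input
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // e.g. /user/kxhitiz/output
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}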
Once the program is written, compile and create the jar file.
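The jar path above (target/scala-2.10/AlsoBought-assembly-1.0.jar) suggests an sbt project with the sbt-assembly plugin; assuming that is the setup, building the fat jar is a single command:
sbt assembly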
Now it's time to start the job. If you have a separate user for the Hadoop installation (which some people prefer), you can use secure copy (scp) to send the jar to that user's directory and run the Hadoop job there,
or you can simply run it as the current user.
After that, you can start the job with the following command:
hadoop jar [yourjarfile] [mainclass] [inputpath] [outputpath]
But before that, make sure the input file is already in the HDFS cluster
hdfs dfs -put file02 /user/kxhitiz/input/
and that the output directory does not already exist (Hadoop will refuse to run the job if it does).
hdfs dfs -rm -r /user/kxhitiz/output
And then run the map reduce job
hadoop jar AlsoBought-Hybrid-assembly.jar AlsoBoughtPrediction /user/kxhitiz/input /user/kxhitiz/output
The output can then be viewed with
hdfs dfs -cat /user/kxhitiz/output/part-r-00000
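If you are not sure which part files were produced, list the output directory first:
hdfs dfs -ls /user/kxhitiz/output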
The code is hosted on GitHub. Get it here
You might run into a situation where you just want to dump specific collections from MongoDB and restore them.
Mongo provides the mongodump and mongorestore commands, but the limitation is that you either dump the whole database or a single collection at a time.
I ran into a similar situation, and here is what I did in my case.
I just created a simple method to restore the collections and ran it in the rails console.
# Collections to restore and the directory that holds the dumped .bson files
collections = %w(workers profiles accounts)
path = "path_to_dumped_data"

def restore_collections(collections, path)
  collections.each do |collection|
    # Restore each collection's .bson dump into my_database_name
    command = "mongorestore --collection #{collection} --db my_database_name #{path}/#{collection}.bson"
    system(command)
  end
end

restore_collections(collections, path)
I did the same thing with mongodump to dump the data before restoring it. :)
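For reference, the dump side boils down to one mongodump call per collection (the database and path names below are just the placeholders used above):
mongodump --db my_database_name --collection workers --out path_to_dumped_data
Note that mongodump writes the .bson files into a subdirectory named after the database (e.g. path_to_dumped_data/my_database_name/workers.bson), so point the restore method's path at that subdirectory.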
Creating backup dump:
pg_dump faves_development -U faves -h localhost -F c > db_backup.dump
The dump will be written to the db_backup.dump file.
Restore db from dump file:
pg_restore -d faves_development -h localhost -U faves db_backup.dump
The backup in db_backup.dump will be restored into the faves_development database. (pg_restore is used here because the dump was created in the custom format with -F c; a plain SQL dump could instead be piped to psql.)
A situation might arise where you don't want a password prompt. For example, you might want to use crontab for regular backups and therefore need the password to be supplied without interaction.
One of many ways to accomplish this is to create a .pgpass file in your home directory and provide the password in it. The file should contain one line per connection in the following format:
hostname:port:database:username:password
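For example, with hypothetical values matching the commands above (5432 is the default PostgreSQL port, and the password is a placeholder):
localhost:5432:faves_development:faves:your_password_here
Also note that on Unix PostgreSQL ignores .pgpass unless it is readable only by you, so restrict its permissions:
chmod 0600 ~/.pgpass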