#--------------------------------Version bd-16s -------------------------
#---------------------------------------------------------
Course: Big Data Analytics
Instructor: Prof. Dr. Dr. Lars Schmidt-Thieme, Mohsan Jameel
Information Systems and Machine Learning Lab
University of Hildesheim
contact: mohsan.jameel@ismll.de

These guidelines describe how to install Hadoop as a standalone (single-node) setup on your laptop or a virtual machine.
Help: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

#-------------TOC-------------
A. INSTALLATION STEPS FOR HADOOP
B. EXECUTION STEPS
C. COMPILATION STEPS

#---------------------------------------------------------
A. INSTALLATION STEPS FOR HADOOP
#---------------------------------------------------------
1) Download Hadoop from the link below:
   http://ftp.fau.de/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

2) Prerequisites for running Hadoop:
   -> Make sure Java is installed on your machine, i.e. java and javac work on the command line.
      - if not installed:
        $sudo apt-get install openjdk-7-jdk
   -> The JAVA_HOME variable is set.
      - to check whether it is set, run:
        $echo $JAVA_HOME
      - if nothing is printed, you can set it in your .bashrc file using a text editor
        (the path to your Java installation may differ):
        $gedit ~/.bashrc
        and add:
        export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
        then reload it:
        $source ~/.bashrc
   -> Check that ssh is installed and passwordless authentication is set up.
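The three prerequisites above can be checked in one pass. A small sketch (my addition, not part of the original guide's commands; the BatchMode/ConnectTimeout flags are standard OpenSSH options used so a missing key is reported instead of a password prompt):

```shell
#!/bin/sh
# Sketch: verify the prerequisites from step 2 in one pass.
# Collects everything that is missing instead of stopping at the first problem.
MISSING=""

command -v java  >/dev/null 2>&1 || MISSING="$MISSING java"
command -v javac >/dev/null 2>&1 || MISSING="$MISSING javac"
[ -n "${JAVA_HOME:-}" ]          || MISSING="$MISSING JAVA_HOME"
# BatchMode makes ssh fail instead of prompting for a password, so a
# missing key is reported rather than hanging the script
ssh -o BatchMode=yes -o ConnectTimeout=2 localhost true >/dev/null 2>&1 \
                                 || MISSING="$MISSING passwordless-ssh"

if [ -z "$MISSING" ]; then
    echo "all prerequisites satisfied"
else
    echo "still missing:$MISSING"
fi
```

Anything the script lists under "still missing" can then be fixed with the corresponding step above.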
      $ssh localhost
      - if it asks for a password, follow http://www.linuxproblem.org/art_9.html OR run:
        $ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
        $cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
        $chmod 0600 ~/.ssh/authorized_keys

3) Extract hadoop-2.7.2.tar.gz into a directory, e.g. /home/<username>/hadoop:
   $tar -zxf hadoop-2.7.2.tar.gz

4) Change directory to hadoop-2.7.2.

5) Setting the Hadoop configuration files:
   a) add JAVA_HOME in etc/hadoop/hadoop-env.sh:
      $gedit etc/hadoop/hadoop-env.sh
      - add/update:
        # set to the root of your Java installation
        export JAVA_HOME=/usr/java/latest
      - check that hadoop is set up correctly:
        $bin/hadoop
   b) add the following to etc/hadoop/core-site.xml:
      $gedit etc/hadoop/core-site.xml
        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>
   c) add the following to etc/hadoop/hdfs-site.xml:
      $gedit etc/hadoop/hdfs-site.xml
        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>
   d) check whether the daemons are running (after starting them in section B):
      $jps
      output:
      9547 DataNode
      9388 NameNode
      9745 SecondaryNameNode
      16160 Jps

#---------------------------------------------------------
B. EXECUTION STEPS FOR HADOOP
#---------------------------------------------------------
1) Settings for executing a job:
   a) First you need to format the filesystem (doing this once is enough):
      $./bin/hdfs namenode -format
   b) Start the NameNode daemon and DataNode daemon:
      $./sbin/start-dfs.sh
      NameNode information is available at http://localhost:50070/
   c) Make the HDFS directories required to execute MapReduce jobs:
      $./bin/hdfs dfs -mkdir /user
      $./bin/hdfs dfs -mkdir /user/mohsan

2) Putting data files into HDFS:
   Whenever you want to run a job, you first need to put your data on HDFS. For that you can do the following
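The staging step can also be wrapped in a small re-runnable script. A sketch with names that are assumptions rather than fixed conventions: the DRY_RUN guard and run() helper are my additions, and mydata / input are example folder names:

```shell
#!/bin/sh
# Sketch: stage a local folder into HDFS before running a job (section B.2).
# With DRY_RUN=1 (the default here) every command is only printed; set
# DRY_RUN=0 inside a real hadoop-2.7.2 directory, with the daemons from
# step 1b running, to actually execute the transfer.
DRY_RUN="${DRY_RUN:-1}"
LOCAL_DIR="${LOCAL_DIR:-mydata}"   # example local folder, replace with your data

run() {
    echo "+ $*"                    # show each command before (possibly) running it
    [ "$DRY_RUN" = "1" ] || "$@"
}

run ./bin/hdfs dfs -mkdir -p input          # create the target folder in HDFS
run ./bin/hdfs dfs -put "$LOCAL_DIR" input  # copy the local folder into it
run ./bin/hdfs dfs -ls input                # verify the upload arrived
```

The dry-run default makes it safe to read the exact command sequence before touching a real cluster.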
   (for example, put the folder etc/hadoop into the input folder of HDFS):
   $./bin/hdfs dfs -put etc/hadoop input

3) Executing a job:
   a) To execute a job you need to provide the jar file with the class name, the input and output folders, and any additional parameters required by the program:
      $./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
   b) See the output:
      $./bin/hdfs dfs -cat output/*

4) WordCount example
   (tutorial: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html)
   $./bin/hdfs dfs -mkdir wordcountinput
   $./bin/hdfs dfs -put ../example/Hadoop-WordCount/input/ wordcountinput
   $./bin/hdfs dfs -ls wordcountinput
   $./bin/hadoop jar ../example/Hadoop-WordCount/wordcount.jar WordCount wordcountinput wordcountoutput
   $./bin/hdfs dfs -cat wordcountoutput/*

#---------------------------------------------------------
C. COMPILATION STEPS
#---------------------------------------------------------
1) Compiling Hadoop code on the command line (note that the version number in the jar file names may differ; change it according to your installed version):
   $javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.2.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar WordCount.java

2) Package the compiled classes into a jar file
   (https://docs.oracle.com/javase/tutorial/security/toolsign/step2.html):
   $jar -cvf wordcount.jar WordCount*.class
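The compile and package commands can be assembled with the version number factored out, which makes upgrading less error-prone. A sketch; the fallback to the current directory when HADOOP_HOME is unset is my addition, and WordCount.java / wordcount.jar follow the WordCount example from section B.4:

```shell
#!/bin/sh
# Sketch: build the javac classpath for section C with the Hadoop version
# factored into one variable, then print the compile and package commands.
HADOOP_VERSION="${HADOOP_VERSION:-2.7.2}"
HH="${HADOOP_HOME:-.}"   # falls back to the current directory if HADOOP_HOME is unset

CP="$HH/share/hadoop/common/hadoop-common-$HADOOP_VERSION.jar"
CP="$CP:$HH/share/hadoop/mapreduce/hadoop-mapreduce-client-core-$HADOOP_VERSION.jar"
CP="$CP:$HH/share/hadoop/common/lib/commons-cli-1.2.jar"

# printed rather than executed, so the commands can be inspected first
echo "javac -classpath $CP WordCount.java"
echo "jar -cvf wordcount.jar WordCount*.class"
```

Changing HADOOP_VERSION once updates both Hadoop jar names in the classpath.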