Hadoop / HDFS
Run the following on all nodes in the cluster (NameNode + DataNodes)
# install java
allnodes$ sudo apt-get update
allnodes$ sudo apt-get install openjdk-7-jdk

# install hadoop
allnodes$ wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads
allnodes$ sudo tar zxvf ~/Downloads/hadoop-* -C /usr/local
allnodes$ sudo mv /usr/local/hadoop-* /usr/local/hadoop

# create a profile for the environment variables
allnodes$ vi ~/.profile
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

# load these environment variables by sourcing the profile
allnodes$ . ~/.profile

# hdfs configuration for all name and data nodes
allnodes$ sudo vim $HADOOP_CONF_DIR/hadoop-env.sh
export JAVA_HOME=/usr

allnodes$ sudo vim $HADOOP_CONF_DIR/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_public_dns:9000</value>
  </property>
</configuration>

allnodes$ sudo vim $HADOOP_CONF_DIR/yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  ...
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode_public_dns</value>
  </property>
</configuration>

allnodes$ sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
allnodes$ sudo vim $HADOOP_CONF_DIR/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>namenode_public_dns:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
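If you want a quick sanity check that the install and the environment variables took effect on each node, something like the following should work (hadoop version simply prints the installed release; note that the mirror URL above may go stale, in which case substitute a current Apache mirror):

# sanity check: the hadoop binaries and environment variables should be visible
allnodes$ hadoop version
allnodes$ echo $HADOOP_HOME
allnodes$ echo $HADOOP_CONF_DIR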
Run the following on the NameNode only
# abstract out the physical IPs with hostnames (so you only need to change this file when cluster membership changes)
namenode$ sudo vi /etc/hosts
127.0.0.1 localhost
namenode_public_dns namenode_hostname
datanode1_public_dns datanode1_hostname
datanode2_public_dns datanode2_hostname
datanode3_public_dns datanode3_hostname

# create namenode data dir
namenode$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode

# configure replication factor and location of namenode data
namenode$ sudo vi $HADOOP_CONF_DIR/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>

# create a masters file and specify the NameNode
namenode$ sudo vi $HADOOP_CONF_DIR/masters
namenode_hostname

# list out all slaves and specify the DataNodes
namenode$ sudo vi $HADOOP_CONF_DIR/slaves
datanode1_hostname
datanode2_hostname
datanode3_hostname

# give ownership to ubuntu
namenode$ sudo chown -R ubuntu $HADOOP_HOME
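Before moving on, it is worth confirming that the hostnames you put in /etc/hosts actually resolve from the NameNode. A quick check using the same placeholder hostnames from above might look like this:

# sanity check: each DataNode hostname should resolve to the address in /etc/hosts
namenode$ ping -c 1 datanode1_hostname
namenode$ ping -c 1 datanode2_hostname
namenode$ ping -c 1 datanode3_hostname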
Run the following on all DataNodes only
# create datanode data dir
datanodes$ sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode

# specify replication factor and location of datanode data dir
datanodes$ sudo vi $HADOOP_CONF_DIR/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>

# give ownership to ubuntu
datanodes$ sudo chown -R ubuntu $HADOOP_HOME
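As a final check on each DataNode, you can confirm the data directory exists and is owned by ubuntu (the DataNode process will refuse to write to it otherwise); something like:

# sanity check: the data dir should exist and be owned by ubuntu
datanodes$ ls -ld $HADOOP_HOME/hadoop_data/hdfs/datanode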
Start Hadoop Cluster
# format the namenode (all of the data previously on it is lost)
namenode$ hdfs namenode -format

# start HDFS (the NameNode, SecondaryNameNode and all DataNodes)
namenode$ $HADOOP_HOME/sbin/start-dfs.sh
# establish connections to the datanodes (answer yes to all prompts and press enter)

# check that all DataNodes are online; you should see 3 live nodes (i.e. the data nodes) at
http://namenode_public_dns:50070

# start up yarn
namenode$ $HADOOP_HOME/sbin/start-yarn.sh

# start up the mapreduce JobHistory server
namenode$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

# use jps to check that the processes on each node look like the following:
namenode$ jps
21817 JobHistoryServer
21853 Jps
21376 SecondaryNameNode
21540 ResourceManager
21157 NameNode

datanodes$ jps
20936 NodeManager
20792 DataNode
21036 Jps

# while running any MapReduce job you can monitor each job at
http://namenode_public_dns:8088

# see the job history at
http://namenode_public_dns:19888
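You can also verify the cluster from the command line and run one of the example jobs that ships with Hadoop to exercise HDFS and YARN end to end. The sketch below assumes the standard 2.7.1 tarball layout for the examples jar; adjust the jar name if your version differs:

# report live DataNodes from the command line (should list 3 live datanodes)
namenode$ hdfs dfsadmin -report

# run the bundled pi estimator as a smoke test for YARN and MapReduce
namenode$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 100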
Work with HDFS
You’re now ready to start working with HDFS by SSH’ing into the NameNode. The most common commands are very similar to their Linux file system counterparts, except that they are prefixed with hdfs dfs. Below are some common commands and a few examples to get used to HDFS.
# list all files and folders in a directory
namenode$ hdfs dfs -ls <folder name>

# make a directory
namenode$ hdfs dfs -mkdir <folder name>

# copy a file from the local machine (namenode) into HDFS
namenode$ hdfs dfs -copyFromLocal <local folder or file name> <hdfs directory>

# delete a file on HDFS
namenode$ hdfs dfs -rm <file name>

# delete a directory on HDFS
namenode$ hdfs dfs -rmdir <folder name>

# create a file with content
namenode$ echo 'Hello this will be my first distributed and fault-tolerant data set!' >> my_file.txt
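Putting a few of these together, a short end-to-end example might look like the following. The /user/ubuntu/my_data directory and the use of hdfs dfs -cat to read the file back are illustrative choices, not part of the steps above:

# create a directory in HDFS, load the local file into it, and read it back
namenode$ hdfs dfs -mkdir -p /user/ubuntu/my_data
namenode$ hdfs dfs -copyFromLocal my_file.txt /user/ubuntu/my_data
namenode$ hdfs dfs -ls /user/ubuntu/my_data
namenode$ hdfs dfs -cat /user/ubuntu/my_data/my_file.txt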