Finally Spark 2 runs on CDH 5.15! :)



These are the things that I needed to set up or troubleshoot before it got running:

Upgrade Cloudera Manager 

Upgrade it to the latest version by following the instructions on this page:
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_ag_upgrading_cm.html

Note: to get Cloudera Manager Express running on a VM with limited memory, add --force:
sudo /home/cloudera/cloudera-manager --pause --express --force

The important steps are:
1) Download cloudera-manager.repo with wget https://archive.cloudera.com/cm5/redhat/6/x86_64/cm/cloudera-manager.repo
2) Put cloudera-manager.repo inside /etc/yum.repos.d/
3) Run these commands:
sudo yum clean all
sudo yum upgrade cloudera-manager-server cloudera-manager-daemons cloudera-manager-agent

Upgrade JDK 1.7 to 1.8

Download JDK 1.8 (or the latest JDK version) and extract it to a folder of your choice.
I put mine in /usr/java/jdk1.8
Set the JAVA_HOME path in .bashrc
Set the JAVA_HOME in Cloudera Manager->Hosts->Configuration->Java Home Directory
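
A minimal sketch of the .bashrc lines, assuming the JDK was extracted to /usr/java/jdk1.8 as above (adjust the path if yours differs). The PATH line is an optional extra so that the java command resolves to the new JDK:

```shell
# Lines to append to ~/.bashrc (path assumed from the step above)
export JAVA_HOME=/usr/java/jdk1.8
# Optional: put the new JDK's binaries first on the PATH
export PATH="$JAVA_HOME/bin:$PATH"
```

After opening a new shell (or running source ~/.bashrc), echo $JAVA_HOME should print the JDK folder.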

Upgrade CDH to latest version

Add this URL under Parcels->Configuration->Remote Parcel Repository URLs:
https://archive.cloudera.com/cdh5/parcels/latest/

Download Spark 2 Parcel

Add this URL under Parcels->Configuration->Remote Parcel Repository URLs:
https://archive.cloudera.com/spark2/parcels/latest/

Add the Spark2 CSD

Download http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.1.0.cloudera2.jar
Put it in /opt/cloudera/csd
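
The CSD step can be scripted roughly as below. This is only a sketch: the ownership, permission, and restart steps are not in my notes above but follow what Cloudera's CSD install instructions describe, and the function name install_spark2_csd is just a label made up for this example.

```shell
# Hypothetical helper: download the Spark 2 CSD and install it for Cloudera Manager.
install_spark2_csd() {
    csd_url="http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.1.0.cloudera2.jar"
    csd_dir="/opt/cloudera/csd"
    csd_jar="$csd_dir/$(basename "$csd_url")"

    # Download straight into the CSD directory
    sudo wget -P "$csd_dir" "$csd_url" || return 1
    # Assumption: ownership and mode as described in Cloudera's CSD instructions
    sudo chown cloudera-scm:cloudera-scm "$csd_jar"
    sudo chmod 644 "$csd_jar"
    # Cloudera Manager loads new CSDs when the server restarts
    sudo service cloudera-scm-server restart
}
```

Nothing runs until you call install_spark2_csd on the Cloudera Manager host.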

Distribute and activate the CDH and Spark 2 parcels

Go to Parcels
Click Distribute
Then click Activate

Adjust Java heap size (not sure if this was really necessary)

sudo nano /etc/default/cloudera-scm-server
In CMF_JAVA_OPTS, set the maximum heap size (-Xmx) to 4 GB instead of the default 2 GB:
export CMF_JAVA_OPTS="-Xmx4G -XX:MaxPermSize=512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
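
If you prefer not to edit the file by hand, the same change can probably be made with sed. A sketch, assuming the file already contains an -Xmx setting; set_cm_heap is a made-up helper name, and the sample line written to the scratch file is only an illustration, not a copy of the real /etc/default/cloudera-scm-server.

```shell
# Hypothetical helper: rewrite the -Xmx value in a config file, keeping a .bak copy.
set_cm_heap() {
    conf_file="$1"   # e.g. /etc/default/cloudera-scm-server
    heap_size="$2"   # e.g. 4G
    sed -i.bak "s/-Xmx[0-9][0-9]*[A-Za-z]*/-Xmx${heap_size}/" "$conf_file"
}

# Try it on a scratch file first (sample content, not the real config)
demo_conf="$(mktemp)"
echo 'export CMF_JAVA_OPTS="-Xmx2G -XX:+HeapDumpOnOutOfMemoryError"' > "$demo_conf"
set_cm_heap "$demo_conf" 4G
cat "$demo_conf"
```

For the real file the sed command needs root (for example, run it under sudo), and the Cloudera Manager server has to be restarted before the new heap size takes effect.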

Create directory for spark and give it write permission

sudo -u hdfs hdfs dfs -chmod 777 /user/spark

sudo -u hdfs hadoop fs -chmod 777 /user/spark
sudo -u hdfs hadoop fs -mkdir /user/spark/spark2ApplicationHistory
sudo -u hdfs hadoop fs -chmod 777 /user/spark/spark2ApplicationHistory

sudo -u spark hadoop fs -chmod 777 /user/spark/applicationHistory
sudo -u spark hadoop fs -chmod 777 /user/spark/spark2ApplicationHistory

https://community.cloudera.com/t5/Hadoop-101-Training-Quickstart/CDH-5-5-VirtualBox-unable-to-connect-to-Spark-Master-Worker/td-p/34491

Run pyspark2

pyspark2

textFile = spark.read.text("/loudacre/salesStaff.csv")
textFile.count()


