Google failed me. When I was thinking about how to start off this blog post I ran the gamut of jokes, quips and amusing anecdotes, but in the end I felt it fitting to start with my disbelief. In my years in IT, Google has always had all the answers... mind you, not always EXACTLY how I've wanted them, aka: spending an afternoon modifying code, using some random error that was kind of like mine to fix an issue, or getting the answer "you can't do that, you shouldn't have taken on this project/problem in the first place." But this time, nothing. There was plenty of material about JDBC drivers, or about Hive with ODBC drivers outside of Amazon Web Services Elastic MapReduce, but nothing that worked for my instance and certainly nothing that was going to help me solve my problem. So I will add to the almighty Google and help it fill the void that I tragically found.
First and foremost, let me explain my situation a little to give you a general overview of what we are doing here. I have an Amazon Web Services (AWS) account and am using it to spin up Elastic MapReduce (EMR) instances. EMR is (according to Amazon):
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Basically, what all this means is that EMR is cloud Hadoop. If that doesn't mean anything to you then you probably aren't really in need of this post to begin with. My issue came when I was trying to use Microsoft Excel to attach to Hive (a component of the Hadoop installation) and view the tables, columns, etc. that were being spit out to me by my Hadoop processes. Mind you, this is an oversimplification of what is going on, but I'd rather get to the good stuff instead of spending all day explaining what very little I know about Hadoop and its processes. So without further ado, here is the process I used to get my ODBC driver set up with Hive on my AWS EMR instance. (That was a mouthful -ed.)
First we will need to download the ODBC driver that is available through AWS. This can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-bi-tools.html. Download the ODBC driver that matches your environment; in my case, http://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/HiveODBC.zip.
Once this has been downloaded we will install it on our machine. This is a pretty straightforward process; just make sure you choose the correct version of the product. Windows is my OS, and 64-bit is what my machine is running at home. From there it's: Next, Accept EULA, Next, Installer Location, Next, Install, Finish.
After the installation we will need to launch an EMR cluster; if one is already running, don't worry about this step. NOTE: Many ODBC connections to Hive call for a Thrift server. By default, if you have Hive installed on your EMR cluster, Thrift will be installed as well, so this point is moot.
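If you'd rather script the launch than click through the console, something like the following works with the AWS SDK for Python (boto3). This is just a minimal sketch: the region, cluster name, release label, instance types, and key pair name are all placeholder assumptions you'll need to swap for your own values.

```python
import boto3

# Region, names, release label, instance types, and key pair are
# placeholders -- swap in your own values.
emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="hive-odbc-demo",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-keypair",           # the key you'll use in PuTTY below
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after launch
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```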
Once we have made sure our new cluster is provisioned and running we will need to connect to it. We will start with the normal steps found on the AWS site for establishing SSH connections:
If using PuTTY, after selecting the Session option on the left, add the Host Name (hadoop@ec2-###-###-###-###.us-west-2.compute.amazonaws.com, replacing the hashes with the server's public IP) and the Port (22).
We will then select the plus next to SSH and expand it. From here we will go to Auth. Under "Private key file for authentication", add the key that is associated with the cluster we launched and added to our Host Name in the step above.
Next, click on "Tunnels" on the left hand side. This is where things deviate from the original instructions as outlined by Amazon. We will still add our 8157 tunnel, by default, for the connection to the web tools for our EMR instance, but we will add one additional tunnel. This tunnel will be for the connection we will be making with our ODBC driver to Hive on the machine. By default this tunnel needs to sit at port 10000, which is determined by the version of Hive installed on the EMR instance. For our purposes, we will select Local and Auto, which can be found BELOW "Destination". In the "Destination" field we will add the hostname of our cluster sans the hadoop@ that precedes it, then append :10000 to the end of the hostname, 10000 being the port we are forwarding (ec2-###-###-###-###.us-west-2.compute.amazonaws.com:10000). In "Source port" we will also add 10000. At the end of the day, prior to clicking "Add", it should all look something like this.
Once this is to your liking click "Add" and it will be displayed in the box above "Add new forwarded port:". For the sake of speeding these instructions up, we will also add our 8157 port with Dynamic and Auto selected, as per Amazon's original guidelines.
Once these steps have been completed, select Open to pull up your Hadoop SSH session.
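For what it's worth, the same port forwarding can be scripted if PuTTY isn't your thing. Here is a rough Python equivalent, assuming the third-party sshtunnel package (pip install sshtunnel); the hostname and key file name are the same placeholders as above.

```python
from sshtunnel import SSHTunnelForwarder

# Same placeholders as the PuTTY setup: the cluster's public hostname
# and the private key that matches it.
host = "ec2-###-###-###-###.us-west-2.compute.amazonaws.com"

tunnel = SSHTunnelForwarder(
    (host, 22),
    ssh_username="hadoop",
    ssh_pkey="my-cluster-key.pem",
    # Hive's Thrift service listens on port 10000 on the master node.
    remote_bind_address=("127.0.0.1", 10000),
    local_bind_address=("127.0.0.1", 10000),
)
tunnel.start()
print("Tunnel up on local port", tunnel.local_bind_port)
# ...connect your ODBC client while the tunnel is up...
tunnel.stop()
```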
We will now want to verify that we can reach the port that we just opened up through SSH to the server. The easiest way to do this is with a simple telnet command. So open up Command Prompt, or PowerShell if you're into that kind of thing, and type in: telnet ec2-###-###-###-###.us-west-2.compute.amazonaws.com 10000. If all goes well you should be presented with an empty prompt box... otherwise you will get an error and will need to check security settings on both your local computer's side and the AWS VPC side.
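If the telnet client isn't enabled on your machine (it's off by default on recent versions of Windows), a few lines of Python will perform the same check; the hostname below is the same placeholder as above.

```python
import socket

# Same placeholder hostname as the telnet command above.
host = "ec2-###-###-###-###.us-west-2.compute.amazonaws.com"

try:
    # Attempt a plain TCP connection to the Hive port, just like telnet does.
    with socket.create_connection((host, 10000), timeout=5):
        print("Port 10000 is reachable.")
except OSError as exc:
    print("Connection failed:", exc)
```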
Without keen insight into any one given person's environment, getting this telnet session to establish will differ from environment to environment. The best advice I can give you is to make sure your ports are allowed through Windows Firewall, your antivirus, AWS VPC security settings (inbound and outbound), and lastly your company's physical firewall settings.
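If the failure turns out to be on the AWS side, one common fix is opening the Hive port in the security group attached to the cluster's master node. A sketch with boto3 follows; the group ID and CIDR are placeholder assumptions, and you should scope the CIDR to your own public IP rather than opening the port to the world.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# The group ID and CIDR are placeholders: use the security group attached
# to your EMR master node, and scope the CIDR to your own public IP.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 10000,
        "ToPort": 10000,
        "IpRanges": [{"CidrIp": "203.0.113.25/32"}],
    }],
)
```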
Now that the basics are taken care of we can configure the ODBC driver. First, find your ODBC Data Sources (32-bit or 64-bit) program under Administrative Tools and open it. I will be using the 64-bit version of the tool since I have 64-bit Microsoft Excel that I will be accessing the ODBC connection with.
In the window presented we will select the "Amazon Hive ODBC Driver" option and click "Finish".
We will now start adding the information we know into the "Amazon Hive ODBC Driver DSN Setup" window. Our "Data Source Name" can be whatever we want, and the "Description" is likewise up to you. For the "Host" we will once again add our hostname (ec2-###-###-###-###.us-west-2.compute.amazonaws.com), and we will change the port to the one we forwarded for Hive, in our case 10000. For the database we will need to put in whatever database we have created or used (I'm just using default), and for "Hive Server Type" we will put in "Hive Server 1". NOTE: If this does not work, try using "Hive Server 2". I have had problems where "Hive Server 1" does not allow me to put in my user name and password.
Moving on to Authentication, our Mechanism will be user name and password. By default this will be set to "user name": emr, "password": emr.
Once these steps are completed click Test.
If everything is happy we should get back a "TESTS COMPLETED SUCCESSFULLY!" message. Select OK to close the test window and OK again to accept and save the changes to our DSN setup.
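If you want to sanity-check the DSN outside of the driver's own test dialog, Python's pyodbc module can open the same connection. A minimal sketch, assuming a DSN named "HiveDSN" (use whatever Data Source Name you entered above) and the default emr/emr credentials:

```python
import pyodbc

# "HiveDSN" is a placeholder for whatever Data Source Name you entered
# above; emr/emr are the default credentials mentioned earlier.
conn = pyodbc.connect("DSN=HiveDSN;UID=emr;PWD=emr", autocommit=True)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")  # simple HiveQL statement to prove the link works
for row in cursor.fetchall():
    print(row)
conn.close()
```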
With all these steps completed, it's a simple data source addition in Excel before you are enjoying your Hive data in your Microsoft Excel spreadsheet. You're welcome, Google.