Note. A bootstrap action script for EMR 4.x has been added. Check out the differences introduced in 4.x with the release of EMR 4.0 in July 2015.
Apache Tajo™, or simply “Tajo”, is an open-source relational and distributed big data warehouse (“Big DW”) system which runs on Apache Hadoop and other stores. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets stored on HDFS (Hadoop Distributed File System), Amazon S3 and other data sources.
Within the Amazon Web Services (AWS) cloud environment, Tajo runs on either Elastic MapReduce (EMR) with Hadoop or Elastic Compute Cloud (EC2) without Hadoop.
This post explains how to set up a Tajo cluster on AWS using EMR bootstrap actions. Source code, technical details and the latest updates can be found on this GitHub page.
Tajo bootstrap action scripts
Bootstrap actions are scripts that run on the cluster nodes when Amazon EMR launches the cluster. You can find pre-configured Tajo bootstrap action scripts at:
- s3://tajo-emr/emr-3.x/install-tajo.sh (for EMR 3.x)
- s3://tajo-emr/emr-4.x/install-tajo.py (for EMR 4.x)
These scripts accept various optional arguments for custom settings. See more details and sample commands on the GitHub page.
Quick start: Launching a Tajo Cluster with default configuration
Let’s begin with a simple command and then go into further detail. The command below launches a Tajo cluster with the default configuration. (To run this command on your desktop, you need to install and configure the AWS Command Line Interface tools on your system.)
* EMR 3.x
$ aws emr create-cluster \
  --name="CLUSTER_NAME" \
  --ami-version=3.8.0 \
  --no-auto-terminate \
  --use-default-roles \
  --ec2-attributes KeyName=KEY_PAIR_NAME \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=c3.xlarge \
  --bootstrap-action Name="Install tajo",Path=s3://tajo-emr/emr-3.x/install-tajo.sh,Args=["-t","https://dist.apache.org/repos/dist/release/tajo/tajo-0.11.1/tajo-0.11.1.tar.gz","-c","s3://tajo-emr/template/tajo-0.11.x/c3.4xlarge/conf"]
* EMR 4.x
$ aws emr create-cluster \
  --name="CLUSTER_NAME" \
  --release-label=emr-4.1.0 \
  --no-auto-terminate \
  --use-default-roles \
  --ec2-attributes KeyName=KEY_PAIR_NAME \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=c3.xlarge \
  --bootstrap-action Name="Install tajo",Path=s3://tajo-emr/emr-4.x/install-tajo.py,Args=["-t","https://dist.apache.org/repos/dist/release/tajo/tajo-0.11.1/tajo-0.11.1.tar.gz","-c","s3://tajo-emr/template/tajo-0.11.x/c3.4xlarge/conf"]
* Note for Seoul and Frankfurt regions
The Seoul (ap-northeast-2) and Frankfurt (eu-central-1) regions support S3 Signature Version 4 only. If you launch your instances in one of those regions, replace the S3 paths in the commands with paths in the bucket for the same region as your instances:
- (in Seoul region) Path=s3://tajo-emr-seoul/… and "-c","s3://tajo-emr-seoul/template/…"
- (in Frankfurt region) Path=s3://tajo-emr-frankfurt/… and "-c","s3://tajo-emr-frankfurt/template/…"
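The bucket substitution described in this note can be scripted. The following is a minimal sketch, not part of the Tajo tooling: the function names are made up for illustration, and it assumes the regional buckets mirror the layout of the default tajo-emr bucket, which the note implies but does not state.

```shell
#!/bin/sh
# Map an AWS region to the bucket that hosts the Tajo bootstrap files.
# Bucket names are taken from the note above; every other region
# falls back to the default "tajo-emr" bucket.
tajo_emr_bucket() {
  case "$1" in
    ap-northeast-2) echo "tajo-emr-seoul" ;;
    eu-central-1)   echo "tajo-emr-frankfurt" ;;
    *)              echo "tajo-emr" ;;
  esac
}

# Rewrite an s3://tajo-emr/... path so that it points at the bucket
# for the given region. Paths in other buckets are returned unchanged.
# Usage: regional_s3_path REGION S3_PATH
regional_s3_path() {
  region=$1
  path=$2
  case "$path" in
    s3://tajo-emr/*)
      echo "s3://$(tajo_emr_bucket "$region")/${path#s3://tajo-emr/}" ;;
    *)
      echo "$path" ;;
  esac
}
```

For example, regional_s3_path ap-northeast-2 s3://tajo-emr/emr-4.x/install-tajo.py prints the same path inside the tajo-emr-seoul bucket, which can then be used as the Path value of the bootstrap action.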
The Tajo cluster launches within several minutes. You can check the status of your cluster in the EMR console. When the status is WAITING, your cluster is up and ready.
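If you prefer the command line to the console, the same check can be done with the AWS CLI. This is a rough sketch under a few assumptions: the function names and the 30-second polling interval are arbitrary choices, and the cluster ID is the j-... identifier printed by create-cluster.

```shell
#!/bin/sh
# Succeed only for the state that this post describes as "up and ready".
is_cluster_ready() {
  [ "$1" = "WAITING" ]
}

# Poll the cluster state until it is ready or has reached a terminal state.
# Usage: wait_for_tajo_cluster CLUSTER_ID
wait_for_tajo_cluster() {
  cluster_id=$1
  while :; do
    # Ask EMR for the current state as plain text (e.g. STARTING, WAITING).
    state=$(aws emr describe-cluster --cluster-id "$cluster_id" \
      --query 'Cluster.Status.State' --output text) || return 1
    echo "cluster $cluster_id: $state"
    if is_cluster_ready "$state"; then
      return 0
    fi
    # Stop polling if the cluster is shutting down or already gone.
    case "$state" in
      TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
    esac
    sleep 30
  done
}
```

Calling wait_for_tajo_cluster with your cluster ID blocks until the cluster reports WAITING and returns a non-zero status if the cluster terminates first.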
Deepdive: Advanced cluster configuration
With the default configuration above, the Tajo cluster stores table data in EMR HDFS and saves its meta data in a Derby database on the master node. This means your data and table schemas are no longer accessible once the Tajo cluster instances are terminated. While this is enough for testing, it is recommended to keep your table data and meta data in permanent storage, such as S3 and RDS respectively, for uninterrupted analysis work.
Now let’s configure a Tajo cluster for realistic use cases:
- Store Tajo table data in your S3 bucket
- Use RDS for Tajo meta store
- Use custom configuration parameters optimized for your instance type (e.g. set heap size and concurrency settings based on the system resources).
Use S3 bucket for Tajo root directory
The Tajo root directory contains configuration files, a warehouse directory, and a temporary directory for Tajo tasks. If you don’t have an S3 bucket for Tajo yet, go to AWS console > S3 and create one. The bucket name must be globally unique.
To use your S3 bucket for Tajo, set the tajo.rootdir property to an S3 path in tajo-site.xml.
<property><name>tajo.rootdir</name><value>s3://mybucket/tajo</value></property>
Or you can do it by passing the “-s” parameter to the bootstrap action script, e.g.
- Args=["-s", "tajo.rootdir=s3://mybucket/tajo"]
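If you go the tajo-site.xml route, the file can be generated rather than written by hand. The sketch below is only an illustration: write_site_xml is a hypothetical helper, the bucket name is a placeholder, and it assumes tajo-site.xml uses the Hadoop-style layout in which property elements sit inside a configuration root element.

```shell
#!/bin/sh
# Write a Hadoop-style configuration file from key=value arguments.
# Values are written as-is, so anything containing & or < must be
# XML-escaped by the caller.
# Usage: write_site_xml FILE key=value [key=value ...]
write_site_xml() {
  out=$1
  shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    for kv in "$@"; do
      name=${kv%%=*}    # text before the first "="
      value=${kv#*=}    # text after the first "="
      printf '  <property><name>%s</name><value>%s</value></property>\n' \
        "$name" "$value"
    done
    echo '</configuration>'
  } > "$out"
}

# Example (hypothetical bucket name and local directory):
#   write_site_xml conf/tajo-site.xml "tajo.rootdir=s3://mybucket/tajo"
#   aws s3 cp conf/tajo-site.xml s3://mybucket/tajo/conf/tajo-site.xml
```

The split on the first "=" means values that themselves contain "=" (such as a JDBC URI with query parameters) are kept intact.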
Prepare RDS for Tajo meta store
Tajo saves its meta data, such as table schemas, to the Tajo meta store. To use RDS as the Tajo meta store:
- Go to AWS console > RDS.
- Launch a DB instance.
- In the “Security Groups” settings, allow access to your DB instance from the Tajo master.
- Place “mysql-connector.jar” in your S3 bucket and specify its path in the bootstrap actions, e.g.
- Args=["-l", "s3://MY_TAJO_BUCKET/lib"]
- Create your custom catalog-site.xml with the DB connection information, as described in the Catalog configuration documentation. Place the catalog-site.xml file in your S3 bucket and specify its directory path in the bootstrap actions, e.g.
- Args=["-c","s3://MY_TAJO_BUCKET/tajo/conf"]
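The connection settings for a MySQL-backed meta store can also be assembled in a script, which keeps the password out of files you might share. This is a hedged sketch: catalog_settings is a hypothetical helper, the property names and JDBC URI shape are copied from the full example command later in this post, and the database name "tajo" and port 3306 are taken from that example as well.

```shell
#!/bin/sh
# Print the catalog settings for a MySQL-backed meta store as one line of
# space-separated key=value pairs. Values must not contain spaces, because
# the space is the separator.
# Usage: catalog_settings RDS_ENDPOINT DB_USER DB_PASSWORD
catalog_settings() {
  endpoint=$1
  user=$2
  password=$3
  printf '%s %s %s %s\n' \
    "tajo.catalog.store.class=org.apache.tajo.catalog.store.MySQLStore" \
    "tajo.catalog.jdbc.connection.id=${user}" \
    "tajo.catalog.jdbc.connection.password=${password}" \
    "tajo.catalog.jdbc.uri=jdbc:mysql://${endpoint}:3306/tajo?createDatabaseIfNotExist=true"
}

# Example, reading the secret from an environment variable
# (RDS_ENDPOINT and TAJO_DB_PASSWORD are placeholder names):
#   catalog_settings "$RDS_ENDPOINT" tajo "$TAJO_DB_PASSWORD"
```

The printed key=value pairs can be copied into catalog-site.xml as properties or passed together as the value of the “--site” argument.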
Use custom config files
You may want to launch your Tajo cluster with custom config files, such as the tajo-site.xml and catalog-site.xml above. To use your custom config files, place them in an S3 path and specify the directory path in the bootstrap actions, e.g.
- Args=["-c","s3://mybucket/tajo/conf"]
In many cases, you can gain performance by tuning parameters such as heap size and concurrency settings for your instance type. Tajo configuration files optimized for each EMR instance type are available. You can use these default config files as they are or make your own by modifying them.
- s3://tajo-emr/template/tajo-x.x.x/ (e.g. s3://tajo-emr/template/tajo-0.11.x/c3.2xlarge/)
For more information on advanced Tajo configuration, refer to the Tajo configuration guide. You can also find some configuration templates pre-tuned for the instance types here.
Example command with S3, RDS and custom config directory
* EMR 3.x
aws emr create-cluster \
  --name "CLUSTER_NAME" \
  --ami-version 3.8.0 \
  --no-auto-terminate \
  --use-default-roles \
  --ec2-attributes KeyName=KEY_PAIR_NAME \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=c3.2xlarge \
  --bootstrap-action \
Path=s3://tajo-emr/emr-3.x/install-tajo.sh,\
Args=["-t","https://dist.apache.org/repos/dist/release/tajo/tajo-0.11.1/tajo-0.11.1.tar.gz",\
"--conf","s3://tajo-emr/template/tajo-0.11.x/c3.xlarge/conf",\
"--lib","http://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar",\
"--site","tajo.rootdir=s3://MY_TAJO_BUCKET/tajo \
tajo.catalog.store.class=org.apache.tajo.catalog.store.MySQLStore \
tajo.catalog.jdbc.connection.id=tajo \
tajo.catalog.jdbc.connection.password=YOUR_RDS_PASSWORD \
tajo.catalog.jdbc.uri=jdbc:mysql://gaia-tajo-catalog.c2p4kmiqt5tf.us-east-1.rds.amazonaws.com:3306/tajo?createDatabaseIfNotExist=true"]
* EMR 4.x
aws emr create-cluster \
  --name "CLUSTER_NAME" \
  --release-label=emr-4.1.0 \
  --no-auto-terminate \
  --use-default-roles \
  --ec2-attributes KeyName=KEY_PAIR_NAME \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=c3.2xlarge \
  --bootstrap-action \
Path=s3://tajo-emr/emr-4.x/install-tajo.py,\
Args=["-t","https://dist.apache.org/repos/dist/release/tajo/tajo-0.11.1/tajo-0.11.1.tar.gz",\
"--conf","s3://tajo-emr/template/tajo-0.11.0/c3.xlarge/conf",\
"--lib","http://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar",\
"--site","tajo.rootdir=s3://MY_TAJO_BUCKET/tajo \
tajo.catalog.store.class=org.apache.tajo.catalog.store.MySQLStore \
tajo.catalog.jdbc.connection.id=tajo \
tajo.catalog.jdbc.connection.password=YOUR_RDS_PASSWORD \
tajo.catalog.jdbc.uri=jdbc:mysql://gaia-tajo-catalog.c2p4kmiqt5tf.us-east-1.rds.amazonaws.com:3306/tajo?createDatabaseIfNotExist=true"]
Launch Tajo cluster in EMR console
Alternatively, you can launch your Tajo cluster from the EMR console web interface.
- Go to the EMR console and click Create Cluster.
- (Updated) Click “Go to advanced options” next to “Quick cluster configuration”.
- In the Cluster Configuration section, enter a Cluster Name.
- In the Software Configuration section, choose “Amazon” as the Hadoop distribution and 3.5.0 or higher as the AMI version. In the Applications to be installed table, delete unnecessary applications.
- In the Hardware Configuration section, choose the instance type and number of instances for the Master and Core nodes.
  - For the Master node, where the Tajo master is installed, an m3.large or higher-spec instance is generally recommended.
  - For Core nodes, where Tajo workers are installed, c3.xlarge or higher is generally recommended.
  - In Core instance count, set the number of worker nodes in your Tajo cluster.
  - Set “0” in Task instance count. Tajo does not use Task instances.
- In the Security and Access section, choose your EC2 key pair for SSH connections.
- In the Bootstrap Actions section, choose Custom action and then click the Configure and add button.
- In the Add Bootstrap Action popup window:
  - In Name, enter a bootstrap action name, for example “Tajo EMR”.
  - In S3 location, enter the S3 path of the bootstrap action script, e.g. “s3://tajo-emr/emr-3.x/install-tajo.sh”.
  - In Optional arguments, enter the argument string for the bootstrap action, e.g.
    - -c s3://tajo-emr/template/tajo-0.11.0/c3.xlarge/conf -l http://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar
- Click the Create cluster button to start launching the EMR cluster. Check the status of your Tajo cluster in AWS console > EMR.
Now you have your own Tajo cluster on AWS! The next stage is where the fun starts: Running big data analysis with Tajo.
Play with Tajo on AWS
You can run queries in the Tajo command line shell (TSQL) on your Tajo master node.
$ ssh -i keyname.pem hadoop@your-master-node-ip
$ /home/hadoop/tajo/bin/tsql
Try \? for help.
default>
Tajo can access data stored on S3 as an external table, as in the following example:
CREATE EXTERNAL TABLE wikistats (
  language TEXT,
  page_title TEXT,
  hits BIGINT,
  retrived_size BIGINT
) USING TEXT WITH ('text.delimiter'=' ')
LOCATION 's3a://support.elasticmapreduce/training/datasets/wikistats/';

SELECT language, avg(retrived_size) as avg_size
FROM wikistats
GROUP BY language
ORDER BY avg_size DESC
LIMIT 20;
For further information on Tajo’s SQL language features, see the Apache Tajo documentation.
Monitor Tajo cluster status with Tajo web admin console
You can then monitor Tajo’s cluster status, jobs and resource usage in the admin panel located at http://your_master_node_ip:26080/ .
To access this panel, you need to allow access to port 26080 in your security group settings. Go to AWS console > EC2 > Security Groups, choose the security group for the Tajo master (e.g. ElasticMapReduce-master), and add the port in the Inbound tab.
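The same inbound rule can be added from the command line. The sketch below is an illustration only: the function names are made up, and the security group ID (sg-...) and CIDR range are placeholders you would replace with your own values. Restricting the rule to your own address (x.x.x.x/32) is a safer choice than opening the port to everyone.

```shell
#!/bin/sh
# Print the admin console URL for a master node host or IP address.
# Port 26080 is the one named in this post.
tajo_admin_url() {
  echo "http://$1:26080/"
}

# Allow inbound TCP traffic on port 26080 from one CIDR range.
# Usage: allow_admin_console SECURITY_GROUP_ID CIDR
allow_admin_console() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 26080 \
    --cidr "$2"
}
```

After the rule is in place, open the URL printed by tajo_admin_url with your master node address in a browser.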
Clean up
When you are done running queries using Tajo, you should shut down your cluster to avoid incurring further charges.
- Disconnect from the master node by terminating your SSH session.
- From the command line on your local machine, run the following command to terminate your Amazon EMR cluster. Replace “j-T9RFQLKZX9CK” with the identifier of your cluster.
  aws emr terminate-clusters --cluster-ids "j-T9RFQLKZX9CK"
- Delete any log files stored in your S3 bucket for Tajo. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/UG/DeletinganObject.html
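The last two clean-up steps can be combined into a small script. This is a sketch under stated assumptions: the function names and the DRY_RUN switch are inventions for illustration, and the log path is hypothetical, since where your logs live depends on how the cluster was launched.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Terminate the cluster, then delete everything under the log path.
# Usage: cleanup_tajo_cluster CLUSTER_ID S3_LOG_PATH
cleanup_tajo_cluster() {
  cluster_id=$1
  log_path=$2   # e.g. s3://MY_TAJO_BUCKET/logs/ (hypothetical location)
  run aws emr terminate-clusters --cluster-ids "$cluster_id"
  run aws s3 rm "$log_path" --recursive
}
```

Running it once with DRY_RUN=1 prints the two commands without executing them, which is a cheap way to confirm the cluster ID and path before anything is deleted.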
Conclusion
You can set up a Tajo cluster on EMR using bootstrap actions in minutes. With direct access to S3 data and its performance advantage, Tajo can be a powerful solution for your big data analysis in the cloud.