Getting started with Tajo on your desktop

Apache Tajo™ is a next-generation big data warehouse system designed for large-scale distributed data processing. It can run on hundreds of nodes and crunch massive-scale workloads. But did you know it can also scale right down to run on a MacBook?

With Gruter’s new Tajo Desktop Package, you can install Tajo on Mac and Linux, and immediately set to work on data analysis. Not only is Tajo Desktop Package a handy tool, it’s a quick and easy way to explore the functionality and power of Apache Tajo.

Spreadsheets such as Excel are great work tools, but they have difficulty dealing with millions of records. DBMS engines such as MySQL or Microsoft Access depend on additional ETL jobs to load data, and even then struggle with large data sets. In contrast, Tajo Desktop Package lets you carry out interactive data analysis and long-running batch queries on large data sets, right on your laptop.

What is more, with Tajo Desktop Package, you can run SQL queries directly against big raw data files, free of the burden of additional ETL jobs.

Download & Install

1. Download the Tajo Desktop Package (for Mac and Linux)

Download the package and extract the files.

$ tar xvfz tajo-0.9.x-pc-x.x.tar.gz
$ cd tajo-0.9.x

2. Configure Tajo

Set JAVA_HOME, Tajo directories and heap memory size.

$ bin/configure.sh

3. Initiate Tajo

Initiate the Tajo master and worker(s).

$ bin/startup.sh

4. Load the sample data set (optional)

The script below will generate a sample database with 8 tables based on the TPC-H test data set. Make sure Tajo has been properly initiated before running this command.

$ bin/make-test.sh

5. Run the Tajo command-line shell (TSQL)

$ bin/tsql
Try \? for help.
default>

That’s it! Now you’re ready to use Tajo!
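
Since step 2 asks for JAVA_HOME, it may be worth checking beforehand that the variable points at a usable Java installation. The snippet below is a minimal sketch of such a check. The helper name check_java_home is hypothetical and not part of the Tajo Desktop Package, and the test it applies (an executable bin/java under the directory) is an assumption about what counts as usable.

```shell
#!/bin/sh
# Preflight sketch: does a directory look like a usable JAVA_HOME?
# check_java_home is a hypothetical helper, not shipped with Tajo.
check_java_home() {
    dir="$1"
    # An empty value means JAVA_HOME was never set.
    [ -n "$dir" ] || return 1
    # Assumption: a usable Java home contains an executable bin/java.
    [ -x "$dir/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no executable bin/java"
fi
```

If the check reports a problem, setting JAVA_HOME in your shell before running bin/configure.sh is a reasonable first thing to try.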

Play with the sample data set

Begin with the sample TPC-H data set included in the package.

default> \c tpc_h10m
You are now connected to database "tpc_h10m" as user "username".

tpc_h10m> \d
customer
lineitem
nation
orders
part
partsupp
region
supplier

As an example, calculate the total order amount by country by joining 3 tables:

tpc_h10m> SELECT n.n_name as nation, sum(o.o_totalprice) as order_amount 
FROM customer c, nation n, orders o 
WHERE c.c_nationkey = n.n_nationkey and o.o_custkey = c.c_custkey 
GROUP BY c.c_nationkey, n.n_name 
ORDER BY n.n_name;

nation,  order_amount
-------------------------------
ALGERIA,  827414.8900000001
ARGENTINA,  1064770.36
BRAZIL,  870015.82
CANADA,  814680.27
...
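
If you want to rerun the query above without retyping it, one option is to keep it in a file. The sketch below saves the same statement to a file and, when bin/tsql is present, passes the file to it. The helper name write_nation_query and the file name are hypothetical, and the -f option with a trailing database name is an assumption about tsql's command-line interface, so check tsql's built-in help output before relying on it.

```shell
#!/bin/sh
# write_nation_query is a hypothetical helper that saves the join query
# from this section to the file given as its first argument.
write_nation_query() {
    cat > "$1" <<'EOF'
SELECT n.n_name AS nation, sum(o.o_totalprice) AS order_amount
FROM customer c, nation n, orders o
WHERE c.c_nationkey = n.n_nationkey AND o.o_custkey = c.c_custkey
GROUP BY c.c_nationkey, n.n_name
ORDER BY n.n_name;
EOF
}

# Hypothetical file name, chosen for illustration only.
query_file="${TMPDIR:-/tmp}/order_amount_by_nation.sql"
write_nation_query "$query_file"

# Assumption: tsql accepts "-f <file>" followed by a database name.
if [ -x bin/tsql ]; then
    bin/tsql -f "$query_file" tpc_h10m
else
    echo "bin/tsql not found; run this from the Tajo directory"
fi
```

Run it from the directory where you extracted the package, after Tajo has been initiated and the sample data set loaded.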

Analyze your own data

You can create external tables with your own data files and run SQL queries on them directly.

1. Select your data source

(If you don’t have your own local data set, you may wish to download a public data set from a reputable online source, such as Public Data Sets on AWS or Google public data.)

To keep things simple, let’s prepare the orders data included in the TPC-H data set.

$ cp  your_tajo_dir/data/tpc-h10m/orders.txt  /tmp/

You will note the first line of “orders.txt” looks as follows:

1|36901|O|173665.47|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among |

2. Create your database

default> CREATE DATABASE IF NOT EXISTS mydb;
OK
default> \c mydb
You are now connected to database "mydb" as user "username".

3. Create external table with your data

mydb>  CREATE EXTERNAL TABLE IF NOT EXISTS mydb.orders 
(O_ORDERKEY bigint, O_CUSTKEY bigint, O_ORDERSTATUS text, 
O_TOTALPRICE double, O_ORDERDATE text, O_ORDERPRIORITY text, 
O_CLERK text, O_SHIPPRIORITY int, O_COMMENT text) 
USING CSV with ('csvfile.delimiter'='|') 
LOCATION 'file:///tmp/orders.txt';

mydb>  SELECT o_orderkey, o_custkey, o_orderdate FROM orders;

o_orderkey,  o_custkey,  o_orderdate
-------------------------------
1,  36901,  1996-01-02
2,  78002,  1996-12-01
3,  123314,  1993-10-14
...
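
When you define an external table over your own file, it helps to confirm the delimiter and the number of fields per row before writing the column list. The sketch below does this with awk, outside of Tajo, using the first line of orders.txt shown earlier. The helper name count_fields is hypothetical, and the temporary file stands in for your real data file.

```shell
#!/bin/sh
# count_fields is a hypothetical helper: it prints how many
# delimiter-separated fields the first line of a file contains.
count_fields() {
    file="$1"
    delim="$2"
    awk -F"$delim" 'NR == 1 { print NF; exit }' "$file"
}

# Stand-in for /tmp/orders.txt: the sample row quoted in this article.
sample=$(mktemp)
printf '%s\n' '1|36901|O|173665.47|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among |' > "$sample"

# List each field with its position, then print the total count.
awk -F'|' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i; exit }' "$sample"
count_fields "$sample" '|'

rm -f "$sample"
```

For this row awk reports ten fields, because the line ends with a trailing delimiter and awk counts the empty field after it. The nine populated fields line up with the nine columns declared in the CREATE EXTERNAL TABLE statement.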

For more information on Tajo SQL syntax, refer to the Tajo SQL language reference.

Monitor Tajo in the Tajo admin panel

You can monitor Tajo’s cluster status, jobs and resource usage in the admin panel at http://localhost:26080/.

Conclusion

Running advanced data analysis on the Tajo Desktop Package is as easy as that!

By initiating a single-node Tajo cluster on your desktop, you can run queries on local data files without the need to convert data or run ETL jobs. Apache Tajo is also highly scalable: as your data grows, Tajo can scale out to a fully distributed environment without a change of architecture.