Faster ETL with Tajo on Big Telco
The challenge was to remove data processing bottlenecks by replacing a legacy MPP DBMS with new, open-source alternatives, accelerating the business analysis and reporting cycle and, as a consequence, improving customer service responsiveness.
The timely analysis of massive data sets is one of the most pressing technical challenges in the telecoms industry today. For SK Telecom, South Korea’s largest wireless communications provider, this is truer than ever. Facing fierce competition in one of the most mobile-reliant countries on earth, big data analysis at SKT is directed towards providing quality of service to end users, and attending to user needs in real time.
In an industry where data stores expand at rates well above an astonishing 100TB per week, the commercial MPP DBMS that SKT was using for its ETL processing encountered major scalability and cost limits. In response, the company took the bold step of replacing its traditional data integration layer with a combination of Hadoop and Hive.
To a certain extent, Hive-on-Hadoop—working in combination with a standard-issue DW stack—represented a significant improvement for the company. But the explosion of smartphone usage rates quickly began straining even this new, improved combination. The need for near real-time reporting on SKT’s multi-PB data warehouse had brought the company’s IT department to a new fork in the road.
Expansion options were available, but at a price: US$10K per data node, with a total capital burden on the order of US$5M for a solution that would still fall well short of complete, real-time data movement. Such are the data warehousing scalability constraints facing many companies today.
To meet the company’s pressing analytics needs, SKT began the search for alternative ETL processing solutions, trialing Cloudera Impala, Apache Tez and Apache Tajo, among others.
“We searched through much of the available technology in a bid to better analyze our enormous volumes of data, and Tajo was the solution which stood out most,” said Keuntae Park, an IT manager at SKT.
Still in an early phase of development at the time, Tajo performed strongly on realistic massive data workloads thanks to its efficient distributed processing engine and advanced query optimization techniques.
According to rigorous internal testing at SKT, Tajo clocked query speeds 3.7 times faster than Hive, in effect cutting processing time by roughly 70%. Not only did Tajo shine in long-running batch processing, it also provided the low latency and interactive performance needed for ad-hoc queries.
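The relationship between those two figures is simple arithmetic: a query that runs 3.7 times faster takes 1/3.7 of its original time, which is roughly a 70% reduction. A quick check (illustrative only):

```python
# A 3.7x speedup means each query takes 1/3.7 of its former runtime.
speedup = 3.7
time_fraction = 1 / speedup      # fraction of the original runtime remaining
reduction = 1 - time_fraction    # fraction of the original runtime eliminated

print(f"remaining: {time_fraction:.0%}, reduction: {reduction:.0%}")
# remaining: 27%, reduction: 73% -- consistent with the quoted ~70%
```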
The migration from Hive to Tajo went smoothly thanks to Tajo's comprehensive SQL compliance and Hive compatibility. Given Tajo’s support for ANSI SQL and the Hive metastore, most of SKT’s pre-existing Hive queries and data objects were reused as-is, or with only slight modification (such as rewriting Hive UDFs to standard SQL).
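As a hypothetical illustration of this kind of migration (the UDF names below are invented, not SKT's actual functions), a Hive query that calls a custom UDF can often be rewritten using standard SQL expressions that Tajo understands natively. A minimal Python sketch of a rule-based rewrite; a real migration would use a proper SQL parser rather than regexes:

```python
import re

# Hypothetical rewrite rules: each maps an invented Hive UDF call pattern
# to an equivalent ANSI SQL expression.
REWRITE_RULES = [
    # a custom masking UDF replaced by standard string functions
    (re.compile(r"mask_msisdn\((\w+)\)"),
     r"CONCAT(SUBSTR(\1, 1, 3), '********')"),
    # a custom null-default UDF replaced by standard COALESCE
    (re.compile(r"default_if_null\((\w+),\s*('[^']*')\)"),
     r"COALESCE(\1, \2)"),
]

def rewrite_hive_query(sql: str) -> str:
    """Apply each UDF-to-ANSI-SQL rule in turn."""
    for pattern, replacement in REWRITE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql

hive_sql = "SELECT mask_msisdn(phone), default_if_null(plan, 'unknown') FROM subscribers"
print(rewrite_hive_query(hive_sql))
# SELECT CONCAT(SUBSTR(phone, 1, 3), '********'), COALESCE(plan, 'unknown') FROM subscribers
```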
With the move to Apache Tajo, SKT was able to process its massive ETL data workloads much faster than had been the case prior to the migration from their commercial MPP DBMS to Hive, and then to Tajo. The company’s analytics teams were able to start their analyses hours and even days earlier, and the company was able to make decisions and manage changing conditions far more responsively.
For the company bean counters, Tajo’s high throughput and the Tajo-on-Hadoop architecture brought enormous cost advantages: less hardware, less technical HR, and less data processing time all meant savings and freed-up capital that could be allocated elsewhere.
For SKT’s IT department, the project resulted in a huge tick of approval for its new embrace of open-source software solutions. Hive had been a step in the right direction for the company, but Tajo’s query speed on massive data sets was seen as the logical successor to the company’s legacy commercial MPP DBMS. Handling the wilds of enterprise telecommunications with aplomb, Apache Tajo—with its scalability, speed, stability, and ease of integration—represented a big win for Park and his colleagues, and an even bigger win for SKT’s customers.
Fast, Telco-Scale EDW with Tajo-on-Hadoop
The challenge was to replace costly, commercial EDW infrastructure with the massively-scalable, open-source combination of Tajo and Hadoop.
SK Telecom, South Korea's largest mobile carrier, focuses its big data analysis efforts on customer retention and customer experience improvement. In an industry where service quality can win and lose customers in moments, that means identifying and responding to data signals in near real-time.
But achieving real-time analysis in an industry which handles mountainous volumes of data is no easy business task. Not only are SKT’s data requirements growing exponentially, but traditional DW cost structures make it difficult to implement such massive data analysis in a cost-effective manner.
Having already replaced its legacy vendor MPP DBMS with Apache Tajo in order to speed up its ETL process (see this Field Case), SKT faced a new tech dilemma: Should it extend its existing vendor data warehouse, or turn to new, open-source technology with an eye to the future?
To put this business challenge into perspective, recent SKT estimates put its data store growth at well over a phenomenal 250TB every single day, with total holdings reaching a mammoth 91PB by the end of September 2014. Not only does this astonishing amount of data have to be extracted, transformed and loaded into data warehouses (again, see this Field Case), it also has to be distilled and reorganized into more manageable low-latency data marts for faster analysis and reporting across the company’s business departments.
The Hadoop story is by now well-documented. With its capacity for storing and processing enormous amounts of data cost-effectively, and with its ability to scale with efficiency, it has rapidly become a trusted, mainstream business technology.
However, storing and processing data are only part of the business IT story; stored data must also be broken down into manageable subsets in order to be queried at everyday enterprise speeds. Even with the rise of Hadoop, this data preparation task has largely remained within the purview of traditional data warehousing technologies.
Fortunately, this problem also happened to catch the eye of the Hadoop crowd and its ecosystem of enthusiastic developers and visionaries. Buoyed by the success of the yellow elephant and its menagerie of spin-offs, it wasn’t long before these Young Turks turned their attention to re-writing the rules of data warehousing.
And thus was born a new generation of data technologies: SQL-on-Hadoop.
Meanwhile, the IT Team at SKT began reimagining its data infrastructure to cope with its burgeoning data holdings. Having already upgraded its ETL processing infrastructure from a vendor MPP DBMS to Hive-on-Hadoop, and then to Tajo-on-Hadoop (again, see this Field Case), SKT’s IT Department began running tests on Tajo’s enhanced features such as its columnar storage facility. Tajo was already handling big data analytics with cost-effective speed and sophistication; might it also transform SKT’s Big Data Warehousing performance?
According to Apache Tajo’s architectural design, refined data sets stored in Tajo tables could be directly accessed by standard-issue BI and OLAP reporting tools, saving time by removing the need for an additional data mart layer. If this proved effective under SKT’s heavy-duty internal conditions, it would not only boost the company’s data distribution and retrieval speeds, it would dramatically cut the cost burden of vendor software licensing.
With business departments and marketing teams banging down the doors of the IT department demanding ever-faster analytics capabilities, a new DW project was organized with little time to spare.
It didn’t take long for SK Telecom’s new Tajo-on-Hadoop “Big DW” system to start kicking goals for the company. With eight times the storage capacity of its vendor MPP DBMS predecessor, Tajo-on-Hadoop quadrupled SKT’s per-volume ETL processing speed, at a mere 10-20% of the cost of the MPP DBMS vendor alternative.
Just to reiterate those jaw-dropping gains: By migrating from its vendor MPP DBMS to Tajo-on-Hadoop, SK Telecom expanded its data store capacity eightfold and quadrupled its per-volume ETL processing speed, at just 10-20% of the cost of the traditional vendor alternative.
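Taken together, those numbers imply a striking price-performance gain. Assuming the midpoint of the quoted cost range (15% of the vendor system's cost, an assumption for illustration), fourfold throughput works out to roughly 27 times the throughput per unit of cost:

```python
throughput_gain = 4.0    # per-volume ETL speed vs. the vendor MPP DBMS
cost_fraction = 0.15     # assumed midpoint of the quoted 10-20% cost range

price_performance = throughput_gain / cost_fraction
print(f"~{price_performance:.0f}x throughput per unit cost")  # ~27x
```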
This meant that users across SKT’s business departments gained access to “real-time” (2-3 seconds on average) data analysis, not only when running simple, structured queries, but more importantly when executing advanced, complex OLAP queries. Meanwhile, standard business OLAP queries maintained an expected sub-second performance.
Additionally, Tajo empowered SKT business departments with the ability to run direct, analytical-level queries on HQ massive data sets, without the burden of intermediate data marts and cumbersome IT department protocols. With the added convenience of running ad hoc queries through familiar BI/OLAP solutions, Apache Tajo has provided SK Telecom with a glimpse into the real-time future of business information.
E-Commerce Recommendation Platform
In order to provide an industry-leading online shopping and discovery experience, GS Shop re-imagined their platform and re-engineered it with the Hadoop Ecosystem.
GS Shop is a major Asian broadcast and online retailer, providing unique shopping experiences to customers through a range of media, from FTA and cable TV, to the Internet and mobile devices, and traditional shopping catalogs. The company focuses on providing a superior shopping experience by helping customers discover and access products and deals which both delight and surprise.
"With the advent of the internet, consumers now have access to an abundance of information. However, the sheer volume of available information can be overwhelming. Some consumers find it difficult to determine what the best deals are and where to find them. Based on our extensive experience operating diverse online retail platforms, GS Shop helps customers make the smartest shopping decisions by providing first-rate products and honest information."
- GS Shop
One of the most effective ways of helping customers in online and mobile environments is through the use of recommendation systems or “reco engines”, based on consensually-shared user behavior data and transaction history. GS Shop had originally deployed a commercial vendor recommendation engine to do the job, but was far from happy with the results: The vendor solution was far from data-driven and heavily dependent on operator heuristics and input, requiring the system logic to be constantly re-modeled and updated in order to maintain high levels of relevancy and, therefore, customer satisfaction.
Moreover, GS Shop’s legacy recommendation system lacked broader company integration capabilities, with its vital data siloed and lacking service-wide access and delivery.
This lack of integration extended to the company’s search solution. Using still another commercial solution for its product search requirements, GS Shop found it extremely difficult to quickly roll out new service requirements, search features, search quality improvements and user experience enhancements. GS Shop desperately needed a more flexible, customizable and extensible search engine—without the heavy burden of increased licensing costs.
In time, one thing became patently clear to the company: Big data analysis was the key to its future and the foundation of its competitive advantage. Rising to the challenge, GS Shop acted quickly to design and develop a high-spec, fully-integrated, world-class recommendation platform based on the open-source Hadoop ecosystem.
Rather than rely on isolated solutions, the company wanted to control its own service infrastructure, adding new applications on spec, and updating and maintaining existing applications as required without delay. One such application would of course be a new, cutting-edge, data-driven recommendation system.
In order to complete the job, they needed a partner who would understand their vision and realize it through a professional end-to-end design, development, testing, training, rollout and feedback process—backed by world-class support and service agreements. It wasn’t long before they booked an appointment with Hadoop ecosystem experts, Gruter.
The project was divided into three phases.
The first phase focused on in-house consulting in order for Gruter to become intimately acquainted with GS Shop’s needs and internal resources. Intensive Hadoop and Hadoop ecosystem training sessions bridged knowledge deficits, established engineering camaraderie, and built a shared understanding of the mission. A pilot project was implemented in order to test and refine this understanding, and to build the internal resources GS Shop needed to carry out its vision moving forward.
The second phase involved the design and development of a brand new Hadoop-based big data platform from scratch. The Hadoop ecosystem components brought together in order to realize the platform’s key requirements included the following:
The second phase also involved two intensive training sessions on Hadoop operation and management, and the effective use of Gruter’s time-saving Cloumon management solution.
The third phase focused on improving the company’s search system. The prior commercial search engine was replaced with Gruter's open-source search engine solution, Sprinter. The core components of Sprinter included the following:
In addition to the platform, Gruter also equipped GS Shop engineers with a systematic methodology and process for managing and improving search quality. This involved data modeling, ranking tuning, and a search quality evaluation and feedback cycle. Instilling in the GS Shop engineering team an intimate knowledge of search theory, methodology and procedures enabled the company to implement one of its core strategic goals: The use of search quality as a major operational KPI.
THE BOTTOM LINE
GS Shop’s brand new big data platform has helped it re-write the customer experience rulebook. With full control of its own high-speed, heavy duty, fully-integrated big data platform, the company is able to implement a wide range of strategic business initiatives, from enhanced customer recommendation, to user behavior analysis, and data-driven decision making.
Additionally, GS Shop now has its own highly-trained team which has developed alongside the new platform, having nurtured it from its earliest days, and mastered it since. This has given GS Shop enormous internal facility when it comes to responding to rapidly-changing retail and behavioral trends, building and modifying its own applications in rapid response to the customer insight it gleans from its new, integrated and powerful Hadoop-based big data platform.
Digital Music Marketing Insight Platform
Digital music service provider Melon has transformed itself into a major digital music ecosystem through its Hadoop-based big data platform. Building more than simply a data platform, the innovative company also built its own data-driven operations processes and user experience management system, critical planks of its broader big data platform strategy.
Provided by Loen Entertainment Inc., Melon is the largest digital music service platform in South Korea, delivering over 2.6M songs to more than 24M users who access the company’s services on PC and mobile, and through new smart TV options.
With its extensive experience in the world of online digital music, Melon has long had a comprehensive, forward-thinking business strategy revolving around a high-tech marketing platform which connects the key stakeholders of the digital music world—consumers, fans, artists and record labels—into a single, integrated ecosystem. With massive data holdings accumulated over ten years of successful business operations, the company knew it had the raw materials of something special. The centerpiece of the vision was to be the company’s own big data platform, finely-tuned and specced according to business intelligence derived from the company’s decade-long, intensive engagement with its users.
Melon’s data stockpile included user behavior data, transaction history, click streams and metadata derived from specific services, stored in both files and database formats. However, as with many industries rooted in the rise of broadband and mobile technologies over the past decade, the business challenge for Melon has never been about a lack of data. The challenge has always been about storing, processing, organizing and querying that data, putting it to work on behalf of the company’s expert business knowledge.
Facing growing data processing bottlenecks, and with batch analysis on decreasing fractions of the company’s data holdings taking longer by the day, enormous swathes of Melon’s rich trove of business information were at risk of being filed away with ageing documents in the company’s basement. Drastic action was needed on the IT front for the company to start extracting full value from its data assets.
"Initially, we tried to process our big data with a commercial DW solution, but the up-front licensing costs and maintenance burden were becoming increasingly prohibitive. The limitations of depending on vendor solutions were clear. We concluded the key to our success was building our own platform, and applying our internal knowledge and experience first to configuring big data analysis, and then implementing it in our daily operations. Investing ourselves in a comprehensive two-year technology transformation, we partnered with Gruter from the ground up. They helped us design, build and implement a big data analysis platform that would give us full value from our data, and help drive our business into the future."
- Byoung-hwa Yoon, IT Manager, Loen Entertainment
In order to gain full value from the company’s sophisticated data holdings, not only did Melon need a data platform that could handle massive data sets, but it needed to be able to interact with TBs of data, analyze it at speed, and convert that analysis into algorithms and heuristics derived from the company’s business insight.
Melon started with a big data platform built on open-source Hadoop ecosystem technologies: Flume for real-time log collection, Sqoop for loading data into an RDBMS, and HBase for storing analysis results and responding to user behavior in real time. Hive was initially used for large-scale data analysis, with Apache Tajo now being implemented for superior ETL performance.
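The division of labor in such a pipeline can be sketched schematically: a collector gathers raw events (Flume's role), a batch job aggregates them (Hive/Tajo's role), and the results land in a key-value store for real-time lookup (HBase's role). The Python below is a purely illustrative, in-process stand-in for those components, not Melon's actual code:

```python
from collections import Counter

# --- Flume's role: collect raw play-log events (here, a static sample) ---
raw_events = [
    {"user": "u1", "song": "s9"},
    {"user": "u2", "song": "s9"},
    {"user": "u1", "song": "s3"},
]

# --- Hive/Tajo's role: batch aggregation over the collected events ---
def aggregate_play_counts(events):
    """Count plays per song, as a batch ETL job would."""
    return Counter(e["song"] for e in events)

# --- HBase's role: a key-value store serving results in real time ---
results_store = {}
results_store.update(aggregate_play_counts(raw_events))

# A real-time chart service would now read directly from the store:
print(results_store["s9"])  # 2
```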
With its shiny new analysis platform in hand, Melon was able to launch exciting new services for users, from real-time music charts and fan ratings to personalized content and sophisticated recommendations. Moreover, such was the scope of the platform that the company was able to launch a new business initiative, Melon’s Partner Center, providing data API access and targeted insight to strategic ecosystem players. This enables specialized services within the digital music community to maximize their expertise and grow their business through better-targeted products, and through channels for communicating with the beating heart of the music industry: Passionate, motivated and energized fans.
"Hadoop enabled us to process the big data we simply couldn’t handle through our legacy commercial DW solution due to technology and cost issues. We believe we are at the beginning of a long and exciting journey. We know we have so much more insight and value to extract from our big data, and so many more exciting applications and services to roll out to our passionate users and partners."
- Kang-seok Kim, IT Director, Loen Entertainment
In the process, Melon also developed practical data analysis tools that let business users explore big data for themselves, expediting the growth of data-driven processes and culture within the company.
Well on its way to achieving its vision of becoming a comprehensive marketing platform which provides data-driven insights to music ecosystem partners, and offers more rewarding and exciting experiences and products for users, Melon is benefiting greatly from its smart big data decisions. Expanding its partner ecosystem at a rate the company had previously only hoped was possible, Melon’s decision to go Hadoop and to go with Gruter is looking more enlightened by the day.
Security log analysis system
The challenge was to build a Big Data system that could handle data from thousands of independent sources in diverse formats.
Collaborating with our multinational IT client, we put our Big Data platform, Qoobah, to work on the problem.
Designed specifically with tasks like this in mind, Qoobah was able to handle the incoming data with ease, giving the client the powerful search capabilities needed to map and analyze security risk signals in real time.
To complete the package, our management console, Cloumon, gave our client confident control of the system, while our technical experts guided our client’s SysAd team members through the commissioning process.
Now, our customer has a centralized repository from which to run security analyses on their system, reducing security threats and protecting valuable intellectual property and assets from theft and damage.
Our customer plans to extend security monitoring to other systems across their distributed networks, and has recently opened their user log data repository to other departments and third parties who are using it to improve internal system design and service delivery.
Real-time log analysis system
The challenge was to collaborate with the R&D department of a major multinational company to model a leading-edge online behavioral analysis system as part of a PoC (Proof of Concept) project.
Drawing on our years of Big Data architecture experience, we were able to help our client map a design which could not only handle the massive set of data involved, but more importantly do so in real time.
Having conquered the main technical challenges of their project, our client is now putting the final touches of their system together based on the results of our Big Data modeling.
Doing things faster and cheaper using Big Data systems often unlocks new ideas previously constrained by the limits of traditional computing.
Bioinformatics data retrieval system for genome data
The challenge was to enter a hot new tech domain with very demanding computational requirements.
Collaborating with our multinational client, we had to solve a massive storage and retrieval problem: the complete genome of each individual maps to a whopping 300GB. Giving our client multi-resolution access to Petabytes of rapidly-expanding genomic data was going to be a difficult task.
Fortunately, we had already tooled our Qoobah Big Data platform with the features necessary to deal with tasks of this magnitude. By combining Hadoop, ZooKeeper and an in-house indexing and retrieval system, we were able to give our client access to the data they required at speed—on time and within budget.
Our client is now managing genomic data far more efficiently and cost-effectively than would have been the case if they had deployed traditional computing products.
This project illustrates, once again, the scalability that typifies Big Data systems and allows innovative companies to push beyond the limits of current industry standards.
Sub-second retrieval from tens of billions of records
The challenge was to build a new generation database that could manage tens of billions of records at breakneck speed without losing data retrieval quality and system availability.
Collaborating with a large international client, our Big Data engineers combined in-house data storage, indexing and partitioning, along with a custom REST-ful serving module, ZooKeeper and Thrift in order to solve the problem.
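While the in-house storage and serving modules aren't public, the partitioning idea behind sub-second retrieval at this scale is standard: hash each record key to one of many fixed partitions, so any record can be located with a single directed lookup instead of a scan. A minimal sketch of such a scheme (hypothetical, not the client's actual design):

```python
import hashlib

NUM_PARTITIONS = 1024  # illustrative partition count

def partition_for(key: str) -> int:
    """Map a record key to a fixed partition via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Writes and reads for the same key always resolve to the same partition,
# so a lookup touches one shard out of 1024 rather than scanning all data.
print(partition_for("record:12345"))
```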
Now our client is successfully managing trillions of records at a speed sufficient to satisfy their end-user objectives—objectives which place their product at the forefront of an enormous growth market. In our experience, speed is often the primary edge a product needs to lead a market.