Below is a transcript of a speech by Jie Jiang, director of Tencent's big data platform and its chief data expert.
Many people know that Tencent swept all four categories of this year's Sort Benchmark competition. Some have asked me how Tencent achieved this and what technology we used.
Today, I would like to share some of the background stories.
I believe many people have seen the advertisement we placed in airports across many cities, which pictures a sprinter. A sorting competition is like a 100-meter Olympic sprint. What I am trying to say, though, is that we are more like marathoners: we have been running a marathon, and it has taken us seven years to get here.
Looking back at these competitions over the past few years, American companies used to monopolize the championships. That changed three years ago, when BAT (Baidu, Alibaba, and Tencent), China's leading Internet firms, began taking the titles. The progress we have made in Internet technology is no less than that of the United States, which shows that the computational power of mainland China is on par with that of the States.
In the past, champion teams normally used Hadoop and Spark, and Tencent's big data platform also started from Hadoop.
It took us several years to win these championships. Throughout, we pursued perfect performance and tried to squeeze the most out of our machines.
First, in terms of reducing cost: the more fully we exploit the hardware, the lower our cost. We used machines built on OpenPOWER, and our cluster was only one-sixth the size of last year's champion's. Based on this year's hardware prices, our total cost of ownership (TCO) is far lower than that of last year's champion.
Second, we optimized deeply at the algorithmic level, so that each machine's CPU, memory, network, and disk I/O could all be pushed to their maximum.
Two of the benchmarks in this competition belong to MinuteSort, where performance depends on how much data you can sort within one minute, so time efficiency is crucial. Our MinuteSort result this year is five times better than last year's, making it the benchmark where we improved the most, and it shows we have a real advantage in using time efficiently. Overall, because we maximized the capability of our hardware, our system outshined the others in this competition. The clusters we used for the competition now serve real-world Internet applications, including high-performance computing, graph computation, deep learning, and other fields.
Looking back over the past seven years, we started building this big data platform on top of Hadoop in January 2009. In those seven years, we have gone through three generations of platform development.
2009 to 2011 was the first generation of our platform. At that time, it only supported batch computing scenarios, such as offline reports. We focused on the scalability of the platform by growing our clusters from fewer than one hundred machines in 2009 to about 30,000 this year. The first generation was about "Scale."
The second generation was about "Real Time." From 2012 to 2014, the platform emphasized online analysis and real-time computing scenarios, such as real-time reports, real-time queries, real-time monitoring, and more.
The third generation runs from 2015 to the present. We now pay more attention to using machine learning tools and platforms to meet the data mining needs of Tencent's various services. Our focus has shifted from data analysis to data mining, in other words, "Data Intelligence."
The first generation was an offline computing architecture based on Hadoop. We named it TDW, the Tencent distributed Data Warehouse.
Hadoop evolved slowly back then, and its cluster scale was small, with low stability and usability; it could not really satisfy Tencent's requirements, so we decided to develop it further to Tencent's standards for production-ready software. We focused on cluster scalability, solving the single-point bottleneck that prevented the master node from scaling out. We optimized scheduling strategies to increase job concurrency, and strengthened HA disaster recovery. More importantly, we built complementary tools and products to flatten Hadoop's learning curve. Syntactically, our platform is compatible with Oracle's SQL dialect, which made it easier for Tencent's product departments to migrate their existing systems. Hadoop is very capable with big data but extremely inefficient with small data sets, so we integrated PostgreSQL to improve analysis of small data sets, joining Hadoop and PG together.
In this way, our platform grew from fewer than one hundred machines to a few hundred, then a few thousand. After a few years, a single cluster reached 4,400 machines, and in 2014 that number went up to 8,800, the largest in the mainland technology industry. Today the platform runs on about 30,000 machines.
TDW solved three issues in our core business.
First, it lets us handle billion-row data sets, finishing the computation in about 30 minutes. This is terabyte- and petabyte-level data analysis capability.
Second, its cost is low. We can use ordinary PC servers to achieve performance that previously required high-end machines.
Third, it enhanced disaster recovery. Previously, if machines crashed, our business data would be affected, and reporting or query services would become unavailable. With TDW, even if machines crash, our core business is not affected at all: the system automatically fails over and restores data from backups.
Because TDW provides such crucial capabilities, business departments were willing to put their data and computations on it. By the end of 2012, we had migrated all of our data from Oracle and MySQL to TDW.
This new platform makes it possible for us to integrate all our products together as a whole.
We used to have a separate database for each product, and these databases were barely connected to one another. Now we have built a unified system that serves billions of users and stores the corresponding user data.
Our user data used to be very basic, mainly fundamental attributes such as age, gender, and region. It used to take more than a week to build a new user profile, and we only updated the data monthly. With TDW, we can update it every day.
This user profile data has been used in our recommendation-related products.
Recommendation systems were still a pioneering application six years ago, but they have become a well-known feature nowadays. TDW offered fast, solid support. Using MapReduce as the standard programming paradigm and TDW as the main platform, we could focus on getting good performance out of various recommendation algorithms, such as collaborative filtering (CF), matrix factorization (MF), and logistic regression (LR), along with hash-clustering algorithms. The recommendation systems we built were mainly about providing users with real-time query services.
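To make the algorithms above concrete, here is a minimal sketch of matrix factorization (MF) trained by SGD on a toy rating matrix. This is an illustration of the general technique, not Tencent's implementation; all names and hyperparameters are my own choices.

```python
import random

def train_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02, epochs=200):
    """Factor a sparse (user, item, rating) list into k-dim user/item vectors."""
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # gradient step with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# toy data: (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, n_users=3, n_items=3)
```

In a MapReduce setting of the kind described above, the gradient updates would be batched per pass rather than applied one record at a time, but the objective is the same.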
The first generation solved the scalability problem, but speed was still an issue: since data arrived offline, computation was also offline, so there was always a latency gap. That is why we moved on to the second generation.
The second generation was built on the first, but integrated Spark, the successor to Hadoop's MapReduce, and Storm, a stream computing framework. The new platform cut our computation latency from hours to minutes, or even seconds.
In terms of data collection, we built TDBank, which replaced the original approach of collecting data through log files with millisecond-level, real-time collection. Our self-developed message-oriented middleware in this platform handles more than 650 billion messages, roughly the highest message volume of any middleware in the world. At the same time, it is highly reliable and guarantees that no message is lost, which is essential for financial and accounting services.
In terms of resource scheduling, we developed the Gaia scheduling platform based on YARN. YARN only schedules by CPU and memory; Gaia adds network and disk I/O dimensions on top of those. YARN only supports offline computation, whereas Gaia supports both online and offline computation. Furthermore, we support Docker, and our platform now runs more than 150 million containers every day.
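The core idea of multi-dimensional scheduling can be sketched in a few lines: a container request is placed on a node only if it fits in every resource dimension, not just CPU and memory. This is a hypothetical illustration; the class and field names are mine, not Gaia's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Resources:
    cpu_cores: float
    memory_mb: int
    network_mbps: float   # dimensions beyond vanilla YARN's CPU + memory
    disk_iops: float

    def fits_in(self, free: "Resources") -> bool:
        """A request fits only if it fits in every dimension."""
        return (self.cpu_cores <= free.cpu_cores
                and self.memory_mb <= free.memory_mb
                and self.network_mbps <= free.network_mbps
                and self.disk_iops <= free.disk_iops)

def place(request: Resources, nodes: Dict[str, Resources]) -> Optional[str]:
    """Return the first node with headroom in all four dimensions."""
    for name, free in nodes.items():
        if request.fits_in(free):
            return name
    return None  # no node can host this container

nodes = {
    "node-a": Resources(2, 4096, 100, 500),
    "node-b": Resources(8, 16384, 1000, 5000),
}
req = Resources(4, 8192, 200, 1000)
```

A scheduler that checks only CPU and memory would happily co-locate two I/O-heavy containers on one node; adding the extra dimensions is what prevents that interference.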
Going back to the earlier recommendation system example: while building the second generation platform we hit two problems. First, as users and visits grew, the volume of data grew to the point where the system could no longer process it within the required time. Second, user behavior patterns change very fast, which requires us to refresh user profiles with new features much more frequently. Since the platform now collects data in real time, we can also compute user features in real time, forming a stream computing pipeline. Distributed stream computing updates the calculation of each feature in real time, which in turn produces real-time training data for our recommendation systems. In this way, we transformed our offline recommendation system into an online, real-time one. In advertising recommendation, we observed that the faster we refreshed the real-time recommendations, the more clicks they received.
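The real-time feature updates described above can be sketched as O(1) incremental updates over an event stream. The example below uses an exponentially decayed click counter so that recent behavior dominates the feature value; it is a generic illustration with hypothetical names, not the actual pipeline.

```python
import math

class DecayedCounter:
    """Click counter whose value decays with a chosen half-life, so a
    feature reflects recent behavior without recomputation from scratch."""
    def __init__(self, half_life_s: float = 3600.0):
        self.decay = math.log(2) / half_life_s
        self.value = 0.0
        self.last_ts = 0.0

    def update(self, ts: float, amount: float = 1.0) -> None:
        # decay the old value forward to the event timestamp, then add the event
        self.value *= math.exp(-self.decay * (ts - self.last_ts))
        self.value += amount
        self.last_ts = ts

features = {}  # (user, item_category) -> DecayedCounter

def on_click(user: str, category: str, ts: float) -> None:
    """Called once per event from the stream; each update is O(1)."""
    features.setdefault((user, category), DecayedCounter()).update(ts)
```

In a distributed stream framework such as Storm, each worker would own a shard of the feature map, and the updated values would feed the online model as fresh training signals.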
The second generation platform fulfilled most of our core business needs thanks to its real-time capability and scalability. However, as our data kept growing, the system began to hit a bottleneck.
When we trained models on Spark, updating the model often became a network bottleneck: because all model updates flowed through a single point, the framework could not support high-dimensional training. In our applications, the platform still functioned properly with models of around 10 million dimensions, but once the dimensionality grew to 100 million, it could barely run, or could not run at all.
Thus, we wanted to build a system that supports very large clusters and training on models with billions of dimensions. Moreover, the system had to meet the industrial requirements of our Internet applications, fulfill developers' needs for handling big data and big models, and support both data parallelism and model parallelism.
There were two ways to get there: the first was to evolve the second generation platform into a product that could handle large-scale parameter exchange; the second was to create a brand-new high-performance computing framework.
We also looked at products popular in the industry at the time: GraphLab (strong at graph algorithms, with good fault tolerance), Google's DistBelief (not open-sourced at the time), and Petuum from Eric Xing's team at CMU (popular, but more of a laboratory system, with limited usability and stability).
In the end, we decided to develop our own. The first two generations were based on open source, while the third generation comes from our own research. We had already done in-house development while building the second generation platform; for instance, our highly available and reliable message-oriented middleware was developed by ourselves. The experience of developing and battle-testing those projects inside Tencent gave us strong confidence in building our own system.
Therefore, for the third generation we chose to research and develop our own computing framework: a high-performance computing platform named Angel. This framework is also our path into machine learning and deep learning.
Unlike previous computing frameworks, Angel supports training algorithms on data with billions of dimensions. To solve the issues caused by large models, the new platform supports not only data parallelism but also model parallelism. It also supports GPU-based deep learning and unstructured data such as text, audio, and images.
Angel is a framework based on parameter servers, and it runs on our Gaia platform. It supports three synchronization modes: BSP, SSP, and ASP. For network scheduling we use Repeating Crossbow, developed by Dr. Yang's team at the Hong Kong University of Science and Technology: parameter servers preferentially serve the workers that have received less service, a strategy that drastically decreases waiting time as the model becomes large, cutting overall task time by 5~15%.
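The three synchronization modes sit on one spectrum, distinguished only by how far the fastest worker may run ahead of the slowest. A toy sketch of that staleness check (illustrative only, not Angel's code):

```python
class StalenessGate:
    """Decides whether a worker may start its next iteration.

    BSP:  staleness = 0            (wait for every worker each step)
    SSP:  staleness = s > 0        (fastest worker may lead by s steps)
    ASP:  staleness = float('inf') (never wait)
    """
    def __init__(self, n_workers: int, staleness: float):
        self.clock = [0] * n_workers  # completed iterations per worker
        self.staleness = staleness

    def finish_iteration(self, worker: int) -> None:
        self.clock[worker] += 1

    def may_proceed(self, worker: int) -> bool:
        slowest = min(self.clock)
        return self.clock[worker] - slowest <= self.staleness

gate = StalenessGate(n_workers=3, staleness=2)  # SSP with bound s = 2
```

SSP trades a bounded amount of gradient staleness for far less waiting, which is why it tends to win on large clusters where stragglers are common.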
Comparing Angel with platforms such as Petuum and Spark, Angel performs better at the same data dimensionality. Below is a figure comparing the performance of Angel and Petuum, each running SGD on the Netflix data set.
Furthermore, Angel is a great fit for very large-scale training, and we already use it for many internal services at Tencent. Here are two examples. Previously, when we built user profiles with Hadoop and Spark, running a topic model with one thousand topics took a day or more. Now Angel handles more than 20 billion documents, millions of distinct words, and 300 billion tokens, and finishes the computation in one hour. What we used to run on Spark now runs much faster on Angel, and what Spark cannot compute at all, Angel can.
Another example is click-through estimation for videos. At the same data dimensionality, Angel performs 44 times better than Spark. After we increased the dimensionality from 10 million to 100 million, training time shrank from days to half an hour, with improved precision.
Angel is more than a parallel computing platform; it is also an ecosystem. With Angel at its center, we have built a small ecosystem that supports Spark MLlib with training beyond 100 million dimensions, supports more complex graph computing models, and supports deep learning frameworks such as Caffe, TensorFlow, and Torch, allowing these frameworks to run across multiple machines with multiple cores.
To close, let me summarize the three stages through which Tencent's big data platform has evolved: we started with offline computing, passed through real-time computing, and have now entered the machine learning era. We began by following open source, then improved on it, and finally built our own technology. Our development has advanced along three lines: scale, real-time capability, and intelligence.
I would also like to share a piece of news with everybody here: our big data platform will be fully open-sourced.
We plan to open-source Angel and its supporting systems in the first half of 2017. Our platform started from open source and benefited from it throughout its development, so we want to give back to the open source community. We have always tried to open up what we have: we open-sourced the core of the first generation platform as TDW-Hive in 2014, contributed plenty of core code to multiple community projects, and helped train quite a few committers. In the future, we would like to contribute even more to this community.
Original Article from Synced China http://www.jiqizhixin.com/article/2016|Author: Pan wu, Joyce Zhou | Localized by Synced Global Team: Jiaxin Su, Rita Chen