Traffic data analysis shows how readers interact with content, which makes it a valuable tool for anyone writing and publishing articles on the Internet. It is usually article-based, however, and so can only reveal the relative popularity of articles; it does not show which specific part(s) of an article drove the traffic.
SyncedLeg is a tool designed to fill that gap by mining influential keywords from a corpus of articles paired with their traffic data. A team of Synced interns developed the tool during an internal two-day hackathon, naming it after their team “机器之腿” (“Machine’s leg” in Chinese).
With SyncedLeg, users can:
- Identify hot words/topics they may want to include more often based on their occurrence in high traffic articles.
- Identify words/topics they may want to avoid based on low traffic articles.
- Track topic trends based on keywords in articles published during different time periods (monthly, quarterly, yearly, etc.).
- Discover different writing styles from different media/authors based on the extracted keywords.
- And much more!
As input, we used articles published on the Synced WeChat subscription account – “机器之心” (WeChat subscription accounts are among the largest content distribution platforms in China, reaching millions of mobile users). It is also possible to use articles from other sources that contain similar features to perform the analysis. (Users may want to try other tools for pre-processing and word extraction to improve results with articles in languages other than Simplified Chinese.)
Articles from WeChat subscription accounts usually contain the following fields, which users can choose to include or ignore for analysis in SyncedLeg:
- title (Title of the article)：Adopted. One of the primary sources for the analysis.
- link (URL to the article)：Not adopted.
- publishAt (publishing time)：Adopted and customizable. Users can select articles from a particular time period for analysis. (Analyzing articles from all time periods at once may be less useful than restricting the analysis to a single quarter or season. For example, “AlphaGo” was one of the hottest words of 2016 but its popularity has since decreased.)
- readNum (number of reads)：Adopted and customizable. Used as one of the metrics to measure the popularity of an article. Users can decide whether or not to use it for analysis.
- likeNum (number of likes)：Adopted and customizable. Used as one of the metrics to measure the popularity of an article. Users can decide whether or not to use it for analysis.
- msgldx (location of the article when published)：Adopted and customizable. In a WeChat subscription account, articles pushed to subscribers are organized in a message block in which location 1 takes the most space and is more likely to draw clicks. This parameter enables the selection of articles in the same location for analysis to eliminate the effects caused by different message block locations.
- sourceUrl (URL of the source if the article is re-posted)：Not adopted.
- content (article content)：Adopted. One of the primary sources for the analysis.
- cover (link to the article’s cover image)：Not adopted.
- digest (abstract of the article)：Not adopted.
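To make the role of the adopted fields concrete, here is a minimal sketch of selecting articles by publishing period and message-block location before analysis. It is not taken from the SyncedLeg code: the function name `select_articles`, the ISO date format assumed for publishAt, and the sample records are all invented for illustration.

```python
from datetime import date

# Invented sample records; the keys follow the field list above,
# the values are made up for illustration only.
ARTICLES = [
    {"title": "News | AlphaGo retires", "publishAt": "2018-01-15",
     "readNum": 52000, "likeNum": 410, "msgldx": 1, "content": "..."},
    {"title": "Paper | A GAN survey", "publishAt": "2018-02-03",
     "readNum": 8000, "likeNum": 95, "msgldx": 2, "content": "..."},
    {"title": "News | New chip announced", "publishAt": "2018-05-20",
     "readNum": 30000, "likeNum": 220, "msgldx": 1, "content": "..."},
]


def select_articles(articles, start, end, position=None):
    """Keep articles published in [start, end], optionally at one block location."""
    selected = []
    for article in articles:
        # Assumption: publishAt is stored as an ISO date string (YYYY-MM-DD).
        published = date.fromisoformat(article["publishAt"])
        if not (start <= published <= end):
            continue
        if position is not None and article["msgldx"] != position:
            continue
        selected.append(article)
    return selected


# First-quarter 2018 articles that were pushed in location 1.
q1_headlines = select_articles(ARTICLES, date(2018, 1, 1), date(2018, 3, 31), position=1)
print([a["title"] for a in q1_headlines])
```

Filtering on location first means articles are only compared with others that had the same visibility in the message block.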
Owners of the WeChat subscription account being analyzed may also have the option to add fields to help with analysis:
- Number of current followers：The number of followers changes continuously, and with it the base of users who receive a push for each newly published article. The more subscribers, the more reads and likes an article can get. To minimize this effect, we calculated relative article popularity as the popularity score divided by the number of followers at the time.
- Category of the article：Prolific writers may have many articles in different categories that should not be measured on the same scale. For example, a breaking news article will likely be more popular than an academic analysis article. This field can be used to filter particular categories of articles for better analysis. (In Synced’s case, as our article categories are often marked with an “|” in the title, the script is written to detect the category by looking at the part before this “|”.)
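The two optional fields above can be sketched in a few lines. This is an illustration under stated assumptions rather than the script's actual code: the helper names `split_category` and `relative_popularity` are hypothetical, and the sample title and numbers are invented.

```python
def split_category(title, separator="|"):
    """Return (category, rest of title); category is None if no separator is found."""
    if separator not in title:
        return None, title.strip()
    # Only the first separator marks the category; later ones stay in the title.
    category, rest = title.split(separator, 1)
    return category.strip(), rest.strip()


def relative_popularity(popularity, followers):
    """Popularity score divided by the follower count when the article was published."""
    if followers <= 0:
        raise ValueError("followers must be a positive number")
    return popularity / followers


print(split_category("News | AlphaGo retires"))
print(relative_popularity(500, 100_000))
```

Dividing by the follower count lets an article published to a small audience be compared with one published after the account had grown.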
For influential word analysis, users can change the following parameters to fit their needs:
- ‘like_weight’: Users can pick a number between 0 and 1 (0-100%) to decide how much the number of likes contributes to an article’s popularity score: “1” uses only the number of likes while “0” uses only the number of reads.
- ‘title_weight’: Users can pick a number between 0 and 1 to decide how much weight will be given to the words extracted from the title. “0” uses only words extracted from the content (not recommended), and “1” uses only words extracted from the title (recommended when precision is prioritized over recall).
- ‘cut_method’: Two options are available to calculate the weight of each word within an article: TextRank and tf-idf.
- ‘hot_method’: Three options are available to compile the popularity score of a word when it occurs in multiple articles: average, sum, and median.
- ‘normalize_rd_lk’ and ‘normalize_title_content’: Users have the option to normalize the number of reads/likes and the title/content weights of an article. Normalization is provided because read and like counts are on very different scales (reads can exceed 100,000, displayed as “10w+”, while likes top out at a few thousand). Without it, likes may contribute little to popularity even when their weight is set very high.
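The sketch below shows one way these parameters could interact. It is a hedged illustration, not the SyncedLeg implementation: min-max scaling is an assumed form of normalization, multiplying a word's weight by the article's popularity is an assumed way to combine them, the three hot_method options are read as mean, sum and median, and the keys "sum" and "med" as well as all function names are hypothetical.

```python
from statistics import mean, median


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def article_popularity(reads, likes, like_weight=0.5, normalize=True):
    """Blend reads and likes into one popularity score per article."""
    if not 0 <= like_weight <= 1:
        raise ValueError("like_weight must be between 0 and 1")
    if normalize:
        # Put both metrics on the same scale before weighting them.
        reads, likes = min_max(reads), min_max(likes)
    return [like_weight * l + (1 - like_weight) * r for r, l in zip(reads, likes)]


# Assumed mapping of hot_method names to aggregation functions.
AGGREGATORS = {"avg": mean, "sum": sum, "med": median}


def word_hotness(word_weights, popularity, hot_method="avg"):
    """word_weights holds one {word: weight} dict per article."""
    aggregate = AGGREGATORS[hot_method]
    contributions = {}
    for weights, score in zip(word_weights, popularity):
        for word, weight in weights.items():
            # Assumption: a word's contribution is its weight times the article score.
            contributions.setdefault(word, []).append(weight * score)
    return {word: aggregate(values) for word, values in contributions.items()}


reads = [100_000, 20_000, 60_000]   # invented numbers
likes = [1_000, 200, 600]
scores = article_popularity(reads, likes, like_weight=0.8)
weights = [{"AlphaGo": 1.0, "GAN": 0.5}, {"GAN": 1.0}, {"AlphaGo": 0.5}]
print(word_hotness(weights, scores, hot_method="avg"))
```

With `normalize=False` and the same inputs, the first article scores 0.8 × 1,000 + 0.2 × 100,000 = 20,800, of which only 800 comes from likes despite the 0.8 like weight. This is the imbalance that normalization is meant to remove.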
Before running the script for analysis, users need to define paths for the input, output and all relevant parameters. The script can then be easily run via:
Enjoy the result
The SyncedLeg model will write the output directly to an Excel document for ease of review and analysis. The output document will be named based on the parameters selected (e.g. “2018_avg_tfidf_lw0.8_tw1_norm-rl_nofoll_norm-tc.xls” means that users chose average as the hot_method, tfidf as the cut_method, a like weight of 0.8, title weight 1, with normalization enabled and followers disabled).
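As an illustration of the naming scheme, the following sketch rebuilds the example file name from its parameters. Only the tokens visible in that example (“avg”, “tfidf”, “lw”, “tw”, “norm-rl”, “nofoll”, “norm-tc”) come from the article; the function name `output_filename` and the tokens used for the opposite settings (“raw-rl”, “foll”, “raw-tc”) are guesses.

```python
def output_filename(period, hot_method, cut_method, like_weight, title_weight,
                    normalize_rd_lk, use_followers, normalize_title_content):
    """Compose an output name from the selected parameters (hypothetical helper)."""
    parts = [
        str(period),
        hot_method,
        cut_method,
        f"lw{like_weight:g}",    # :g drops trailing zeros, so 1.0 becomes "1"
        f"tw{title_weight:g}",
        "norm-rl" if normalize_rd_lk else "raw-rl",           # "raw-rl" is a guess
        "foll" if use_followers else "nofoll",                # "foll" is a guess
        "norm-tc" if normalize_title_content else "raw-tc",   # "raw-tc" is a guess
    ]
    return "_".join(parts) + ".xls"


name = output_filename("2018", "avg", "tfidf", 0.8, 1,
                       normalize_rd_lk=True, use_followers=False,
                       normalize_title_content=True)
print(name)
```

Encoding the parameters in the file name makes it easy to keep the results of several runs side by side and compare them.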
The output will not only contain the ranked influential keywords mined from the provided corpus, but also the calculated popularity score (higher means more influence), corresponding titles of the source articles, and keyword tags.
The tags are based on the dictionaries in the dict folder. There are two NER (Named Entity Recognition) based dictionaries for names of people and organizations, so that output data can be filtered with different tags based on user requirements. Users may also add their own dictionaries to improve tagging variety and accuracy.
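Dictionary-based tagging of this kind can be sketched as a simple lookup. The example below is an assumption-laden illustration: the dictionaries are held in memory rather than read from the dict folder (whose file format is not described here), the tag names and entries are invented, and `tag_keywords` and `filter_by_tag` are hypothetical helpers.

```python
# Invented dictionaries: tag name -> set of known words for that tag.
DICTIONARIES = {
    "person": {"Yann LeCun", "Geoffrey Hinton"},
    "organization": {"DeepMind", "Tencent"},
}


def tag_keywords(keywords, dictionaries):
    """Map each keyword to the sorted list of tags whose dictionary contains it."""
    return {
        keyword: sorted(tag for tag, words in dictionaries.items() if keyword in words)
        for keyword in keywords
    }


def filter_by_tag(tagged, tag):
    """Return the keywords that carry the given tag, in their original order."""
    return [keyword for keyword, tags in tagged.items() if tag in tags]


tagged = tag_keywords(["DeepMind", "AlphaGo", "Yann LeCun"], DICTIONARIES)
print(tagged)
print(filter_by_tag(tagged, "organization"))
```

Adding a custom dictionary then amounts to adding one more entry to the mapping, for example a set of product names under a new tag.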
Judging from the test results, there is still room for improvement in the toolkit. For example, although SyncedLeg currently uses jieba as its default NER tool, we have recently found that some learning-based algorithms such as BiLSTM-CRF can substantially improve recall and the overall performance of both word extraction and word tagging. Word embedding is another tweak worth trying, as there are many promising pre-trained models now available, such as the Tencent AI Lab Embedding Corpus for Chinese Words and Phrases.
SyncedLeg is open-sourced on GitHub, with toolkit details and related algorithms in the ReadMe file.
SyncedLeg is produced by the Synced Data Intelligence Lab. Data Analyst Interns: Chao Wen, VXenomac, JJ Weng; Data Analyst: Mos Zhang; Producer: Chain Zhang. Any suggestions or contributions are very welcome.
Author: Mos Zhang | Editor: Michael Sarazen