
Word Influence Analysis Tool ‘SyncedLeg’ Open-Sourced

Traffic data analysis gives publishers insight into how readers interact with their content, making it a valuable tool for anyone writing and publishing articles on the Internet. However, it is usually article-based: it can reveal the relative popularity of articles, but not which specific part(s) of an article contributed to the traffic.

SyncedLeg is a tool designed to close that gap by mining influential keywords from a corpus of articles paired with their traffic data. A team of Synced interns developed the tool during an internal two-day hackathon, naming it after their team, “机器之腿” (Chinese for “Machine’s Leg”).

With SyncedLeg, users can:

Feeding SyncedLeg

As input, we used articles published on the Synced WeChat subscription account, “机器之心”. (WeChat subscription accounts are among the largest content distribution platforms in China, reaching millions of mobile users.) Articles from other sources can also be analyzed, provided they contain similar fields. For articles in languages other than Simplified Chinese, users may want to try other tools for pre-processing and word extraction to improve results.

Articles from WeChat subscription accounts usually contain the following fields, which users can choose to include or ignore for analysis in SyncedLeg:

Sample message block pushed to subscribers on a WeChat subscription account
Choices available for article selection and analysis

Owners of the WeChat subscription account being analyzed may also have the option to add fields to help with analysis:

Categorizing Synced articles marked with “|”

Tuning SyncedLeg

For influential word analysis, users can change the following parameters to fit their needs:

Formula to calculate the popularity of an article
Formula to calculate the popularity score of a word
Formula for the normalization of reads and likes in which x is the reads/likes of an article
Parameters that can be tuned for different analytical purposes
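
The exact formulas are the ones given in the figures and in the project's README. Purely as an illustration of how parameters such as the like weight, title weight and hot_method might interact, the sketch below makes several assumptions that are not taken from SyncedLeg: min-max normalization of reads and likes, article popularity as normalized reads plus weighted normalized likes, and a word score that averages the popularity of the articles containing the word. All function names and the input layout are hypothetical.

```python
from collections import defaultdict


def min_max(values):
    """Scale values to [0, 1]. ASSUMPTION: stand-in for the tool's own normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def article_popularity(reads, likes, like_weight=0.8):
    """ASSUMED form: normalized reads + like_weight * normalized likes."""
    norm_reads = min_max(reads)
    norm_likes = min_max(likes)
    return [r + like_weight * l for r, l in zip(norm_reads, norm_likes)]


def word_scores(articles, like_weight=0.8, title_weight=1.0, hot_method="avg"):
    """Score each word from the popularity of the articles it appears in.

    articles: list of dicts with hypothetical keys
    'title_words', 'body_words', 'reads' and 'likes'.
    """
    pops = article_popularity(
        [a["reads"] for a in articles],
        [a["likes"] for a in articles],
        like_weight,
    )
    totals = defaultdict(float)
    counts = defaultdict(int)
    for article, pop in zip(articles, pops):
        title_words = set(article["title_words"])
        for word in title_words | set(article["body_words"]):
            # Words that appear in the title are scaled by title_weight.
            weight = title_weight if word in title_words else 1.0
            totals[word] += weight * pop
            counts[word] += 1
    if hot_method == "avg":
        return {w: totals[w] / counts[w] for w in totals}
    return dict(totals)  # any other value: plain sum


sample = [
    {"title_words": ["AI"], "body_words": ["AI", "robot"], "reads": 100, "likes": 10},
    {"title_words": ["chip"], "body_words": ["robot"], "reads": 300, "likes": 30},
]
scores = word_scores(sample)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 3))
```

Averaging keeps a word that appears in many mediocre articles from outranking one that appears in a few very popular ones; a sum would reward frequency instead.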

Running SyncedLeg

Paths that need to be defined before analysis

Before running the script for analysis, users need to define the input and output paths and set all relevant parameters. The script can then be run with:

python Analyser.py

Enjoy the result

SyncedLeg writes its output directly to an Excel document for ease of review and analysis. The output file is named after the selected parameters. For example, “2018_avg_tfidf_lw0.8_tw1_norm-rl_nofoll_norm-tc.xls” means that the user chose average as the hot_method and tfidf as the cut_method, with a like weight of 0.8 and a title weight of 1, normalization enabled, and followers disabled.
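
The naming scheme can also be read back mechanically, which helps when comparing several runs. The sketch below is not part of SyncedLeg. It interprets only the tokens explained in the example above, assumes the first three tokens are positional (a prefix, then hot_method, then cut_method), assumes "norm-rl" refers to normalization of reads and likes, and keeps any token the article does not explain in a separate list. The function name and dictionary keys are hypothetical.

```python
def describe_output_name(filename):
    """Map an output file name to the settings the article describes."""
    stem = filename[:-4] if filename.endswith(".xls") else filename
    tokens = stem.split("_")
    settings = {
        "prefix": tokens[0],        # e.g. "2018"; its meaning is not stated in the article
        "hot_method": tokens[1],    # ASSUMPTION: second token is the hot_method
        "cut_method": tokens[2],    # ASSUMPTION: third token is the cut_method
        "unexplained": [],
    }
    for token in tokens[3:]:
        if token.startswith("lw"):
            settings["like_weight"] = float(token[2:])
        elif token.startswith("tw"):
            settings["title_weight"] = float(token[2:])
        elif token == "norm-rl":
            settings["normalize_reads_likes"] = True
        elif token == "nofoll":
            settings["use_followers"] = False
        else:
            # Tokens the article does not explain are kept as-is.
            settings["unexplained"].append(token)
    return settings


info = describe_output_name("2018_avg_tfidf_lw0.8_tw1_norm-rl_nofoll_norm-tc.xls")
print(info)
```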

The output contains the ranked influential keywords mined from the provided corpus, together with each keyword's calculated popularity score (higher means more influence), the titles of the corresponding source articles, and keyword tags.

The tags are based on the dictionaries in the dict folder. There are two NER (Named Entity Recognition) based dictionaries for names of people and organizations, so that output data can be filtered with different tags based on user requirements. Users may also add their own dictionaries to improve tagging variety and accuracy.
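
Dictionary-based tagging of this kind can be sketched in a few lines. The example below is an illustration only: it holds the dictionaries in memory rather than reading them from the dict folder (whose file format is not described here), and the tag names, entries and function names are all hypothetical.

```python
def tag_keywords(keywords, dictionaries):
    """Attach every tag whose dictionary contains the keyword."""
    tagged = {}
    for word in keywords:
        tagged[word] = sorted(
            tag for tag, entries in dictionaries.items() if word in entries
        )
    return tagged


def filter_by_tag(tagged, tag):
    """Keep only the keywords carrying the given tag, in their original order."""
    return [word for word, tags in tagged.items() if tag in tags]


# Hypothetical dictionaries; a user-supplied one would simply add another key.
dictionaries = {
    "person": {"Mos Zhang"},
    "organization": {"Synced", "Tencent AI Lab"},
}
tagged = tag_keywords(["Synced", "Mos Zhang", "robot"], dictionaries)
print(filter_by_tag(tagged, "organization"))
```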

Top 10 output keywords with tf-idf using sample data
Top 10 output keywords with TextRank using sample data

Judging from the test results, there is still room for improvement in the toolkit. For example, although SyncedLeg currently uses jieba as its default NER tool, we have recently found that some learning-based approaches such as BiLSTM-CRF can substantially improve recall and the overall performance of both word extraction and word tagging. Word embeddings are another improvement worth trying, as many promising pre-trained models are now available, such as the Tencent AI Lab Embedding Corpus for Chinese Words and Phrases.

SyncedLeg is open-sourced on GitHub, with toolkit details and related algorithms in the README file.

SyncedLeg is produced by the Synced Data Intelligence Lab: Data Analyst Interns Chao Wen, VXenomac and JJ Weng; Data Analyst Mos Zhang; Producer Chain Zhang. Suggestions and contributions are very welcome.


Author: Mos Zhang | Editor: Michael Sarazen
