In response to the COVID-19 pandemic, Andrej Karpathy — director of artificial intelligence and Autopilot Vision at Tesla and developer of the arXiv sanity preserver web interface — has introduced Covid-Sanity, a web interface designed to navigate the flood of bioRxiv and medRxiv COVID-19 papers and make the research within more searchable and sortable.
BioRxiv is a free online archive and distribution service for unpublished life science preprints. Scholars can upload their papers before submitting them to journals, and get feedback within 24 hours. The medRxiv server does the same for unpublished health sciences manuscripts. Because these platforms skip the time-consuming peer review process, other scientists can read new papers immediately, which is valuable when research is urgent.
Covid-Sanity organizes COVID-19-related papers with a “most similar” search that uses an exemplar SVM trained on TF-IDF feature vectors from the paper abstracts. This resembles how a search engine such as Google works: it scores every text for relevance to the query, ranks the texts by similarity score, and returns the top-k results. Given a query, the web interface finds the best-matched paper and then returns the papers whose abstracts are most similar to it.
TF-IDF (Term Frequency-Inverse Document Frequency) is a method for identifying and weighting the important words in a document relative to a corpus. For example, given the sentence “She is beautiful”, humans immediately understand the semantics. But how might a computer understand it? Computers work only with numerical values, so the problem becomes: how can such sentences be translated into vectors? The Covid-Sanity project similarly needed a way to represent the important words in paper abstracts. It does so with TF-IDF, which has a concise formula: TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF).
Term Frequency (TF) measures how often each word occurs in an abstract. This alone is not very useful, as meaning-poor words like “is” and “the” will have high values without contributing much to the key features of the abstract. Such words can of course be designated as “stop-words” and removed in the data pre-processing stage, but it is better to have the more sophisticated TF-IDF method determine automatically which words are and are not important. This is what the Inverse Document Frequency (IDF) does: it lowers the score of such “stop-words”. It is defined as IDF(t) = log(N/(df + 1)), where N is the total number of documents and df is the number of documents that contain the target term t. The value of IDF is therefore very low for words that occur in most documents. Finally, by multiplying TF and IDF, we can extract the top-k most significant words from each abstract.
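The weighting described above can be sketched in a few lines of plain Python. This is only an illustration of the formula IDF(t) = log(N/(df + 1)) as stated here, not the project's actual code: the function names `tokenize`, `tf_idf`, and `top_k_terms`, the simple regular-expression tokenizer, and the toy abstracts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length;
    IDF(t) = log(N / (df + 1)), with df the number of documents containing t.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term at most once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights


def top_k_terms(doc_weights, k=3):
    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy abstracts (made up for illustration).
abstracts = [
    "The virus infects the lungs.",
    "The vaccine protects the lungs and the heart.",
    "The model predicts the spread of the virus.",
    "The trial tests the vaccine.",
]
weights = tf_idf(abstracts)
print(top_k_terms(weights[0], k=3))  # ['infects', 'lungs', 'virus']
```

Note that with this particular formula a word appearing in every document, such as “the” here, receives a slightly negative weight, so it falls to the bottom of the ranking without an explicit stop-word list. Library implementations often use a smoothed variant of IDF, so their exact numbers will differ.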
The next step is to use an exemplar SVM to find the top-k most similar papers. Unlike a conventional SVM, which finds the line, plane, or hyperplane that best separates many positive samples from many negative samples, an exemplar SVM trains a separate classifier for each exemplar, using that single exemplar as the only positive sample and an enormous collection of other samples as negatives. This is simpler and can lead to higher precision. Here, the positive sample is the TF-IDF feature vector of the target abstract, and candidate papers are ranked by the score the trained classifier assigns to them.
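To make the idea concrete, here is a minimal sketch of an exemplar SVM in plain Python: a linear SVM with hinge loss, trained by full-batch subgradient descent, with one document as the only positive and all others as negatives. It is a sketch under stated assumptions rather than the project's implementation. The function names, the balanced class weighting, the values of `c`, `epochs`, and `lr`, and the toy feature vectors standing in for TF-IDF vectors are all hypothetical; in practice one would more likely use a library solver such as scikit-learn's `LinearSVC`.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_exemplar_svm(vectors, exemplar_idx, c=0.1, epochs=200, lr=0.1):
    """Fit a linear SVM with a single positive (the exemplar); all other rows are negatives.

    Minimizes 0.5 * ||w||^2 + c * sum_i weight_i * max(0, 1 - y_i * (w.x_i + b))
    by full-batch subgradient descent. Requires at least two vectors.
    """
    n, dim = len(vectors), len(vectors[0])
    # Assumption: balanced class weights, so the lone positive counts
    # as much as all negatives combined.
    pos_weight = n / 2.0
    neg_weight = n / (2.0 * (n - 1))

    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer is w itself
        for i, x in enumerate(vectors):
            y = 1.0 if i == exemplar_idx else -1.0
            if y * (dot(w, x) + b) < 1.0:  # sample violates the margin
                scale = c * (pos_weight if y > 0 else neg_weight) * y
                grad_w = [g - scale * xj for g, xj in zip(grad_w, x)]
                grad_b -= scale
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_by_similarity(vectors, exemplar_idx, k=None):
    # Score every document with the exemplar's classifier; highest score first.
    w, b = train_exemplar_svm(vectors, exemplar_idx)
    scores = [dot(w, x) + b for x in vectors]
    order = sorted(range(len(vectors)), key=lambda i: -scores[i])
    return order[:k]


# Toy feature vectors standing in for TF-IDF vectors (made up for illustration).
docs = [
    [1.0, 0.0, 0.0],  # 0: the target abstract (exemplar)
    [0.8, 0.6, 0.0],  # 1: shares most of its vocabulary with the exemplar
    [0.0, 1.0, 0.0],  # 2: unrelated
    [0.0, 0.0, 1.0],  # 3: unrelated
]
print(rank_by_similarity(docs, exemplar_idx=0, k=2))  # [0, 1]
```

The exemplar itself always appears in the ranking, so a real interface would typically drop it before showing results. Because a new classifier is trained for each target paper, the ranking reflects which features distinguish that paper from the rest of the collection, not just raw overlap.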
User feedback thus far suggests the Covid-Sanity recommendation system is useful, effective, and informative. The GitHub repository garnered over 100 stars in just one day.
The Covid-Sanity project is on GitHub.
Author: Hecate He | Editor: Michael Sarazen