As a researcher, I've always had a hard time finding similar articles to compare information. Say I built a machine translation model and want to know how other models perform. Typically we search online, find a website that lists state-of-the-art models with links to the papers proposing them, follow the links, and manually locate the section describing each model's details. This process is educational but time-consuming, especially when we need information quickly.
What it does
COVID-19 Scholarly-article Network (CSN) Searcher is designed specifically to search for COVID-19 related articles. It differs from other services in that it searches for similar articles based on section similarity. In the network, each article is a core node and the sections of that article orbit it like moons. CSN Searcher finds articles by comparing sections instead of going through whole documents. This is because a full research article (e.g. more than 1000 words) is too long for deep learning models to encode well, whereas many models have been shown to work well on section-length text (e.g. 200 words). CSN Searcher uses a Siamese RNN model to encode sections specifically for similarity matching.
How to run
The following website shows how to run the program. https://github.com/box-key/csn-searcher#how-to-run
How I built it
So far, CSN Searcher provides command-line tools for demo purposes. It's built with Python and PyTorch, following an object-oriented design. It loads the network into memory every time a user queries. I used the CORD-19 dataset provided by the Allen Institute for AI to build the network. For training, I used the SICK dataset.
An example search looks like this:
csn-search --input-file input.txt --num-search 3
Here, input.txt contains the body of the input section.
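For readers curious how such a command might be parsed, here is a minimal argparse sketch. The flag names --input-file and --num-search come from the example above; the names build_parser and main, the default of 3, and the help strings are assumptions for illustration, not the project's actual code.

```python
import argparse


def build_parser():
    """Build a parser for the two flags shown in the example (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="csn-search")
    parser.add_argument(
        "--input-file",
        required=True,
        help="path to a text file containing the body of the input section",
    )
    parser.add_argument(
        "--num-search",
        type=int,
        default=3,  # assumed default; the real tool may differ
        help="number of similar sections to return",
    )
    return parser


def main(argv=None):
    """Read the query section from disk and report what would be searched."""
    args = build_parser().parse_args(argv)
    with open(args.input_file, encoding="utf-8") as f:
        query_text = f.read()
    print(f"Searching for the {args.num_search} most similar sections "
          f"({len(query_text.split())} words in the query)")
    return args
```

Calling main(["--input-file", "input.txt", "--num-search", "3"]) mirrors the command line above; argparse converts the dashes in flag names to underscores, so the values are read as args.input_file and args.num_search.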
It employs a Siamese (Manhattan) LSTM model to compute document similarity. Although the model still needs to be fine-tuned for COVID-19 articles, it achieves a mean squared error (MSE) of 0.58 on the SICK dataset with a corpus of 4500 examples. For comparison, Mueller and Thyagarajan (2016), who proposed the architecture, reported an MSE of 0.2286 on the same dataset.
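The similarity score in the Manhattan LSTM architecture is the exponential of the negative L1 distance between the two encodings, so it lies between 0 and 1. The sketch below illustrates that score and a simple ranking over stored sections. It uses plain Python lists in place of the LSTM outputs (which would be PyTorch tensors in the project), and the function names and section ids are hypothetical.

```python
import heapq
import math


def manhattan_similarity(h_a, h_b):
    """exp(-L1 distance) between two equal-length encodings; result lies in (0, 1]."""
    if len(h_a) != len(h_b):
        raise ValueError("encodings must have the same length")
    l1 = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-l1)


def top_k_sections(query_vec, section_vecs, k=3):
    """Score every stored section against the query and return the k best.

    section_vecs maps a section id to its encoding. This is a linear scan,
    so the cost grows with the number of sections in the network.
    """
    scored = ((manhattan_similarity(query_vec, vec), sec_id)
              for sec_id, vec in section_vecs.items())
    return heapq.nlargest(k, scored)


# Toy encodings standing in for LSTM outputs (made-up values).
sections = {
    "paper1/intro": [0.1, 0.2],
    "paper1/methods": [0.9, 0.8],
    "paper2/results": [0.1, 0.25],
}
best = top_k_sections([0.1, 0.2], sections, k=2)
```

In this toy example the identical encoding scores exactly 1.0 and ranks first, followed by the nearest neighbour in L1 distance.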
What's next for CSN Searcher
There are three points I would like to work on to productionize CSN searcher:
Suboptimal Search: CSN Searcher computes the exponential of the negative Manhattan distance (i.e. L1 norm) between the input and every section in the network. Thus, the computation time increases linearly as the network expands. Since we'll keep adding articles to the network, we need a sub-optimal (approximate) search in practice. I'm open to discussing alternatives, but my idea is to utilize the citations in papers. This way, we can build a graph where each node is a section and edges represent citations between papers. For example, the algorithm finds the papers cited in the input section, finds the most similar sections in those papers, and then searches the papers cited by the most similar articles. Intuitively, this process resembles how researchers usually search for information.
Database Integration and User Interface: Since CSN Searcher loads the network and the model (about 1GB so far) into memory every time a user queries, the search is delayed. Thus, we need to store them in a database that users can query. It also needs a better user interface, such as an HTML page.
Improve RNN Model: Yang et al. (2016) showed that the Hierarchical Attention model works considerably better for longer documents. My idea is to build a Siamese Hierarchical Attention Network model. The code is almost ready, but we need proper datasets to train the model.
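To make the citation-based search idea from the first point more concrete, here is a rough sketch under several assumptions: encodings are plain lists of floats, the citation graph is given as dictionaries, and the search expands hop by hop through the papers that hold the best-scoring sections. All names and the max_hops limit are hypothetical; this is not the code mentioned in this write-up.

```python
import math


def similarity(h_a, h_b):
    """exp(-L1 distance), the same score used for the exhaustive search."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h_a, h_b)))


def citation_guided_search(query_vec, cited_papers, paper_sections,
                           section_vecs, citations, k=3, max_hops=2):
    """Follow citations outward from the query instead of scanning everything.

    cited_papers:   paper ids cited by the input section
    paper_sections: paper id -> list of section ids
    section_vecs:   section id -> encoding
    citations:      paper id -> list of paper ids it cites
    """
    visited = set()
    frontier = list(cited_papers)
    scored = []  # (similarity, section id, paper id)
    for _ in range(max_hops):
        hop_scores = []
        for paper in frontier:
            if paper in visited:
                continue  # never score the same paper twice
            visited.add(paper)
            for sec_id in paper_sections.get(paper, []):
                score = similarity(query_vec, section_vecs[sec_id])
                hop_scores.append((score, sec_id, paper))
        scored.extend(hop_scores)
        # Expand only through the papers holding this hop's k best sections.
        frontier = []
        for _, _, paper in sorted(hop_scores, reverse=True)[:k]:
            frontier.extend(citations.get(paper, []))
        if not frontier:
            break
    return [(score, sec_id) for score, sec_id, _ in sorted(scored, reverse=True)[:k]]
```

The trade-off is that only sections reachable through citations within max_hops are ever scored, so a relevant but uncited article can be missed. That is the sense in which the search is sub-optimal: it gives up completeness in exchange for not touching every section in the network.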
Challenges I ran into
It was hard to manage the project by myself in such a short period of time. Although the code for the Hierarchical Attention model and the sub-optimal search is nearly ready, I couldn't include it in my submission because I ran into errors that I couldn't fix by the deadline. I also strongly felt the importance of teamwork. I learned that I lack some of the skills needed to deploy applications in production settings (e.g. database integration, UI development, etc.). If CSN Searcher interests you, I'd be happy to work with you to realize this idea, which could potentially benefit many COVID-19 researchers.
Built With
argparse, python, pytorch, scispacy