One of the challenges for most projects is to find a good dataset for the given goal. This is why I think that there is tremendous value in creating basic tools that enable the acquisition of data. When it comes to gaining more knowledge about medical conditions like Covid-19, resources like PubMed provide a valuable source of information that can be mined with NLP.
What it does
This script takes in a search term as well as an integer for the desired number of results. Using these two arguments, the script returns and saves a comma-separated table with the columns: title, author names, journal, date, DOI, and abstract.
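A minimal sketch of how those two inputs might be taken on the command line; the argument names (`term`, `retmax`) are my own assumption, not necessarily what the script uses:

```python
import argparse

def build_parser():
    """Parser for the two inputs: a search term and a result count."""
    parser = argparse.ArgumentParser(
        description="Scrape PubMed search results into a CSV (sketch).")
    parser.add_argument("term", help="PubMed search term, e.g. 'covid-19'")
    parser.add_argument("retmax", type=int,
                        help="desired number of results")
    return parser

# Example invocation: python scraper.py covid-19 50
args = build_parser().parse_args(["covid-19", "50"])
print(args.term, args.retmax)
```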
How I built it
Using the two input arguments, I send a request to PubMed. For each article in the response, I carefully extract and compose the relevant information in a loop. The final data is put into a pandas DataFrame before it gets saved as a CSV.
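The extract-and-compose loop can be sketched like this. The sample XML is a tiny, hypothetical stand-in for a PubMed response, and the exact tags the script pulls are my assumption; only the overall shape (loop over articles, build a DataFrame, save a CSV) comes from the description above:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for one article of a PubMed XML response.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <ArticleTitle>A study of Covid-19</ArticleTitle>
    <Title>Example Journal</Title>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
    <PubDate><Year>2020</Year></PubDate>
    <ELocationID EIdType="doi">10.1000/example</ELocationID>
    <AbstractText>Example abstract.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>
"""

def parse_articles(xml_text):
    """Loop over each article and compose one row per article."""
    # html.parser lowercases tag names, so we match lowercase below.
    soup = BeautifulSoup(xml_text, "html.parser")
    rows = []
    for article in soup.find_all("pubmedarticle"):
        authors = ", ".join(
            f"{a.forename.text} {a.lastname.text}"
            for a in article.find_all("author"))
        rows.append({
            "title": article.find("articletitle").text,
            "authors": authors,
            "journal": article.find("title").text,
            "date": article.find("year").text,
            "doi": article.find("elocationid").text,
            "abstract": article.find("abstracttext").text,
        })
    return pd.DataFrame(rows)

df = parse_articles(SAMPLE_XML)
df.to_csv("pubmed_results.csv", index=False)  # final step: save as CSV
```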
Challenges I ran into
The request returns a lot of different information, and it is tedious to make sure the relevant parts are extracted. Furthermore, legacy Python doesn't verify HTTPS certificates by default, and a workaround was needed.
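One common workaround for certificate issues, and an assumption on my part since the exact fix isn't shown, is to build an unverified SSL context. Note that this disables certificate checks entirely, so it only belongs in throwaway scripts:

```python
import ssl

# Hypothetical workaround: an SSL context that skips certificate
# verification, for environments where the default verification fails.
unverified_ctx = ssl._create_unverified_context()

# The context no longer checks hostnames or verifies certificates.
print(unverified_ctx.check_hostname, unverified_ctx.verify_mode)
# With the `requests` library, the rough equivalent is passing
# verify=False to requests.get().
```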
Accomplishments that I'm proud of
It is very satisfying to build a simple yet useful tool and provide it to others.
What I learned
I learned about the importance of properly using libraries like Beautiful Soup.
What's next for PubMed Web Scraper
Extend it with more functionality and maybe enable other websites.
Try it out
beautiful-soup, pandas, python, requests