Analyze Scientific Publications with E-utilities and Python | by Jozsef Meszaros | May 2023
To query an NCBI database effectively, you'll need to learn about certain E-utilities, define your search fields, and choose your search parameters — which control the way results are returned to your browser or, in our case, to Python, which we'll use to query the databases.
The four most useful E-utilities
There are nine E-utilities available from NCBI, all implemented as server-side fast CGI programs. This means you'll access them by creating URLs that end in .cgi and specify query parameters after a question mark, with parameters separated by ampersands. All of them, apart from EFetch, will give you either XML or JSON output.
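For example, a query URL has the following shape (this particular URL is only illustrative of the structure; working queries are built step by step later in the article):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=myoglobin&retmode=json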
ESearch
generates a list of ID numbers that meet your search query
The following E-utilities can be used with one or more ID numbers:
ESummary
journal, author list, grants, dates, references, publication type
EFetch
**XML ONLY** everything ESummary provides, plus the abstract, the list of grants used in the research, the institutions of the authors, and the MeSH keywords
ELink
provides a list of links to related citations using a computed similarity score, as well as a link to the published item [your gateway to the full text of the article]
NCBI hosts 38 databases across its servers, covering a variety of data that goes beyond literature citations. To get a complete list of current databases, you can use EInfo without search terms:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
Each database varies in how it can be accessed and the information it returns. For our purposes, we'll focus on the pubmed and pmc databases, because these are where scientific literature is searched and retrieved.
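As a minimal sketch of what that looks like from Python, here is one way to pull the EInfo output and list the available database names (this assumes each database appears under a DbName element, which is how EInfo structures its XML):

import urllib.request
from bs4 import BeautifulSoup

# Request the list of all NCBI databases from EInfo (no search terms)
einfo_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
einfo_xml = urllib.request.urlopen(einfo_url).read().decode('utf-8')
einfo_bs = BeautifulSoup(einfo_xml, features="xml")

# Each available database is listed in a DbName element
db_names = [db.text for db in einfo_bs.find_all('DbName')]
print(len(db_names), db_names[:5])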
The two most important things to know about searching NCBI are search fields and outputs. The search fields are numerous and depend on the database. The outputs are more straightforward, and learning how to use them is essential, especially for doing large searches.
Search fields
You won't be able to truly harness the potential of the E-utilities without knowing about the available search fields. You can find a full list of these search fields on the NLM website, along with a description of each, but for the most accurate list of search terms specific to a database, you'll want to parse your own XML list using this link:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
with the db flag set to the database (we will use pubmed for this article, though literature is also accessible through pmc).
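A minimal sketch of parsing that field list in Python might look like the following (the Field, Name, and Description tags assumed here reflect the structure of the EInfo XML for a specific database):

import urllib.request
from bs4 import BeautifulSoup

# Request the searchable fields for the pubmed database
fields_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed'
fields_xml = urllib.request.urlopen(fields_url).read().decode('utf-8')
fields_bs = BeautifulSoup(fields_xml, features="xml")

# Each searchable field is described by a Field element with a Name and a Description
for field in fields_bs.find_all('Field'):
    print(field.find('Name').text, '-', field.find('Description').text)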
One especially useful search field is Medical Subject Headings (MeSH).[3] Indexers, who are experts in the field, maintain the PubMed database and use MeSH terms to reflect the subject matter of journal articles as they are published. Each indexed publication is typically described by 10 to 12 carefully chosen MeSH terms. If no search fields are specified, queries will be executed against every search field available in the queried database.[4]
Query parameters
Each of the E-utilities accepts a number of query parameters through the URL line, which you can use to control the type and amount of output returned from a query. This is where you can set the number of search results retrieved or the dates searched. Here is a list of the more important parameters (an example combining several of them follows the list):
Database parameter:
db
should be set to the database you are interested in searching — pubmed or pmc for scientific literature
Date parameters: You can get more control over dates by using search fields ([pdat], for example, for the publication date), but date parameters provide a more convenient way to constrain results.
reldate
the number of days to be searched relative to the current date; set reldate=1 for the most recent day
mindate and maxdate
specify dates according to the format YYYY/MM/DD, YYYY, or YYYY/MM (a query must contain both mindate and maxdate parameters)
datetype
sets the type of date when you query by date — options are 'mdat' (modification date), 'pdat' (publication date), and 'edat' (Entrez date)
Retrieval parameters:
rettype
the type of information to return (for literature searches, use the default setting)
retmode
the format of the output (XML is the default, though all E-utilities except EFetch also support JSON)
retmax
the maximum number of records to return — the default is 20 and the maximum value is 10,000 (ten thousand)
retstart
given a list of hits for a query, retstart specifies the starting index (useful when your search exceeds the 10,000 maximum)
cmd
this is only relevant to ELink and is used to specify whether to return IDs of similar articles or URLs to full texts
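Here is a sketch of a query that combines several of these parameters — it asks pubmed for records with publication dates in the last seven days, returned as JSON (the search term is only a placeholder):

import urllib.request

params_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
              '?db=pubmed'
              '&term=myoglobin'
              '&reldate=7'
              '&datetype=pdat'
              '&retmax=100'
              '&retmode=json')
recent_results = urllib.request.urlopen(params_url).read().decode('utf-8')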
Once we know about the E-utilities, have chosen our search fields, and decided on query parameters, we're ready to execute queries and store the results — even across multiple pages.
While you don't strictly need Python to use the E-utilities, it does make it much easier to parse, store, and analyze the results of your queries. Here's how to get started on your data science project.
Let's say you want to search MeSH terms for the term "myoglobin" between 2022 and 2023. You'll set your retmax to 50 for now, but keep in mind the maximum is 10,000 and you can query at a rate of 3 requests per second.
import urllib.request

search_url = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
              '?db=pubmed'
              '&term=myoglobin[mesh]'
              '&mindate=2022'
              '&maxdate=2023'
              '&retmode=json'
              '&retmax=50')

link_list = urllib.request.urlopen(search_url).read().decode('utf-8')
link_list
The results are returned as a list of IDs, which can be used in a subsequent search within the database you queried. Note that "count" shows there are 154 results for this query, which you could use if you wanted a total count of publications for a certain set of search terms. If you wanted to return IDs for all of the publications, you'd set the retmax parameter to the count, or 154. In general, I set this to a very high number so I can retrieve all of the results and store them.
Boolean searching is straightforward with PubMed — it only requires adding +OR+, +NOT+, or +AND+ to the URL between search terms, as in the sketch below.
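This sketch builds a query that matches either of two MeSH terms (the terms here are placeholders chosen for illustration):

import urllib.request

boolean_url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
               '?db=pubmed'
               '&term=myoglobin[mesh]+OR+hemoglobin[mesh]'
               '&retmode=json'
               '&retmax=50')
boolean_results = urllib.request.urlopen(boolean_url).read().decode('utf-8')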
These search strings can be built using Python. In the following steps, we'll parse the results using Python's json package to get the IDs for each of the publications returned. The IDs can then be joined into a string — this string of IDs can be used by the other E-utilities to return information about the publications.
Use ESummary to return information about publications
The purpose of ESummary is to return data that you might expect to see in a paper's citation (date of publication, page numbers, authors, etc.). Once you have a result in the form of a list of IDs from ESearch (in the step above), you can join this list into a long URL.
The limit for a URL is 2048 characters, and each publication's ID is 8 characters long, so to be safe you should split your list of IDs into batches of 250 if you have more than 250 IDs. See my notebook at the bottom of the article for an example.
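A minimal sketch of that batching, assuming the IDs from ESearch are already in a Python list (the variable names here are purely for illustration), could use numpy's array_split:

from numpy import array_split

# Hypothetical list of PubMed IDs returned by ESearch
all_ids = ['37047528', '37055458']  # imagine several hundred IDs here

batch_size = 250
n_batches = len(all_ids) // batch_size + 1
for batch in array_split(all_ids, n_batches):
    ids_ = ','.join(batch)
    # build an ESummary or EFetch URL with ids_ and request it here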
The results from ESummary are returned in JSON format and may include a link to the paper's full text:
import json

result = json.loads( link_list )
id_list = ','.join( result['esearchresult']['idlist'] )

summary_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={id_list}&retmode=json'
summary_list = urllib.request.urlopen(summary_url).read().decode('utf-8')
We can again use json to parse summary_list. When using the json package, you can browse the fields of each individual article by using summary['result'][id as string], as in the example below:
summary = json.loads( summary_list )
summary['result']['37047528']
We can create a dataframe to capture the ID of each article along with the name of the journal, the publication date, the title of the article, a URL for retrieving the full text, as well as the first and last author.
import re
import pandas as pd

uid = [ x for x in summary['result'] if x != 'uids' ]
journals = [ summary['result'][x]['fulljournalname'] for x in summary['result'] if x != 'uids' ]
titles = [ summary['result'][x]['title'] for x in summary['result'] if x != 'uids' ]
first_authors = [ summary['result'][x]['sortfirstauthor'] for x in summary['result'] if x != 'uids' ]
last_authors = [ summary['result'][x]['lastauthor'] for x in summary['result'] if x != 'uids' ]
links = [ summary['result'][x]['elocationid'] for x in summary['result'] if x != 'uids' ]
pubdates = [ summary['result'][x]['pubdate'] for x in summary['result'] if x != 'uids' ]

# Turn "doi: ..." elocation strings into resolvable URLs
links = [ re.sub(r'doi:\s*', 'http://dx.doi.org/', x) for x in links ]

results_df = pd.DataFrame( {'ID':uid,'Journal':journals,'PublicationDate':pubdates,'Title':titles,'URL':links,'FirstAuthor':first_authors,'LastAuthor':last_authors} )
Below is a list of all the different fields that ESummary returns, so you can build your own database:
'uid','pubdate','epubdate','source','authors','lastauthor','title',
'sorttitle','volume','issue','pages','lang','nlmuniqueid','issn',
'essn','pubtype','recordstatus','pubstatus','articleids','history',
'references','attributes','pmcrefcount','fulljournalname','elocationid',
'doctype','srccontriblist','booktitle','medium','edition',
'publisherlocation','publishername','srcdate','reportnumber',
'availablefromurl','locationlabel','doccontriblist','docdate',
'bookname','chapter','sortpubdate','sortfirstauthor','vernaculartitle'
Use EFetch when you want abstracts, keywords, and other details (XML output only)
We can use EFetch to return fields similar to ESummary, with the caveat that the result is returned in XML only. There are several interesting additional fields in EFetch, including: the abstract, author-selected keywords, the Medical Subject Headings (MeSH terms), grants that sponsored the research, conflict of interest statements, a list of chemicals used in the research, and a complete list of all the references cited by the paper. Here's how you would use BeautifulSoup to obtain some of these items:
from bs4 import BeautifulSoup
import lxml
import pandas as pd

abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
abstract_ = urllib.request.urlopen(abstract_url).read().decode('utf-8')
abstract_bs = BeautifulSoup(abstract_, features="xml")

articles_iterable = abstract_bs.find_all('PubmedArticle')

# Abstracts
abstract_texts = [ x.find('AbstractText').text if x.find('AbstractText') is not None else '' for x in articles_iterable ]

# Conflict of Interest statements
coi_texts = [ x.find('CoiStatement').text if x.find('CoiStatement') is not None else '' for x in articles_iterable ]

# MeSH terms
meshheadings_all = list()
for article in articles_iterable:
    if article.find('MeshHeadingList') is not None:
        result = article.find('MeshHeadingList').find_all('MeshHeading')
        meshheadings_all.append( [ x.text for x in result ] )
    else:
        meshheadings_all.append( [] )

# Reference lists
references_all = list()
for article in articles_iterable:
    if article.find('ReferenceList') is not None:
        result = article.find('ReferenceList').find_all('Citation')
        references_all.append( [ x.text for x in result ] )
    else:
        references_all.append( [] )

results_table = pd.DataFrame( {'COI':coi_texts, 'Abstract':abstract_texts, 'MeSH_Terms':meshheadings_all, 'References':references_all} )
Now we can use this table to search abstracts and conflict of interest statements, or make visuals that connect different fields of research using MeSH headings and reference lists. There are of course many other tags returned by EFetch that you could explore; here's how you can see all of them using BeautifulSoup:
efetch_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
efetch_result = urllib.request.urlopen( efetch_url ).read().decode('utf-8')
efetch_bs = BeautifulSoup(efetch_result, features="xml")

tags = efetch_bs.find_all()
for tag in tags:
    print(tag)
Using ELink to retrieve similar publications and full-text links
You may want to find articles similar to the ones returned by your search query. These articles are grouped according to a similarity score using a probabilistic topic-based model.[5] To retrieve the similarity scores for a given ID, you must pass cmd=neighbor_score in your call to ELink. Here's an example for one article:
import urllib.request
import json
import pandas as pd

id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=neighbor_score'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )

ids_ = []; score_ = []
all_links = elinks_json['linksets'][0]['linksetdbs'][0]['links']
for link in all_links:
    ids_.append( link['id'] )
    score_.append( link['score'] )

pd.DataFrame( {'id':ids_, 'score':score_} ).drop_duplicates(['id','score'])
The other function of ELink is to provide full-text links to an article based on its ID, which can be returned if you pass cmd=prlinks to ELink instead.
If you wish to access only those full-text links that are free to the public, you'll want to use links that contain "pmc" (PubMed Central). Accessing articles behind a paywall may require a subscription through a university — before downloading a large corpus of full-text articles through a paywall, you should consult with your organization's librarians.
Here is a code snippet showing how you could retrieve the links for a single publication:
id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )

[ x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls'] ]
You can also retrieve links for multiple publications in a single call to ELink, as I show below:
id_list = '37055458,574140'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_list}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )
elinks_json

urls_ = elinks_json['linksets'][0]['idurllist']
for url_ in urls_:
    [ print( url_['id'], x['url']['value'] ) for x in url_['objurls'] ]
Sometimes a scientific publication is authored by someone who is a CEO, CSO, or CTO of a company. With PubMed, we have the ability to analyze the latest life science industry trends. Conflict of interest statements, which were introduced as a search term in PubMed during 2017,[6] give a lens into which author-provided keywords appear in publications where an industry executive is disclosed as an author — in other words, the keywords chosen by the authors to describe their findings. To carry out this analysis, simply include CEO[cois]+OR+CSO[cois]+OR+CTO[cois] as a search term in your URL, retrieve all of the results returned, and extract the keywords from the resulting XML output for each publication. Each publication contains between 4 and 8 keywords. Once the corpus is generated, you can quantify keyword frequency per year within the corpus as the number of publications in a year specifying a keyword, divided by the number of publications for that year.
For example, if 10 publications list the keyword "cancer" and there are 1,000 publications that year, the keyword frequency would be 0.01. Using the seaborn clustermap module with the keyword frequencies, you can generate a visualization where darker bands indicate a larger value of keyword frequency per year (I've dropped COVID-19 and SARS-COV-2 from the visualization, as they were both represented at frequencies far greater than 0.05, predictably).
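As a small sketch of that frequency calculation (the keyword lists below are made up purely for illustration), mirroring the approach used in the full script that follows:

from collections import Counter
import itertools

# Hypothetical author-supplied keyword lists, one list per publication in a given year
keywords_by_publication = [['cancer', 'machine learning'], ['cancer', 'biomarkers'], ['liquid biopsy']]

n_publications = len(keywords_by_publication)
counts = Counter(itertools.chain(*keywords_by_publication))

# Frequency = number of publications listing a keyword / number of publications that year
frequencies = {keyword: count / n_publications for keyword, count in counts.items()}
print(frequencies)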
From this visualization, several insights about the corpus of publications with C-suite authors become clear. First, one of the most distinct clusters (at the bottom) contains keywords that have been strongly represented in the corpus for the past five years: cancer, machine learning, biomarkers, artificial intelligence — just to name a few. Clearly, industry is heavily active and publishing in these areas. A second cluster, near the middle of the figure, shows keywords that disappeared from the corpus after 2018, including physical activity, public health, children, mass spectrometry, and mhealth (or mobile health). This is not to say that these areas aren't being developed in industry, just that the publication activity has slowed. Looking at the bottom right of the figure, you can pick out terms that have appeared more recently in the corpus, including liquid biopsy and precision medicine — which are indeed two very "hot" areas of medicine at the moment. By inspecting the publications further, you could extract the names of the companies and other information of interest. Below is the code I wrote to generate this visual:
import pandas as pd
import time
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
from collections import Counter
from numpy import array_split
from urllib.request import urlopen

class Searcher:
    # Any instance of Searcher will search for the terms and return the number of results on a per-year basis #
    def __init__(self, start_, end_, term_, **kwargs):
        self.raw_ = input
        self.name_ = 'searcher'
        self.description_ = 'searcher'
        self.duration_ = end_ - start_
        self.start_ = start_
        self.end_ = end_
        self.term_ = term_
        self.search_results = list()
        self.count_by_year = list()
        self.options = list()

        # Parse keyword arguments
        if 'count' in kwargs and kwargs['count'] == 1:
            self.options = 'rettype=count'

        if 'retmax' in kwargs:
            self.options = f'retmax={kwargs["retmax"]}'

        if 'run' in kwargs and kwargs['run'] == 1:
            self.do_search()
            self.parse_results()

    def do_search(self):
        datestr_ = [self.start_ + x for x in range(self.duration_)]
        options = "".join(self.options)
        for year in datestr_:
            this_url = (f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
                        f'?db=pubmed&term={self.term_}'
                        f'&mindate={year}&maxdate={year + 1}&{options}')
            print(this_url)
            self.search_results.append(
                urlopen(this_url).read().decode('utf-8'))
            time.sleep(.33)

    def parse_results(self):
        for result in self.search_results:
            xml_ = BeautifulSoup(result, features="xml")
            self.count_by_year.append(xml_.find('Count').text)
            self.ids = [id.text for id in xml_.find_all('Id')]

    def __repr__(self):
        return repr(f'Search PubMed from {self.start_} to {self.end_} with search terms {self.term_}')

    def __str__(self):
        return self.description_
# Create a list that will contain Searchers, which retrieve results for each of the search queries
searchers = list()
searchers.append(Searcher(2022, 2023, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2021, 2022, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2020, 2021, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2019, 2020, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2018, 2019, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
# Create a dictionary to store keywords for all articles from a particular year
keywords_dict = dict()

# Each Searcher obtained results for a particular start and end year
# Iterate over searchers
for this_search in searchers:

    # Split the results from one search into batches for URL formatting
    chunk_size = 200
    batches = array_split(this_search.ids, len(this_search.ids) // chunk_size + 1)

    # Create a dict key for this Searcher object based on the years it covers
    this_dict_key = f'{this_search.start_}to{this_search.end_}'

    # Each value in the dictionary will be a list that gets appended with keywords for each article
    keywords_all = list()

    for this_batch in batches:
        ids_ = ','.join(this_batch)

        # Pull down the XML for all of the results in a batch
        abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={ids_}'
        abstract_ = urlopen(abstract_url).read().decode('utf-8')
        abstract_bs = BeautifulSoup(abstract_, features="xml")
        articles_iterable = abstract_bs.find_all('PubmedArticle')

        # Iterate over all of the articles in the batch
        for article in articles_iterable:
            result = article.find_all('Keyword')
            if result is not None:
                keywords_all.append([x.text for x in result])
            else:
                keywords_all.append([])

        # Take a break between batches!
        time.sleep(1)

    # Once all of the keywords are assembled for a Searcher, add them to the dictionary
    keywords_dict[this_dict_key] = keywords_all

    # Print the key once it has been stored
    print(this_dict_key)
# Limit to terms that appeared approximately 5 times or more in any given year
mapping_ = {'2018to2019':2018,'2019to2020':2019,'2020to2021':2020,'2021to2022':2021,'2022to2023':2022}
global_word_list = list()

for key_, value_ in keywords_dict.items():
    Ntitles = len( value_ )
    flattened_list = list( itertools.chain(*value_) )
    flattened_list = [ x.lower() for x in flattened_list ]
    counter_ = Counter( flattened_list )
    words_this_year = [ ( item, frequency/Ntitles, mapping_[key_] ) for item, frequency in counter_.items() if frequency/Ntitles >= .005 ]
    global_word_list.extend(words_this_year)

# Plot results as a clustermap
global_word_df = pd.DataFrame(global_word_list)
global_word_df.columns = ['word', 'frequency', 'year']
pivot_df = global_word_df.loc[:, ['word', 'year', 'frequency']].pivot(index="word", columns="year",
                                                                      values="frequency").fillna(0)
pivot_df.drop('covid-19', axis=0, inplace=True)
pivot_df.drop('sars-cov-2', axis=0, inplace=True)
sns.set(font_scale=0.7)
plt.figure(figsize=(22, 2))
res = sns.clustermap(pivot_df, col_cluster=False, yticklabels=True, cbar=True)
After reading this article, you should be ready to go from crafting highly tailored search queries of the scientific literature all the way to producing data visualizations for closer scrutiny. While there are other, more complex ways to access and store articles using additional features of the various E-utilities, I've tried to present the most straightforward set of operations that should apply to most use cases for a data scientist interested in scientific publishing trends. By familiarizing yourself with the E-utilities as I've presented them here, you'll go far toward understanding the trends and connections within the scientific literature. As mentioned, there are many items beyond publications that can be unlocked by mastering the E-utilities and how they operate within the larger universe of NCBI databases.