Dabbling in Natural Language Processing using publications by IRRI staff

I first encountered natural language processing (NLP) in data science class, where a classmate's project opened my eyes to the possibilities of the method. Before that class, I had worked on determining descriptors for rice varieties by identifying their most frequently associated adjectives and finding which rice descriptors co-occurred with viand names. I even published a paper on rice descriptors.

Anyway, during class, I decided to embark on a side project (massive, in my opinion, when I was starting it) to sharpen my skills in NLP. I didn't realise how deep the NLP hole was until I started working on it.

Sourcing the documents
I remembered that IRRI lists its staff's publications online. It was just a matter of accessing the information and putting it in an Excel spreadsheet by copying and pasting...

Wrong!

There are over 6000 articles in the list of publications. There was no way I could manually copy the information from the website and paste it into a spreadsheet quickly enough. So I decided to retrieve the title, publication year, and URL of each article via web scraping. And then I realised that the abstracts (summaries) were not included in the publications' details! Hence, I had to do a second round of web scraping.

However, the second round of web scraping (to get the abstracts) was more complicated than the first. While looking at how the websites were structured, I realised that some journal sites share the same layout, which meant I could build a web crawler that scrapes multiple journals. I also learned that some journals do not allow programmatic web scraping at all (crawlers were blocked from accessing the information); for articles in these journals, I had to collect the data manually.

After automating most of the data collection with the BeautifulSoup module in Python, I collected everything I needed in less time than I expected. In total, as of this writing, I have data for 5793 of the 6230 articles on the IRRI Publications website. After filtering out the articles without abstracts, I was left with 5189 usable entries... still a decent-sized collection of abstracts.
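As a rough illustration of that first scraping pass, here is a minimal BeautifulSoup sketch. The page structure it assumes (a `div.publication` block containing a title link and a year span) is hypothetical; the real IRRI listing pages are laid out differently.

```python
# Sketch of the first scraping pass: collect title, year, and URL
# per article. The HTML structure here is a hypothetical example.
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def parse_listing(html):
    """Extract (title, year, url) tuples from one listing page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select("div.publication"):
        link = entry.select_one("a.title")
        year = entry.select_one("span.year")
        rows.append((link.get_text(strip=True),
                     year.get_text(strip=True),
                     link["href"]))
    return rows


def scrape_to_csv(urls, path="publications.csv"):
    """Fetch each listing page and save all rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "year", "url"])
        for url in urls:
            writer.writerows(parse_listing(urlopen(url).read()))
```

The second pass (fetching abstracts from the journal sites) follows the same pattern, with one set of selectors per journal layout.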

Number of publications
I found the number of publications per year interesting because it shows that IRRI staff have been publishing their research at an increasing rate, making their findings accessible to scientists all over the world. From 2009 onwards (except 2020, which has just begun), years with a high number of publications alternate with years with slightly lower numbers. My guess is that the lower years were when the scientists were gathering evidence and crunching numbers, and the higher years were when they were publishing their findings.


Cleaning the data
Data cleaning involved filtering out "stop words", numbers, punctuation, and words fewer than four letters long; creating a list of tokens (words) per article; converting the words to their base forms (lemmatisation); and keeping only the nouns (based on part-of-speech, POS, tagging). In addition, some words co-occur so often that they can be treated as a single term (for example, Oryza and sativa become oryza_sativa).

There are several Python NLP packages to choose from. I opted to use the Natural Language Toolkit (NLTK) and GenSim for data cleaning (and topic modelling). Together, they can distil the text data, particularly the abstracts, down to its "essence" (for my purposes, the nouns).
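The filtering and bigram-merging steps can be sketched in plain Python. This is a simplification: the stop-word list below is a tiny sample, and the real pipeline used NLTK for lemmatisation and POS tagging and GenSim's Phrases model to detect frequent bigrams.

```python
# Simplified cleaning sketch: tokenise, drop stop words / numbers /
# short words, and merge known co-occurring pairs into single terms.
# (Lemmatisation and noun-only filtering, done with NLTK in the real
# pipeline, are omitted here.)
import re

STOP_WORDS = {"the", "and", "with", "were", "from", "this", "that"}  # tiny sample


def clean(text, bigrams=(("oryza", "sativa"),)):
    """Return the filtered tokens of one abstract."""
    # keep alphabetic tokens of at least four letters
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if len(t) >= 4 and t not in STOP_WORDS]
    # merge frequent pairs, e.g. "oryza sativa" -> "oryza_sativa"
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```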

Through the process of NLP, I started appreciating how information-rich abstracts really are... AND I did not have to read all the 5000+ articles to find trends. I'm not saying that it's good to not read the abstracts; my point is: if I can automate the small stuff, I can start focusing on understanding the trends in the bigger scheme of things. For example...

Which countries are covered by these studies?
One topic that interests me is where the studies were conducted. I extracted the countries mentioned in each abstract using a custom-built function, counted the number of mentions per country, and visualised these counts on a Leaflet map using the folium package in Python. The intensity of a colour indicates the number of studies mentioning a country; this is what is called a choropleth map.
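The counting step can be sketched as below, with a sample country list standing in for the full one; the folium call at the end is shown only in outline, and the GeoJSON file name is hypothetical.

```python
# Count how many abstracts mention each country (sample list only).
from collections import Counter

COUNTRIES = ["India", "China", "Bangladesh", "Philippines", "Vietnam"]


def count_country_mentions(abstracts):
    """Count the abstracts that mention each country at least once."""
    counts = Counter()
    for text in abstracts:
        for country in COUNTRIES:
            if country.lower() in text.lower():
                counts[country] += 1
    return counts


# Visualisation outline (not run here): feed the counts into
# folium.Choropleth with a world-borders GeoJSON (file name is
# hypothetical) to colour each country by its mention count.
#
# import folium
# m = folium.Map(location=[20, 100], zoom_start=3)
# folium.Choropleth(geo_data="world_countries.json",
#                   data=counts,
#                   key_on="feature.properties.name").add_to(m)
```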



The results indicate that India is the most frequently mentioned country in the abstracts, followed by China, Bangladesh, the Philippines, Vietnam, Thailand, and Indonesia. These countries are among the world's top rice producers; naturally, they benefit the most from the technological advances that IRRI and its partner institutes are developing.

Extracting keywords
But what research areas are these countries drawing benefits from? To get to the topics, I had to extract the keywords from each abstract (again, another taxing job that can be automated through Python).

There are many nouns in a 250-word abstract, but which ones count as important enough to be considered keywords? Instinctively, one would think that the most frequently used words in an abstract are the most important. However, it turns out that if a term occurs across too many abstracts (e.g., more than 85% of them), it may not be that important: it does not distinguish one article from the others. At the other extreme, if a term occurs in fewer than 10% of the abstracts, it is likely too rare to reveal a trend. To rank the remaining terms after data cleaning, I calculated the "term frequency-inverse document frequency" (TF-IDF).
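To make TF-IDF concrete, here is a toy computation on token lists (my actual runs used GenSim's Dictionary and TfidfModel): a term scores high in a document when it is frequent there but rare across the collection, and scores zero when it appears in every document.

```python
# Toy TF-IDF: score = (term frequency in doc) * log(N / document frequency)
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc."""
    n_docs = len(docs)
    # document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores
```

Note how a term that appears in every document gets log(N/N) = 0, which is exactly why ubiquitous terms drop out of the keyword list.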

Topic modelling
I used the GenSim package to assign each article to topics based on latent Dirichlet allocation (LDA). Simply put, I tried to programmatically discover the topics found in each abstract. LDA assumes that each abstract contains a mixture of topics, with some topics more probable than others. Note that topic modelling is akin to clustering the abstracts, except that each abstract can belong to multiple topics ("soft" clustering); the topic with the highest probability is then assigned to each abstract. In contrast, k-means clustering or hierarchical clustering assigns each abstract to only one cluster ("hard" clustering).

Anyway, the analysis showed that the abstracts could be clustered into five distinct topics (I could have asked for more topics but then the topics would overlap). Overall, yield is the most salient (important) term among the five topics, suggesting that improving yield is one of the most important goals of IRRI research. This is not surprising because the institute's mission is to improve the food security status of rice consumers. The second most salient term is soil. This suggests that addressing soil issues is a way of ensuring yield and food security.



Looking at the keywords included in each topic allowed me to infer what the major research themes were in these abstracts. Based on the donut chart, input management comprised around a third of the research published by IRRI scientists, followed by breeding for stress tolerance. Crop modelling, a means to predict crop performance in various environmental conditions, had the smallest proportion of abstracts.




Of course, these aren't the only topics being studied at IRRI. Grain quality clearly did not figure prominently; I suspect that many grain quality articles are folded into rice physiology (e.g., grain filling and chalkiness), rice genetics (e.g., genes associated with various grain quality attributes), and input management (e.g., effects of fertiliser inputs on cooked grain texture). Nutrition also did not emerge as a distinct topic despite the push to use rice as a delivery system for improved nutrition for undernourished populations (e.g., high-iron, high-zinc, Vitamin A-containing rice). Again, I assume this is because the nutrition articles fall within the rice genetics topic, since most of these studies are still about understanding the genes linked with increased nutrients in the grain. Publications on postharvest technologies, decision tools, economic benefits and tradeoffs, and consumer research did not figure prominently either. Perhaps these topics will become more prominent in the future, when the spotlight widens to include rice consumers as major targets of research; the results so far demonstrate that farmers have been the major beneficiaries of IRRI research.

I'm amazed at the potential of natural language processing for extracting insights from large collections of text data. I can easily find the main themes in the entire collection, as shown here, or explore how the themes have changed over time (I've dabbled in that but didn't include it in this post). I can create filters to pick only the articles relevant to my interests (e.g., grain quality) and create a two-sentence summary of each abstract by ranking its individual sentences (I did this too but didn't include the results here).
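To give a flavour of the sentence-ranking idea, here is a naive extractive summariser (a sketch, not the exact method used): score each sentence by the frequencies of its words across the whole abstract, then keep the top two sentences in their original order.

```python
# Naive extractive summariser: frequency-scored sentence ranking.
import re
from collections import Counter


def two_sentence_summary(text):
    """Return the two highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # word frequencies across the whole abstract (short words ignored)
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower())
                   if len(w) >= 4)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:2]
    return " ".join(s for s in sentences if s in top)
```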

Frankly, I can't believe that I've started digging into the field of artificial intelligence!
