Sentiment Analysis for the Masses – How LLMs Changed the Game

This is a personal copy of a column in IEEE Software (Jan/Feb 2025). Republished with permission.

This column turns its attention to large-scale sentiment analysis, a topic that has been popular in requirements engineering over the past decade. Now is a particularly good time to revisit it, as large language models have made this type of analysis accessible to all of us. Most importantly, techniques that used to have a high barrier to entry are now just a few prompts away. Time to reap the benefits! However, our reference to “the masses” caters to all levels of expertise – even seasoned analysts training custom models will find novel insights in the column.


Markus Borg, Tom Richter

The web can be a deep well of information. It’s rich and full of insights, but turning the crank enough times to bring them to the surface can be hard work. And just like you might want to filter water from a well before drinking, scraping online sources often requires filtering and preprocessing to create real value.

If you have a consumer-facing software product, your users are likely sharing their thoughts across various forums. They may comment on features via social media – or complain about what’s missing. Nasty bugs can be openly discussed in public. There may be community-managed forums where users compare products and provide peer support. Marketplace product reviews are another valuable source of user feedback.

Numerous products and services are available to analyze the feedback shared by users online. Customer experience platforms and social media monitoring tools are widely available. The major cloud platforms – such as Google Cloud, Microsoft Azure, and Amazon Web Services – let users build solutions using their Natural Language Processing (NLP) toolkits.

Automated Opinion Mining

It is noteworthy that the last two IEEE International Requirements Engineering (RE) conferences have both recognized work on opinion mining through Most Influential Paper awards. At RE, this prestigious award goes to “the paper published at RE ten years ago which had the biggest influence in the RE community.”

In 2023, the award went to “User Feedback in the AppStore: An Empirical Study” by Pagano and Maalej [1]. Their groundbreaking 2013 paper analyzed more than a million mobile app reviews, laying the foundation for a new research direction. From an RE perspective, they demonstrated how the insights from review mining provide an understanding of how products are actually being used – so useful for prioritization and roadmapping.

In 2024, the award went to “How Do Users Like This Feature? A Fine-Grained Sentiment Analysis of App Reviews” by Guzman and Maalej [2]. The authors introduced the concept of sentiment analysis and extracted user opinions – positive or negative – about individual app features. This allowed app providers to move beyond generic star ratings to instead get detailed insights into which features users love or hate. A true RE goldmine!

The RE community can soon enjoy an upcoming book chapter by Maalej and colleagues that summarizes the field [3]. A preprint is already available, and we find it telling that many sections conclude a long historical outlook of previous work – describing the enabling NLP techniques – with roughly “now you can do the same with Large Language Models (LLMs),” always followed by an example prompt for the reader to try. This column will share similar examples.

Beginnings of Sentiment Analysis

Sentiment analysis means analyzing digital text to determine whether its emotional tone is positive, negative, or neutral. For a given chunk of text, often just a few words, the output is typically a polarity score, for example between -10 and +10.

Sentiment analysis was initially rule-based, relying on a dictionary of polarizing terms. For example, if a user expressed words like intuitive, reliable, excellent, or epic, the output sentiment would be positive. Conversely, terms like slow, clunky, laggy, and boring would yield a negative sentiment. Each term carried its own polarity, reflecting the intensity of the sentiment it conveyed. Rule-based solutions are straightforward but are tricked by negations and cannot detect sarcasm.
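Such a rule-based scorer can be sketched in a few lines of Python. The dictionary below uses hypothetical term weights of our own choosing – real lexicons contain thousands of carefully weighted entries:

```python
# Minimal rule-based sentiment scorer (hypothetical term weights).
POLARITY = {
    "intuitive": 2, "reliable": 2, "excellent": 3, "epic": 3,
    "slow": -2, "clunky": -2, "laggy": -2, "boring": -1,
}

def score(text: str) -> int:
    """Sum the polarity of every known term in the text."""
    words = text.lower().split()
    return sum(POLARITY.get(w.strip(".,!?"), 0) for w in words)

print(score("Epic graphics, but slow and clunky menus."))  # 3 - 2 - 2 = -1
```

Note that the phrase “not slow” would still score -2, illustrating exactly the negation weakness mentioned above.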

Training Machine Learning (ML) classifiers was an alternative to the rule-based systems. Humans labeled the sentiments of numerous examples, and well-established models such as support vector machines and random forest classifiers could learn to replicate human sentiment judgments. The focus shifted from maintaining dictionaries to curating labeled examples. When trained on diverse datasets, this often led to more robust results.

Neural Networks Deepened

Deep neural networks entered the scene and changed the world forever. NLP was one application area fundamentally transformed. Previously, various handcrafted linguistic tricks were sometimes applied to boost results. But as the neural networks deepened, and were fueled by Internet-scale data, these learning machines began to outperform all sorts of human linguistic craftsmanship. The neural networks could instead learn suitable embeddings directly from the text.

A word embedding is simply a computer’s representation of a word. In its simplest form, each unique word in a textual dataset corresponds to a unique dimension. But such a representation doesn’t capture the meanings words derive from their relationships with one another. This type of relational information is ideal to embed when training ML models.
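The limitation is easy to demonstrate. In the sketch below (toy vectors of our own making), one-hot vectors make every pair of words look equally unrelated, while even a crude dense embedding can place “fast” closer to “quick” than to “slow”:

```python
import math

# One-hot: every word gets its own dimension - all pairs look equally unrelated.
vocab = ["fast", "quick", "slow"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(one_hot["fast"], one_hot["quick"]))  # 0.0 - no similarity captured

# Dense embeddings (made-up toy vectors) can encode that fast and quick are related.
dense = {"fast": [0.9, 0.1], "quick": [0.85, 0.15], "slow": [-0.8, 0.2]}
print(cosine(dense["fast"], dense["quick"]) > cosine(dense["fast"], dense["slow"]))  # True
```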

Deep neural networks excel at identifying patterns from big data, and natural language text is no exception. They are excellent at learning how the context of a word matters. This is accurately captured in modern embeddings, which represent semantic similarities between words. These embeddings are used as features for training ML models, significantly enhancing their performance. We have seen new records in various benchmark tasks such as question answering, text classification – and certainly sentiment analysis.

LLMs for the Masses

The progress in sentiment analysis has led to an increasingly hands-off approach [1]. This evolution began with hand-crafted dictionaries, progressed through ML enhanced by linguistic engineering, and reached deep learning fueled by large datasets. Then ChatGPT arrived with its disruptive, human-friendly dialog mode. Now, anyone can interact with the most capable LLMs in plain language. But of course, manually entering reviews in a chat dialog doesn’t scale.

Luckily, with some developer skills, using APIs to prompt LLMs is almost as simple today. The general structure of prompts applies to sentiment analysis as well. That is:

  1. Some context to steer the model toward the right parts of its knowledge.
  2. An instruction for the task you want the LLM to perform.
  3. Input data for which you want the task to be completed.
  4. An output indicator to specify how the response should be given.

Example A in the figure below shows a prompt using these four standard building blocks. The application domain is mobile apps and the example classifies according to a simple three-level polarity scale and returns the answer followed by an explanation. There is very limited effort involved, and the solution is accurate enough for many types of analysis.
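A prompt along these lines can also be assembled programmatically, which is how you would feed it to an API at scale. The wording below is our own sketch of the four building blocks, not the exact text of Example A:

```python
def build_prompt(review: str) -> str:
    """Assemble a sentiment-classification prompt from the four standard blocks."""
    context = "You are analyzing user reviews of mobile apps."       # 1. context
    instruction = "Classify the sentiment of the following review."  # 2. instruction
    data = f'Review: "{review}"'                                     # 3. input data
    output = ("Answer with one of: positive, negative, neutral, "    # 4. output indicator
              "followed by a one-sentence explanation.")
    return "\n".join([context, instruction, data, output])

print(build_prompt("Love the new dark mode, but the app crashes on startup."))
```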

Drilling Down into Aspects

Traditional sentiment analysis assigns a single label to the entire text. A natural evolution is Aspect-Based Sentiment Analysis (ABSA) – a more fine-grained variant of opinion mining. ABSA focuses on extracting sentiment for specific aspects within a text. But which aspects? This, naturally, depends on the application domain.

ABSA involves some form of aspect identification. For some applications, such as mobile games, it might be possible to pre-determine a closed list of relevant aspects. Examples might include gameplay, graphics, and replayability. Alternatively, you can train ML classifiers for aspect identification – but contemporary LLMs often perform this task well right out of the box.

Example B in the figure shows an example of ABSA prompting. This prompt uses the same four building blocks as the previous example but introduces three key variations. First, we request sentiment judgments on a fine-grained scale from -10 to 10. Second, the prompt now includes multiple reviews instead of just one. Third, the output is requested in a standardized JSON format, making it easy to programmatically process in subsequent steps of the pipeline. Once again, substantial LLM power can be harnessed with just a few keystrokes.
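The same idea in code: the sketch below builds an ABSA prompt for a batch of reviews and parses a JSON response. The JSON schema is our own assumption for illustration, not the exact format used in Example B:

```python
import json

def build_absa_prompt(reviews):
    """Assemble an ABSA prompt from the four standard building blocks."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))
    return (
        "You are analyzing user reviews of mobile apps.\n"           # context
        "For each review, extract the features mentioned and rate "  # instruction
        "the sentiment toward each on a scale from -10 to 10.\n"
        f"Reviews:\n{numbered}\n"                                    # input data
        'Respond with JSON only: [{"review": <number>, '             # output indicator
        '"aspect": <string>, "sentiment": <integer>}, ...]'
    )

# A response in the requested format is trivial to post-process:
response = '[{"review": 1, "aspect": "graphics", "sentiment": 8}]'
for item in json.loads(response):
    print(item["aspect"], item["sentiment"])
```

Requesting structured output like this is what makes the results easy to pipe into dashboards, databases, or further analysis steps.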

The Downside? Token Costs

What about the drawbacks of the LLM approach? The major technology providers are still learning how to best monetize LLM-based services. The driving force behind this market is the cost per token – fragments of language, such as words or parts of words, used in both input and output.

When scaling up, LLM users must learn to strike a balance between crafting long, informative prompts and managing costs. Different providers, like Google and Anthropic, have varying pricing models. More importantly, differently capable LLMs come with very different price tags per token, reflecting their computational demands.
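A back-of-the-envelope estimate shows how quickly tokens add up at scale. The per-token prices below are hypothetical placeholders – check your provider’s current rates:

```python
# Back-of-the-envelope cost estimate for large-scale review analysis.
# Prices are hypothetical placeholders, not any provider's actual rates.
PRICE_PER_1K_INPUT = 0.003   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.006  # USD per 1,000 output tokens (assumed)

def estimate_cost(n_reviews, tokens_in_per_review, tokens_out_per_review):
    """Total API cost in USD for analyzing n_reviews."""
    total_in = n_reviews * tokens_in_per_review
    total_out = n_reviews * tokens_out_per_review
    return (total_in / 1000) * PRICE_PER_1K_INPUT + \
           (total_out / 1000) * PRICE_PER_1K_OUTPUT

# 100,000 reviews, ~200 prompt tokens and ~50 response tokens each:
print(f"${estimate_cost(100_000, 200, 50):.2f}")  # $90.00
```

Note how a longer, more informative prompt multiplies across every single review – exactly the balance users must learn to strike.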

At the time of writing, OpenAI’s new o1 model has just been released, boasting improved reasoning capabilities. However, the release included the fine print “internal reasoning tokens generated by the model are not visible in API responses” – a problematic move for cost-aware users looking to build solutions on top of this LLM.

Budget-conscious? Try DIY

Advanced users looking to minimize sentiment analysis costs can explore other options. Smaller models tailored for the task can be deployed on your own infrastructure and perform competitively. Once deployed, you can use them without thinking about skyrocketing token counts. If you expect to analyze numerous reviews, this can really make a difference on the bottom line compared to sending one API request per example.

A great starting point is to try pre-trained open-source models fine-tuned for ABSA using public datasets. Hugging Face – a popular platform for sharing and deploying ML models – is your friend here. If the user feedback you’re targeting is aligned with general linguistic patterns, these free models will probably perform well.

Perhaps you indeed find unique expressions of sentiment within your domain. Maybe you’re operating in a niche market with a specialized language? You could then invest in manually annotating a sample. Such a dataset could then be used to fine-tune the open-source ABSA models a second time, tailoring them for your context. Chances are that around 1,000 examples can substantially boost your model’s accuracy.

Going Small Using Distillation

Expert users interested in state-of-the-art ABSA training for small models have another interesting option to consider: distillation [4]. It’s another approach that aims to create an efficient (as in small) model that balances cost with accuracy.

The distillation process trains a small “student” model to mimic the output of a larger “teacher” LLM. Distilling step-by-step is a state-of-the-art innovation in this field developed at Google Research [5]. The core idea is to supplement the teacher’s output label with rationales explaining its internal reasoning.

For sentiment analysis, this involves training a model that not only predicts the user opinion expressed in text but also the underlying reasons. During training, the student is tasked with both 1) predicting the correct sentiment and 2) generating corresponding rationales. This method, inspired by human learning, aims to guide the student model to more accurate predictions.

We prototyped distilling step-by-step for training an ABSA model on mobile app reviews, using OpenAI’s GPT-3.5-turbo as the teacher and Google’s t5-large as the student. First, the teacher generated ABSA labels with rationales for 35,000 app reviews. Second, the student learned to replicate the teacher’s sentiment predictions and rationales for a subset of the reviews. How did it perform on unseen reviews?
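The heart of this setup is turning each teacher output into two supervised targets for the student: the label and the rationale. The task prefixes and field names below are our own assumptions for illustration, not the exact format of our prototype:

```python
# Sketch: turning teacher (LLM) outputs into multi-task training examples
# for distilling step-by-step. Prefixes and field names are assumptions.
def to_training_examples(review, teacher_output):
    """One review yields two targets: the ABSA label and the rationale."""
    return [
        {"input": f"[predict] {review}",
         "target": f'{teacher_output["aspect"]}: {teacher_output["sentiment"]}'},
        {"input": f"[explain] {review}",
         "target": teacher_output["rationale"]},
    ]

teacher_output = {
    "aspect": "battery life",
    "sentiment": -7,
    "rationale": "The reviewer reports the app drains the battery quickly.",
}
examples = to_training_examples("This app kills my battery in an hour!", teacher_output)
for ex in examples:
    print(ex["input"], "->", ex["target"])
```

The student is trained on both targets jointly, so the rationale acts as extra supervision; at inference time only the prediction task is used, keeping the deployed model small and fast.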

As expected, the student model didn’t match the teacher in ABSA accuracy. Still, our feasibility study yielded encouraging results. The student model, which learned using rationales, outperformed a baseline model trained without them. After a careful analysis of a few hundred reviews, our findings are promising. With just a few tens of dollars spent on distilling at a popular cloud provider, we now have our own ABSA service up and running. Amazing times!

Ten years after the two Most Influential Paper awards at the IEEE RE conference, the world is ready for sentiment analysis at scale. Some caution is needed, as any automated analysis – just like human analysis – sometimes makes mistakes. We must focus on general trends and continuously monitor how accurately the analysis pipelines perform.

This column has presented the evolution of sentiment analysis and provided LLM-based entry points for novices, advanced users, and experts alike. Where does your organization stand? Have you managed to systematically harness the value of large-scale opinion mining? And how do you manage the involved risks? We’d love to hear from you!

References

  • [1] D. Pagano and W. Maalej, User Feedback in the AppStore: An Empirical Study, In Proc. of the 21st IEEE International Conference on Requirements Engineering, pp. 125-134, 2013.
  • [2] E. Guzman and W. Maalej, How Do Users Like This Feature? A Fine Grained Sentiment Analysis of App Reviews, In Proc. of the 22nd IEEE International Conference on Requirements Engineering, pp. 153-162, 2014.
  • [3] W. Maalej, V. Biryuk, J. Wei, and F. Panse, On the Automated Processing of User Feedback, A. Ferrari and G. Ginde (Eds): Handbook of Natural Language Processing for Requirements Engineering, Springer Nature Switzerland, 2024.
  • [4] J. Lehečka, J. Švec, P. Ircing, and L. Šmídl, BERT-based Sentiment Analysis Using Distillation, In Proc. of the International Conference on Statistical Language and Speech Processing, pp. 58-70, 2020.
  • [5] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003-8017, 2023.