Outperforming LLMs with 100x cheaper and faster models


Well, alright, this title is clickbait-y. I wanted to talk a little about something I've noticed recently. This is essentially subtweeting a bunch of people I see on different platforms who seem to be preoccupied with LLMs (and huge models in general) and appear to claim that they can now solve everything. However, it's important not to forget about "classical" ML/NLP methods, since they are still powerful for many well-established tasks.

What this post will contain: a brief comparison of performance on sentiment analysis – a well-known and studied NLP problem – using four methods: two LLM-powered ones and two classical ones (for good measure):

  • Zero-Shot LLM Classification
  • Few-Shot LLM Classification
  • TF-IDF + Support Vector Classifier
  • TF-IDF + Decision Tree Classifier

What this post will not be about:

  • It's not a tutorial on sentiment analysis, NLP, or ML in general. If you need a refresher on these topics, I can suggest GeeksforGeeks; it's quite alright.
  • It will not be scientifically rigorous in the slightest. I'm using a toy dataset, and the implementation is quite hand-wavy and nowhere near optimal. I essentially had a couple of hours of spare time to cook this up, so here we are.

Problem setting

Sentiment analysis is a well-known problem – given some text, we want to mark it as positive, negative, or neutral. I've semi-randomly picked this Twitter sentiment dataset from Kaggle. It has three classes (Negative, Neutral, Positive) and comes pre-split into training (74,000 tweets) and test (1,000 tweets) sets.

I will use accuracy, recall (one-vs-rest per class), and precision (one-vs-rest per class) to assess test performance.

I will also record training and inference times, as well as the cost of inference.
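For reference, a minimal sketch of how these per-class metrics can be computed with sklearn (the notebook code may differ slightly):

from sklearn.metrics import accuracy_score, precision_score, recall_score

LABELS = ["Negative", "Neutral", "Positive"]

def report(y_true, y_pred):
    # Overall accuracy, plus one-vs-rest recall and precision for each class
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0))
    print("Precision:", precision_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0))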

For both LLM methods I will be using OpenAI's API with the gpt-3.5-turbo-instruct model.

Experiments

Zero-shot LLM Classification

The first and obvious method to try is zero-shot classification: ask an LLM to generate a response, given a prompt of the following form:

Tweet: {tweet_text}
Sentiment:

Then we just match the responses to Positive, Negative, or Neutral (treating anything other than these three as Unknown). For zero- and few-shot, we are not using the training set at all. Let's see how it performs on our test set.
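As a rough sketch (assuming the pre-1.0 openai Python client and the legacy completions endpoint; the exact parameters in the notebook may differ), the zero-shot loop looks something like this:

import openai

LABELS = ["Negative", "Neutral", "Positive"]

def classify_zero_shot(tweet_text: str) -> str:
    prompt = f"Tweet: {tweet_text}\nSentiment:"
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    completion = response["choices"][0]["text"].strip().lower()
    # Match the free-text completion back to one of our discrete classes
    for label in LABELS:
        if label.lower() in completion:
            return label
    return "Unknown"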

Accuracy: 0.509

Recall

Negative    0.729323
Neutral     0.271335
Positive    0.689531
dtype: float64

Precision

Negative    0.557471
Neutral     0.635897
Positive    0.502632
dtype: float64
Confusion matrix (rows: true labels, columns: predicted labels)

            Negative  Neutral  Positive  Unknown
Negative         194       31        17       24
Neutral          123      124       172       38
Positive          31       40       191       15
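The confusion tables in this post can be produced with a simple cross-tabulation; a sketch, with placeholder argument names:

import pandas as pd

def confusion_table(y_true, y_pred):
    # Rows: true labels, columns: predicted labels (including Unknown)
    return pd.crosstab(
        pd.Series(y_true, name="actual"),
        pd.Series(y_pred, name="predicted"),
    )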

Well, it's not perfect. It's alright – there's decent recall for the Negative and Positive cases. However, many cases end up as Unknown. This is one of the major drawbacks: out of the box, LLMs can generate pretty much anything, and matching a free-text response to our discrete classes is not always possible.

On top of that, processing 1000 test cases took 4 minutes and 12 seconds. The number of processed tokens is ~44,000, which comes to about $0.066 (and that's just for one run – I've actually executed the whole notebook a few times while experimenting; I'll drop the figure of how much I've spent on OpenAI's API while writing this article at the end). Let's see if we can improve on that with few-shot classification.

Few-shot LLM Classification

What's the usual next suggestion? Improve the prompt and give the model some examples to work with – few-shot learning. So, taking one example of each class from the training set, we'll use the following template.

Tweet: <negative training sample>
Sentiment: Negative
Tweet: <neutral training sample>
Sentiment: Neutral
Tweet: <positive training sample>
Sentiment: Positive
Tweet: {tweet_text}
Sentiment:
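In code, assembling this prompt might look roughly like the following (the per-class examples are placeholders for the actual training tweets):

# One hand-picked training example per class; placeholder texts here
EXAMPLES = {
    "Negative": "<negative training sample>",
    "Neutral": "<neutral training sample>",
    "Positive": "<positive training sample>",
}

def build_few_shot_prompt(tweet_text: str) -> str:
    shots = "".join(f"Tweet: {tweet}\nSentiment: {label}\n" for label, tweet in EXAMPLES.items())
    return shots + f"Tweet: {tweet_text}\nSentiment:"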

Then we do the same matching procedure. So, let's see the performance on the test set.

Accuracy: 0.574

Recall

Negative    0.710526
Neutral     0.553611
Positive    0.476534
dtype: float64

Precision

Negative    0.564179
Neutral     0.569820
Positive    0.597285
dtype: float64
Confusion matrix (rows: true labels, columns: predicted labels)

            Negative  Neutral  Positive  Unknown
Negative         189       71         6        0
Neutral          121      253        83        0
Positive          25      120       132        0

Well, the performance is somewhat better. The accuracy improved, though the recall for the Positive class got worse. We got rid of the Unknown predictions, but it's still not great.

The time and costs, though, are 4 minutes 37 seconds and 161k tokens * $0.0015 per 1k = $0.2415! So, we've still got a subpar sentiment classifier that takes significant time and costs over 20 US cents to classify 1000 tweets. Can we do better? Why yes – let's turn back to some classical ML methods instead.

Classical ML Methods for NLP

One of the standard approaches to sentiment analysis is to turn text into vectors somehow (we'll use one of the simplest approaches – TF-IDF vectorisation) and feed them to some classification model (we'll use an SVM and a Decision Tree). I thought that taking methods that are usually covered in intro undergrad ML and NLP courses, rather than something sophisticated, would make my point even clearer.
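A minimal sketch of this pipeline in sklearn – the file and column names are assumptions on my part, and I use LinearSVC here as the support vector classifier; the actual notebook may differ:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# File and column names are assumptions; adjust them to the actual Kaggle CSVs
train = pd.read_csv("twitter_training.csv", names=["id", "entity", "sentiment", "text"])
test = pd.read_csv("twitter_validation.csv", names=["id", "entity", "sentiment", "text"])

# Turn tweets into sparse TF-IDF vectors (fit on the training set only)
vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train["text"].fillna(""))
X_test = vectoriser.transform(test["text"].fillna(""))

# Fit both classical classifiers on the same features
svc = LinearSVC().fit(X_train, train["sentiment"])
tree = DecisionTreeClassifier().fit(X_train, train["sentiment"])

svc_preds = svc.predict(X_test)
tree_preds = tree.predict(X_test)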

Let's see what we get with those. First of all, vectorisation: TF-IDF takes 1.48 seconds to process both the training and test sets (a total of 75,000 tweets!).

Training and inference times for the SVC and the Decision Tree are recorded below:

Method  Training  Inference
SVC     3.28s     1.44ms
Tree    23.3s     2.4ms

Blazing fast! And that's without using any GPU. It will even run on a few-year-old laptop without any issues. Let's now have a look at whether we lose anything in terms of performance.

Method  Accuracy  Recall (Neg/Neu/Pos)  Precision (Neg/Neu/Pos)
SVC     0.85      0.86/0.85/0.83        0.84/0.87/0.82
Tree    0.92      0.94/0.91/0.91        0.88/0.93/0.92

On the contrary, the performance is way better. The Decision Tree is better than the SVC on all scores, but it takes more time to train (albeit still under a minute). And it cost me virtually nothing (I ran everything inside a Colab notebook). So, what does this give us?

Results

Performance:

Method         Accuracy  Recall (Neg/Neu/Pos)  Precision (Neg/Neu/Pos)
Zero-shot LLM  0.51      0.73/0.27/0.69        0.56/0.64/0.50
Few-shot LLM   0.57      0.71/0.55/0.47        0.56/0.57/0.60
SVC            0.85      0.86/0.85/0.83        0.84/0.87/0.82
Tree           0.92      0.94/0.91/0.91        0.88/0.93/0.92

Times and costs:

Method         Time training  Time inference  Cost
Zero-shot LLM  n/a            5min 51s        44k tokens * $0.0015/1k = $0.066
Few-shot LLM   n/a            5min 42s        161k tokens * $0.0015/1k = $0.2415
TF-IDF + SVC   4.76s          1.44ms          Practically zero
TF-IDF + Tree  24.8s          2.4ms           Practically zero

Well, both classical methods surpass the LLM methods in test performance, time taken, and cost. Huh.

Discussion

One could say that I didn't tune the LLM methods well enough, and that I could improve the outcome with prompt engineering, further tweaking, or even fine-tuning. But that's the point. I've already spent $3.75 on the OpenAI API today. Why spend even more if I can achieve rather good results with simple sklearn-provided models (and I could make them even better by spending time tweaking the SVM or the Decision Tree, yay)?

One could say that I could use some open LLM from the Hugging Face Hub and host it on my own hardware instead of paying for the API. Sure, but I could easily run the SVM on a Raspberry Pi Zero, whereas the smallest transformer model capable of such a task would require quite a bit more compute (GPUs are recommended, and a significant amount of RAM is needed). And that compute power still costs money, you know.

One could say that I picked a task that's just not very suitable for LLMs. Exactly – that's the point. You can certainly try using an LLM for it, and in some cases it might work, but it will likely not be very efficient. There are other tasks at which LLMs are very capable and other methods just don't do a good job, sure enough. So let's use LLMs for those tasks instead.

The main point I want the reader to take away is this: before grabbing an LLM to solve the new task at hand, consider whether there's a good, simple, lightweight, and well-researched method for the thing you want to do. Maybe there indeed is.

This post is directed at nobody specifically, but inspired by frequent misuse of technology by people on the Internet.

The code for this set of experiments is available in the Colab Notebook (provide your own dataset files and OpenAI API keys).
