Aug 30, 2023

LLM infra for indies on the cheap

Open Source AI Game Jam Postmortem

itch.io page: Conspiracy Catalyst
Github repo: d-lowl/conspiracy-catalyst

Last weekend I took part in the Open Source AI Game Jam organised by Huggingface. The idea of the jam was to use some open-source AI tools (I prefer to call this "ML" in this context, though) to create a game. It could either be a tool used in the pipeline (e.g. code or asset generation), or something within the game itself (people used it for TTS and STT, procedural graphics and dialogue; interesting stuff). There was also a theme, but it was quite broad, so it was not a limiting factor.

What was the plan?

I'd had an idea for quite a while by that point: make a simulator game where you write posts on a Twitter-like website, and some ML-based agents reply to and 'rate' what you write based on some internal values. The inspiration comes from a few places, one of them being Democracy, where you have a set of population groups that react differently to your actions.

Given the jam's theme -- Expanding -- I thought it'd be cool to try building an SMM sim for a secret organisation/cult (whose intentions are unknown from the start). Your task would be to expand your influence by attracting followers on the social media platform (unsuspiciously named Critter). The core pipeline I imagined looked like this:

Player inputs a post text
For each target group: pass a prompt containing this group's values and the player's text through an LLM to generate the group's reply
Pass these replies through some sentiment analyser, to determine how positive the reaction to the player's post is
Turn the sentiment score into a number of followers the player gains or loses
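The four steps above can be sketched as a single turn function. This is only an illustration under stated assumptions: the `Group` fields, the `reach` factor, the sentiment range of -1 to 1, and the linear follower formula are all hypothetical, and the LLM and sentiment analyser are passed in as plain callables rather than tied to any particular library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Group:
    name: str
    values: str  # free-text description of what the group cares about
    size: int    # hypothetical pool of potential followers in this group


def build_prompt(group: Group, post: str) -> str:
    # Hypothetical prompt wording; the real prompt is not shown in the post.
    return (
        f"You are a member of a group that values: {group.values}.\n"
        f"Write a short reply to this post: {post}"
    )


def follower_delta(sentiment: float, group_size: int, reach: float = 0.1) -> int:
    # Clamp to [-1, 1] so a misbehaving analyser cannot blow up the score.
    sentiment = max(-1.0, min(1.0, sentiment))
    # The sign of the change always matches the sign of the sentiment,
    # so follower gain never contradicts the tone of the reply.
    return round(sentiment * group_size * reach)


def play_turn(
    post: str,
    groups: List[Group],
    generate_reply: Callable[[str], str],     # wraps whatever LLM is used
    score_sentiment: Callable[[str], float],  # assumed to return -1.0 .. 1.0
) -> Tuple[List[str], int]:
    replies, total = [], 0
    for group in groups:
        reply = generate_reply(build_prompt(group, post))
        replies.append(reply)
        total += follower_delta(score_sentiment(reply), group.size)
    return replies, total
```

Keeping the model calls behind two callables means the game logic can be exercised with stand-in functions long before any model is hosted.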

So, there were a few major tasks that needed to be solved during the jam:

What models to use for replies and sentiment analysis
How to construct prompts and values for groups so they are interesting and self-consistent
Do something with the sentiment score to turn it into followers in a fair and consistent way (i.e. follower gain should not contradict the replies)
Where and how to host all of these (for cheap too, I didn't want to spend much money on the jam)
Do a front-end for the game, and make it, you know, pretty and at least somewhat fun

Well, I'm writing this because not everything went according to the plan above. The rest of the postmortem is reconstructed from a loosely kept log.

So, what went wrong?

Sat, 09:58

For the replies I sure need an LLM of some sort. I didn't want to use GPT by OpenAI for a few reasons:

it's not open-source; using it was allowed, but then I would have needed another open-source ML thingy in there, which at the time I was not sure I would have
I do not want to spend that much money on an API
I do not want to ask players to bring their own API keys (even though that's what many such demos do)

I went through the other LLMs supported by Langchain, since it is a tool I have some experience with and I knew it would simplify development a bit. After a bit of reading, the choice was OpenLLM by BentoML. It allows self-hosting by just running the executable, and it has a bunch of models of different sizes. I downloaded the smallest versions of the supported bunch. The decision was to continue with dolly-v2-3b, since it was the only one I could coax into giving decent results for the prompts I'd written, even though I had to twist and turn the parameters quite a bit.

So far so good:

I have a model and a prompt (with one set of group values, the most boring of those planned, to be fair)
I've had brunch

Now I need to think about how I deploy these things. About 6-7 hours have passed since the start.

Sat, 5-6 pm

It was time to host the model and the Python server code somewhere. I wanted to avoid deploying it in the cloud (and surely not on K8s): it felt too expensive (I hadn't actually looked at the prices at the time), and like too much infrastructure overhead for just one Python script that just calls the model (just, lol).

The first idea that I had was to just push it to the VPS that I already have, and execute OpenLLM the same way I did on my laptop. Sounds easy enough. Here's a list of things that happened (I have no exact record of how it played out, but roughly):

It's an Ubuntu 22.04 machine, and the system Python version conflicted with what I had used locally; some dependencies needed to be rejigged
I had to up the RAM on the VPS from 8 gigs to 24 gigs, because for some reason OpenLLM errored out saying it was out of memory. Locally it took around 6 gigs, so 8 should have been enough.
The machine had only 20 gigs of storage; the model is 5.6 gigs, and about 12 gigs were already used by the system. Surely that's enough to download a model from the hub? Nope, because it's written to the filesystem twice. First, it's downloaded into the BentoML cache. Second, it's copied again to the Huggingface transformers cache.
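The storage problem is easy to see as arithmetic. A small sketch using only the numbers quoted above (the function name is made up for illustration):

```python
def free_space_after_download(disk_gb: float, used_gb: float,
                              model_gb: float, copies: int) -> float:
    """Space left (in GB) once `copies` copies of the model are on disk."""
    return disk_gb - used_gb - model_gb * copies


# One copy of the 5.6 GB model would fit into the 8 GB that were free...
one_copy = free_space_after_download(20, 12, 5.6, copies=1)
# ...but two copies (one per cache) need 11.2 GB, which does not.
two_copies = free_space_after_download(20, 12, 5.6, copies=2)

print(round(one_copy, 1), round(two_copies, 1))
```

A negative result for the second case means the disk fills up partway through the second write.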

And so on. After a few hours of trying to make OpenLLM work on my VPS, I gave up. Time to have dinner.

Sidenote: BentoML has what they call BentoCloud, but it's in a closed beta right now. I've applied for access, but I haven't got it yet at the time of writing.

Sat, ~9 pm

It's time to change strategy. I briefly considered what I could still do with OpenLLM. One alternative option was to build a BentoML container and deploy it on Google Cloud or Azure (I had some free credits on the latter). But first, I wanted to dumb down my solution and swap to using the Huggingface transformers pipeline directly in my service. Thankfully Langchain supports it too, and swapping OpenLLM for HF was a matter of a couple of lines of code.

Swap. Test that the pipeline still works fine. A few more experiments with the prompt. Happy that it works alright. Now back to the VPS.

And long story short, what I hadn't considered was what CPU my VPS has. That instance had 4 cores, but according to top the pipeline only used 1 for some reason (everything was utilised on my laptop, huh). Inference time on my laptop was about 30 seconds; on the VPS it was about 10 minutes. A 20-times difference.
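The post does not say why only one core was used, so the following is just a quick check one could run on both machines to compare them, not a diagnosis. It reports how many CPUs the process may use and whether the thread-count environment variables commonly honoured by OpenMP and MKL based numerical backends are set. The helper names are made up for illustration.

```python
import os


def cpu_report() -> dict:
    """Collect basic facts about CPU availability for the current process."""
    report = {"logical_cpus": os.cpu_count()}
    # sched_getaffinity only exists on some platforms (e.g. Linux); it shows
    # how many CPUs this process is actually allowed to run on.
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    # A value of "1" here would pin many numerical libraries to one thread.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        report[var] = os.environ.get(var)
    return report


def slowdown(baseline_seconds: float, observed_seconds: float) -> float:
    """How many times slower the observed run is than the baseline."""
    return observed_seconds / baseline_seconds


print(cpu_report())
print(slowdown(30, 10 * 60))  # laptop: ~30 s, VPS: ~10 min
```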

So this was not a solution I was looking for. With this inference time, the game is basically unplayable.

It was nearing midnight; out of ideas, I went to sleep.

Sun, 10 am

Time for plan C. During my experimentation, I tried using the Huggingface Inference API, but it was pretty unreliable (long queues, long inference time). However, what I hadn't considered initially was Huggingface Inference Endpoints.

The claim is a one-click deployment of any model on their Hub. However, there's no way (at least none that I could find) to pass additional parameters to the pipeline. Annoyingly, dolly-v2 needs the trust_remote_code flag set, since it ships an additional instruction pipeline script. Literally the only way I could find to resolve that was to fork the model repo and add a custom handler, just to pass one additional flag.
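For reference, the general shape of such a custom handler follows the handler.py convention from the Inference Endpoints documentation: a class named EndpointHandler whose constructor receives the path to the model files and whose call method receives the request payload. The sketch below is an assumption about what the file could look like, not the handler from the forked repo, and it says nothing about how the endpoint runner treats the model's extra script.

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Imported lazily so this module can be read and unit-tested on a
        # machine without transformers installed.
        from transformers import pipeline

        # trust_remote_code=True is the one extra flag this handler exists for.
        self.pipeline = pipeline(
            "text-generation",
            model=path,
            trust_remote_code=True,
        )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Requests arrive as {"inputs": ..., "parameters": {...}}.
        inputs = data.get("inputs", "")
        parameters = data.get("parameters") or {}
        return self.pipeline(inputs, **parameters)
```

Generation settings sent under "parameters" are forwarded to the pipeline as keyword arguments, so they can be tuned per request without redeploying.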

Sun, 12:50

I have got the model deployed and replying back with a prediction. Inference Endpoints is actually quite usable. The inference time is also good, about 1 to 2 seconds. However, the results are way worse than for the same pipeline locally. The next two hours were spent trying to figure out why the results were so different (there was a lot of repetition of the prompt in the response, and it didn't sound like an answer at all). I tried larger versions of dolly-v2. I tried a bunch of different model parameters. Until I realised that the response didn't actually contain certain tokens one would expect from it.

Remember I mentioned that dolly-v2 comes with an additional script for the pipeline? Well, apparently the runner of Inference Endpoints ignored it entirely, and I was running the model without the intended instruction prompt this whole time. After I patched this on the client side, the model responses started to make sense again. Wow. At this point in time, I finally have an LLM that I can query and use in the game.
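A minimal sketch of what a client-side patch of this kind can look like: wrap the instruction in the prompt template before sending it, then cut the answer out of the raw generation. The marker strings below are reproduced from memory of the instruct_pipeline.py script in the dolly-v2 repo and should be treated as an assumption to verify against the model repo; the function names are made up for illustration, and this is not the project's actual code.

```python
# Assumed to match dolly-v2's instruct_pipeline.py; verify before relying on it.
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"


def wrap_instruction(instruction: str) -> str:
    """Build the full prompt the model was fine-tuned to expect."""
    return (
        f"{INTRO_BLURB}\n\n"
        f"{INSTRUCTION_KEY}\n{instruction}\n\n"
        f"{RESPONSE_KEY}\n"
    )


def extract_response(generated: str) -> str:
    """Keep only the text between the response marker and the end marker."""
    _, found, tail = generated.partition(RESPONSE_KEY)
    text = tail if found else generated  # fall back to the whole output
    return text.partition(END_KEY)[0].strip()


prompt = wrap_instruction("Reply to this post as a suspicious neighbour.")
raw = prompt + "Hmm, who is funding this?\n\n### End\nleftover tokens"
print(extract_response(raw))
```

Without the wrapper, an instruction-tuned model sees plain text to continue, which fits the symptom of the prompt being repeated back instead of answered.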

Sun, 17:58

I've got about 3 hours before the submission deadline, and all I have is a very crude prompt and a model. In the next two and a half hours I did the following:

The Socket.IO server and client, so that the game would run on the server and respond back to the player's browser
A really bare frontend for the game. Monospace default font. Black and white bordered boxes. Very minimal styling. Pure HTML and JS though, which I do like
Deployed the Python server to Render.com and resolved CORS issues (of course there would be CORS issues when your client is on itch.io, why didn't I think of it)
Packed the game up, and made it work on the itch.io page
Wrote a description for the game page (2/3 of which was written by ChatGPT though, so yeah)
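On the CORS item: the browser sends the page's origin with each cross-site request, and the server has to answer with a matching Access-Control-Allow-Origin header or the browser blocks the response. The sketch below shows that decision in plain Python. The origin URL is a placeholder, not the real one the game is served from, and how the project actually configured this is not stated. Socket.IO server libraries usually expose it as a setting instead (python-socketio, for instance, accepts a cors_allowed_origins argument).

```python
from typing import Dict, Optional

# Hypothetical allow-list; replace with the origin the game page is served from.
ALLOWED_ORIGINS = {"https://game-host.example"}


def cors_headers(request_origin: Optional[str]) -> Dict[str, str]:
    """Return the CORS response headers for a request from `request_origin`."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            # Echo the specific origin rather than "*", so only known
            # pages can talk to the game server.
            "Access-Control-Allow-Origin": request_origin,
            # Tell caches that the response depends on the Origin header.
            "Vary": "Origin",
        }
    # No CORS headers: the browser will refuse to expose the response.
    return {}


print(cors_headers("https://game-host.example"))
print(cors_headers("https://somewhere-else.example"))
```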

What I did want to do, but didn't:

Actually create multiple different groups with different values
Slicker UI with at least some basic animations
Tune how sentiment analysis works, and how the followers are calculated; unfortunately this bit is very unbalanced in the end
Do some graphics; profile pics, organisation logo, etc.

Sun, 21:05

The game's submitted. I wrote the promo text for the Discord. Dropped links to the itch.io page here and there. I was quite annoyed with technology at the time too. Well, this blog post is a way of reflecting on that.

Instead of conclusions

I have had fun, and I have got a game in the end. It's not a very fun game, and I could have actually spent the time making it fun instead of battling with LLMs and infrastructure, alas. Note to future self: I knew what kind of project I was going to do, so I should have prepared the infra beforehand (or at least tried out how it could be done).

Some people know that I have been experimenting with generative fiction for gamedev for some time now, but I still have no demo, because 'there are these little things to fix, and then it's ready'. The fact that I have a result from the jam is good: I have a baseline now, and the tight deadline helps, apparently. I'm actually planning to participate in another jam in two weeks, so stay tuned.

A bit of a side note: when I was reflecting on what actually made this jam frustrating, I remembered a few hackathons that I participated in a while back. At those events sponsors and organisers often gave free credits for their services and APIs so teams could use them in their hacks. Maybe if Huggingface could give some Inference Endpoints vouchers for the duration of the event (there was a sane number of participants), people wouldn't need to use closed APIs, and wouldn't be hesitant to host their models with something like Inference Endpoints? (wink-wink)

So, it was fun, will do it again. LLMs can do interesting stuff if you actually spend time on them. Github repo for the project is coming.