In the world of large language models (LLMs), the cost of computation can be a significant barrier, especially for large projects. I recently embarked on a project that required running 4,000,000 prompts with an average input length of 1,000 tokens and an average output length of 200 tokens. That's nearly 5 billion tokens! The traditional approach of paying per token, as is common with models like GPT-3.5 and GPT-4, would have resulted in a hefty bill. However, I discovered that by using open source LLMs, I could shift the pricing model to paying per hour of compute time, which led to substantial savings. This article details the approaches I took and compares them. Please note that while I share my experience with pricing, prices are subject to change and may vary depending on your region and specific circumstances. The key takeaway is the potential cost savings from using open source LLMs and renting a GPU by the hour, rather than the specific prices quoted. If you plan on using my recommended solutions in your project, I've left a couple of affiliate links at the end of this article.
I conducted an initial test using GPT-3.5 and GPT-4 on a small subset of my prompt input data. Both models performed well, but GPT-4 outperformed GPT-3.5 in the majority of cases. To give you a sense of the cost, running all 4 million prompts using the OpenAI API would look something like this:
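As a rough sketch of the arithmetic behind an estimate like this, the per-token cost can be computed directly from the workload figures above. The per-1K-token rates in the snippet are assumptions chosen for illustration (they are not quoted in this article, and API prices change over time), and the function name `api_cost_usd` is hypothetical. With these assumed rates, the total comes to roughly $7,600.

```python
# Workload figures taken from the article.
NUM_PROMPTS = 4_000_000
AVG_INPUT_TOKENS = 1_000
AVG_OUTPUT_TOKENS = 200


def api_cost_usd(input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the total pay-per-token cost in USD for the whole workload."""
    input_tokens = NUM_PROMPTS * AVG_INPUT_TOKENS
    output_tokens = NUM_PROMPTS * AVG_OUTPUT_TOKENS
    input_cost = (input_tokens / 1_000) * input_price_per_1k
    output_cost = (output_tokens / 1_000) * output_price_per_1k
    return input_cost + output_cost


# Total tokens processed: inputs plus outputs.
total_tokens = NUM_PROMPTS * (AVG_INPUT_TOKENS + AVG_OUTPUT_TOKENS)
print(f"Total tokens: {total_tokens:,}")

# ASSUMED illustrative rates in USD per 1,000 tokens (not from the article).
ASSUMED_INPUT_PRICE_PER_1K = 0.0015
ASSUMED_OUTPUT_PRICE_PER_1K = 0.002

estimate = api_cost_usd(ASSUMED_INPUT_PRICE_PER_1K, ASSUMED_OUTPUT_PRICE_PER_1K)
print(f"Estimated API cost: ${estimate:,.2f}")
```

Swapping in a different pair of rates gives the estimate for another model. Because input and output tokens are usually priced separately, the 1,000-token prompts dominate the bill here even though outputs tend to cost more per token.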
While GPT-4 did offer some performance benefits, its cost was disproportionately high compared to the incremental improvement it added to my outputs. Conversely, GPT-3.5 Turbo, although more affordable, fell short in terms of performance, making noticeable errors on 2–3% of my prompt inputs. Given these factors, I wasn't prepared to invest $7,600 in a project that was…