Buy vs. Build for an Enterprise: “ChatGPT” vs. Open-Source vs. QLoRA

Dr. Santanu Bhattacharya
Published in DataDrivenInvestor · 12 min read · Jul 18, 2023


As the world gets taken over by ChatGPT, promising to revolutionize everything from education and healthcare to business and commerce, a natural question arises: how does an enterprise such as a bank, healthcare provider, telecommunications company, or retailer decide on implementing such systems in its business? Specifically, should it use third-party Large Language Models (LLMs) like ChatGPT or build the infrastructure internally? How costly is ChatGPT vs. open-source LLMs? What are the tradeoffs in terms of performance, ongoing costs, risks, etc.?

Photo 1: In a city, people build with LEGOs. Each brick: a choice — open-source LLMs or ChatGPT. As they connect, decisions form a towering structure. Source: Stable Diffusion

TLDR: Before implementing language models like ChatGPT, regulated industries must carefully consider several risks related to data privacy, AI model bias, observability, and customer trust, particularly when handling private customer data. These concerns are critical in maintaining compliance and protecting sensitive information.

In cases where it makes sense to use ChatGPT, such as summarizing a company’s product offerings for chatbot queries, there is a usage threshold below which ChatGPT is the economical choice. ChatGPT is less expensive than open-source LLMs deployed on AWS when the volume is on the order of thousands of requests per day. However, as the request volume escalates to millions per day, the economics shift, and deploying open-source models on AWS becomes the more affordable option, given the current pricing structures for both ChatGPT and AWS.

Since the introduction of the Transformer in 2017, followed by breakthrough models like BERT, GPT, and BART, LLMs have been transforming the way we communicate and interact with technology, and their impact is being felt across industries and borders.

A few years back, researchers from OpenAI and Google demonstrated in multiple papers that LLMs with tens of billions of parameters are increasingly sophisticated, able to understand complex language structures and generate human-like responses.

Indeed, LLMs are rapidly gaining prominence and becoming ubiquitous across domains. They are poised to revolutionize numerous aspects of our lives by powering virtual assistants, chatbots, content-creation tools, and translation services, among other applications. The capacity of these models to efficiently process and analyze massive volumes of data, coupled with their ability to learn from it, is reshaping the way we work, learn, and communicate.

Now that many enterprises are considering integrating LLMs into their processes, functions, and workflows, the question is how to approach them.

Consideration 1: Strategic Choice between ChatGPT and Open-source LLMs

There are several considerations associated with using ChatGPT or other third-party LLMs in regulated industries such as banking, healthcare, or telecommunications, especially when using customer data to train them. ChatGPT was trained largely on public data from the internet, making it susceptible to the following risks:

Privacy Risks: Many LLMs were trained on public data that is not privacy-preserving. Note the recent lawsuit by Getty Images against Stable Diffusion, alleging that 12 million images from Getty’s repository were used “without permission.. or compensation”. Expect such lawsuits to proliferate, potentially exposing LLM providers’ enterprise customers, too.

Bias Risks: Having been trained on public data, many third-party LLMs cannot guarantee that the models are unbiased, exposing their enterprise customers to the risk of regulatory fines and penalties.

Customer Data Concerns: Customers may worry about a third-party LLM being exposed to their sensitive financial information, for example, even if ChatGPT is used within a private instance at their bank. Similarly, most people will be seriously concerned about LLMs having access to their private health data.

Lack of Trust: Trust is eroding due to a combination of factors such as deepfakes. In 2021, a UAE company lost $35M after fraudsters deepfaked a company director’s voice, and such cases are likely to rise sharply. How do banks protect customers’ trust if such events happen? What if, on top of such news, bank customers wonder what their personal data is being used for, especially training third-party LLMs?

Observability Risks: Because it is hard to “observe” such large models, the relationship between inputs and outputs is not known, and attribution of cause and effect is all but impossible.

Irreversibility of Harm Due to the “Black-box” Model: As the models are black boxes and not observable, harms are not known (except in obvious cases) and hence are irreversible.

Liability: Who is liable for the harm caused by an LLM: the model creator (say, Meta for LLaMA) or, say, a bank that tweaked it? Consider the analogy of a car crash caused by a spoiler flying off. Is it the OEM who manufactured the spoiler that flew off and hit another car, the dealership that may have installed it improperly, or the user who may have modified it?

However, there are areas where the use of ChatGPT is relatively safe. For example, using ChatGPT to summarize a company’s product offerings to better answer customer support questions involves no customer data, nor is it likely to raise any privacy concerns.

If an enterprise decides to explore the option of using ChatGPT, then comes the question of cost.

Consideration 2: Cost Comparison between ChatGPT and Open-source LLMs

Option 1: ChatGPT API Costs

The ChatGPT API today costs $0.002 per 1K tokens, and a token is about three-quarters of a word. The number of tokens in a single request is the sum of the prompt and the generated output, converted from words to tokens.

Let’s assume an enterprise processes incoming customer queries through an interactive chatbot using ChatGPT. A question and answer is roughly a page: about 500 words, or ~666 tokens. Answering 5,000 customer queries per day then costs ($0.002/1,000) × 666 × 5,000 ≈ $6.7 a day, or ~$200 a month.

But what happens if each customer takes 4–5 prompts to get the answer they are looking for? After all, they are not necessarily well-trained prompt engineers accustomed to writing finely tuned queries. And assume we have 200 thousand such queries coming in per day, about normal for the contact center of a major brand.

Under this scenario, the cost skyrockets to ~$0.5 million per year, making ChatGPT a significant expense for an enterprise business! No wonder Venture Capitalists (VCs), accustomed to identifying opportunities for making money, are pouring billions of dollars into “ChatGPT for X” ideas.
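
To make the arithmetic concrete, here is a minimal back-of-the-envelope cost model (a sketch in Python; the $0.002 per 1K token price and the 3/4-word-per-token conversion are the assumptions from above):

```python
# Back-of-the-envelope ChatGPT API cost model (a sketch; prices as of mid-2023).
PRICE_PER_1K_TOKENS = 0.002   # USD per 1K tokens, gpt-3.5-turbo pricing at the time
TOKENS_PER_WORD = 4 / 3       # a token is ~3/4 of a word

def daily_cost(words_per_exchange, prompts_per_query, queries_per_day):
    """Estimated USD per day for a chatbot that calls the ChatGPT API."""
    tokens = words_per_exchange * TOKENS_PER_WORD * prompts_per_query * queries_per_day
    return tokens / 1000 * PRICE_PER_1K_TOKENS

# Scenario 1: one 500-word Q&A exchange per query, 5,000 queries/day
print(f"${daily_cost(500, 1, 5_000):.2f}/day")              # ~$6.7/day, ~$200/month

# Scenario 2: 4-5 prompts per query, 200,000 queries/day
per_day = daily_cost(500, 4.5, 200_000)
print(f"${per_day:,.0f}/day, ${per_day * 365:,.0f}/year")   # ~$1,200/day, ~$0.44M/year
```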

Option 2: Open-source Large Language Models

2a: Open-source LLM Costs: Factors and Model Dependency

For enterprises looking for an open-source LLM, options are aplenty. My former employer, Meta (Facebook), pioneered LLaMA, with model sizes ranging from 7 to 65 billion parameters. According to the LLaMA developers, the 13-billion-parameter model outperformed the considerably larger GPT-3, with 175 billion parameters, on the majority of NLP benchmarks. Later, teams at Stanford University fine-tuned the 7B version of LLaMA on 52K instruction-following demonstrations and found that their Alpaca model performed comparably to GPT-3.

While open-source models are free to use, the infrastructure to host and deploy them is not. Earlier models such as BERT were less resource-intensive and could be fine-tuned using low-end GPUs. However, more recent LLMs such as LLaMA are far more resource-intensive.

Most common LLMs, such as GPT-3 and BERT, are based on transformer architectures. While the exact number of operations required for inference and training is model-specific, the rule of thumb for transformers is that a forward pass (i.e., inference) for a model with p parameters takes approximately 2·p floating-point operations (FLOPs) per token, or 2·n·p FLOPs for a sequence of n tokens; the backward pass during training adds roughly four more FLOPs per parameter per token, for a total of about 6·p FLOPs per token. Therefore, a user can approximate the total cost by multiplying the number of tokens by the parameter count and the appropriate factor (2 for inference, 6 for training).

Memory requirements for transformers are model-dependent, too. For inference, the p model parameters need to fit into memory. For back-propagation during training, additional intermediate values per parameter must be held between the forward and backward passes; with 32-bit floating-point numbers, the weights, gradients, optimizer state, and activations together come to roughly 32 bytes per parameter. Training a 175-billion-parameter model would therefore need on the order of 32 × 175×10⁹ bytes, or nearly 6 TB, of memory. This exceeds any single GPU in existence today and hence requires the model to be split across multiple memory cards.

Using the formulas shown above, Table 1 summarizes the training and inferencing requirements for BERT and GPT-3. As a refresher, training requires ~6·p FLOPs per token of training data, while inference requires ~2·p FLOPs per token, with p being the number of parameters.

Table 1: Comparison of Training and Inferencing between BERT and GPT-3
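
As a rough sketch, the rules of thumb above can be turned into a small estimator. The parameter counts below are the published figures for BERT-Large and GPT-3; the training-token counts are assumptions for illustration only:

```python
# Rough FLOPs and memory estimator from the transformer rules of thumb above.
def train_flops(params, tokens):
    """Training: ~6 FLOPs per parameter per training token (forward + backward)."""
    return 6 * params * tokens

def infer_flops(params, tokens):
    """Inference: ~2 FLOPs per parameter per token of a forward pass."""
    return 2 * params * tokens

def train_memory_tb(params, bytes_per_param=32):
    """Weights + gradients + optimizer state + activations at ~32 bytes/parameter."""
    return bytes_per_param * params / 1e12

models = {
    "BERT-Large": (340e6, 3.3e9),   # 340M params; ~3.3B training tokens (assumed)
    "GPT-3":      (175e9, 300e9),   # 175B params; ~300B training tokens
}
for name, (p, n) in models.items():
    print(f"{name}: training ~{train_flops(p, n):.1e} FLOPs, "
          f"inference ~{infer_flops(p, 1):.0e} FLOPs/token, "
          f"training memory ~{train_memory_tb(p):.2f} TB")
```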

How do these computational complexities translate into cost? One can go further and translate the infrastructure requirements into specific GPUs, algorithmic optimizations (for example, 16-bit vs. 32-bit floating-point numbers), and the number of training passes required for the model to stabilize. Our literature survey indicates the result can vary by a factor of 10: between $0.5M and $5M for training GPT-3. This is a large sum requiring significant investment that only large companies or well-funded startups can afford.

2b: Open-Source LLM: Architecture for Deploying Open-source Models

Short of building one’s own data center to host, train, and deploy LLMs, a more practical solution for enterprises is to use cloud providers like AWS, Google, Azure, or smaller providers such as Lambda Labs. Many enterprises such as banks, healthcare providers, and telcos already have deep relationships with the Big-3 cloud providers, making this an attractive option for them. Given our familiarity with AWS, we will use the AWS infrastructure as an example.

Let’s dive into the AWS costs for hosting open-source models and serving them as APIs, which usually involves four steps:

  • A client device, for example a browser, issues a customer request, which is passed through Amazon’s API Gateway
  • The API Gateway triggers a Lambda function, which parses the request and sends it to an AWS SageMaker endpoint
  • The model is then invoked at the endpoint using AWS SageMaker (a minimal sketch of the Lambda step follows below)
  • The model’s response is returned through Lambda and the API Gateway back to the client
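
For illustration, a minimal sketch of the Lambda step might look like the following; the endpoint name and the JSON payload shape are assumptions, not a prescribed interface:

```python
# Sketch of a Lambda handler that forwards a chatbot request to a SageMaker endpoint.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "flan-ul2-endpoint"  # hypothetical endpoint name

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the client request in event["body"]
    body = json.loads(event["body"])
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": body["prompt"]}),  # assumed payload shape
    )
    result = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(result)}
```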

SageMaker costs are sensitive to the type of compute instance used to host the model, as LLMs require rather large instances.

For example, this article, written by Heiko Hotz, details how to deploy Flan-UL2, a 20-billion-parameter model, on AWS:

The article uses the ml.g5.4xlarge instance for deploying Flan-UL2. While the SageMaker pricing page did not list the cost for this specific instance, it appears to cost ~$5–6 per hour, or ~$150 per day! In addition, Lambda and API Gateway pricing add roughly $10 and $1 per million requests, respectively.

So ultimately, hosting an open-source LLM like Flan-UL2 on AWS costs about $150 a day at 1,000 requests a day and about $160 a day at 1M requests a day, or roughly $58,000 a year. By comparison, 1M single-prompt queries a day on the ChatGPT API would run about $500,000 a year.
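
The break-even point between the two options is easy to estimate with a short script; the per-request ChatGPT cost and the flat AWS hosting figure below are taken from the estimates above and are approximations, not quotes:

```python
# Sketch: break-even between per-token ChatGPT pricing and flat AWS hosting.
CHATGPT_COST_PER_REQUEST = 666 / 1000 * 0.002   # ~$0.0013 for a 500-word exchange
AWS_HOSTING_PER_DAY = 150.0                      # ml.g5.4xlarge, ~$5-6/hour (estimate)
AWS_PER_MILLION_REQUESTS = 11.0                  # Lambda + API Gateway (estimate)

def chatgpt_daily(requests):
    return requests * CHATGPT_COST_PER_REQUEST

def aws_daily(requests):
    return AWS_HOSTING_PER_DAY + requests / 1e6 * AWS_PER_MILLION_REQUESTS

for r in (1_000, 100_000, 1_000_000):
    print(f"{r:>9,} req/day: ChatGPT ${chatgpt_daily(r):>8,.0f}  AWS ${aws_daily(r):>6,.0f}")
# ChatGPT wins at ~1K requests/day ($1.3 vs $150); AWS wins above ~100K requests/day.
```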

Option 3: Open-Source LLM Costs for Smaller Language Models

For even simpler tasks, for example, creating a summarization engine for a corporate HR policy on which one can build an employee chatbot, or for spam detection, smaller language models like BERT, with hundreds of millions of parameters, are sufficient.

For training BERT, one can use cheaper instances like ml.m5.xlarge, which costs $0.23/hour, or ~$5.5 a day. These smaller models are also quite powerful for “narrow” applications, even compared to ChatGPT and GPT-4, which understand the complex nuances of human language.
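
As an illustration, here is a minimal fine-tuning sketch for a spam classifier built on BERT with the Hugging Face Trainer; the dataset choice and hyperparameters are assumptions for the example, not a production recipe:

```python
# Minimal sketch: fine-tuning BERT for a narrow task like spam detection.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# SMS spam dataset from the Hugging Face hub, chosen for illustration
dataset = load_dataset("sms_spam", split="train").train_test_split(test_size=0.1)
tokenized = dataset.map(
    lambda x: tokenizer(x["sms"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-spam", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```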

Consideration 3: Using Quantized Models Such as QLoRA

In a recent paper, a new method called QLoRA has been announced that provides an almost game-changing ability to train and fine-tune LLMs on consumer GPUs. In comparison to fine-tuning 16-bit models, QLoRA uses far less memory without sacrificing performance.

Using this method, one can fine-tune a 33B model on a single 24GB GPU, and a 65B model on a single 48GB GPU. QLoRA achieves this by using 4-bit quantization to compress a pre-trained language model. The language model parameters are then frozen, and a relatively small number of trainable parameters, in the form of Low-Rank Adapters, are added to the model.

During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pre-trained language model into the Low-Rank Adapters (LoRA). According to the LoRA study, which goes into great depth, the LoRA layers are the only parameters modified during training.

QLoRA uses a 4-bit NormalFloat storage data type for the base model weights and a 16-bit BrainFloat data type to perform computations. During the forward and backward passes, QLoRA de-quantizes weights from the storage data type to the compute data type, but it computes weight gradients only for the LoRA parameters, which are kept in 16-bit BrainFloat. The weights are decompressed only when needed, a small fraction of the cycle time, significantly reducing memory usage during training and inference.
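
For illustration, here is a minimal QLoRA setup using the Hugging Face peft and bitsandbytes libraries; the base model checkpoint and the LoRA hyperparameters are assumptions chosen for the example:

```python
# Sketch of QLoRA fine-tuning setup with the Hugging Face peft + bitsandbytes stack.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",               # the 4-bit NormalFloat storage type
    bnb_4bit_compute_dtype=torch.bfloat16,   # 16-bit BrainFloat for computation
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                   # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Frozen 4-bit base model; only the low-rank adapters are trainable
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # typically <1% of total parameters
```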

QLoRA tuning, while significantly reducing memory usage and requiring only 24 hours of fine-tuning on a single GPU, reaches 99.3% of the performance level of ChatGPT. Moreover, Guanaco, which applies QLoRA fine-tuning to the LLaMA model, comes close to ChatGPT on the Vicuna benchmark.

The cost of training and fine-tuning is now approaching sub-$10,000, or even sub-$1,000, with QLoRA.

Figure 1: Schematic for Comparison of ChatGPT, LLM, Smaller Language Models, and QLoRA

Summary

The choice between using models like ChatGPT and GPT-4, which are developed by companies like OpenAI, versus open-source LLMs depends on several factors and considerations.

Models like ChatGPT and subsequent iterations from OpenAI often offer more relevant responses compared to open-source LLMs. They benefit from significant research and development efforts, incorporating advanced techniques and extensive training on vast datasets. These models are designed to have broad applicability and serve a wide range of use cases.

Table 2: Comparison of Models’ Cost, Advantages, and Disadvantages

However, open-source LLMs are rapidly catching up in terms of performance and relevance. They provide the advantage of being customizable and fine-tunable on specific data sources. Companies can leverage open-source models to train and fine-tune them on their domain-specific data, leading to potentially better performance in specialized contexts. This fine-tuning process allows organizations to tailor the models to their specific needs and optimize them for their unique requirements.

Furthermore, there are valid reasons for choosing open-source models over closed APIs provided by companies. Using open-source models provides more control and transparency as organizations have access to the underlying code and can modify it as needed. This level of customization can be valuable for companies with specific privacy, security, or compliance requirements.

Had this article been written six months ago, standard offerings from OpenAI and the like would have been the clear choice for enterprises. However, in the last six weeks, significant improvements in quantized methods such as QLoRA have opened up opportunities for enterprises, specifically those with a reasonably strong technical workforce willing to quickly experiment and learn.

And models are only one part of the story. The success of one’s AI journey depends on various factors such as the quality and diversity of the training data, the fine-tuning process, and the specific use case or domain being addressed. An example is BloombergGPT, which leverages highly domain-specific data to deliver a high-quality model.

Ultimately, the choice between using closed models like ChatGPT or open-source LLMs depends on the specific needs, resources, and priorities of an organization. Evaluating factors such as performance, customization options, data privacy, and compliance requirements will help determine the most suitable approach for leveraging LLMs in a given context.

Acknowledgments

I was inspired to write this article after reading a really nice post on LLMs by Skandagupta Vivek. As I delved deeper, I came across more articles, especially one by Andreessen Horowitz, and then the papers on QLoRA. Some of these papers have been embedded in the article, although I couldn’t do that for every article. This article has also been helped by conversations with many of my colleagues. Last, but not least, is my editor, Chandrajita Chakraborty, who played a crucial role in shaping this article into its final form.


