Performance Benchmarking: A Detailed Comparison Between Llama and Llama 2

Dr. Santanu Bhattacharya
Published in DataDrivenInvestor · Jan 17, 2024


Meta’s launch of Llama 2, an advanced open-source LLM (Large Language Model), has swept through the AI enthusiast community. The excitement has several sources. One is Meta’s role in pushing the boundaries of the AI industry: what started as a social media platform has evolved into an AI powerhouse, driving advances in natural language processing and other cutting-edge technologies, with many of its launches generously open-sourced in a collaborative and inclusive approach to innovation. There is also speculation that Meta might surpass the benchmarks set by another AI giant, OpenAI, and that competition, along with its outcomes, holds the potential to shape the future of the industry. To gain a good understanding of Llama 2’s capabilities, our team conducted detailed testing on custom use cases, using the 34B model on AWS SageMaker with a 24GB A10 GPU, as described below.

For context, Llama 2 is a family of pre-trained and fine-tuned LLMs ranging from 7 billion (7B) to 70 billion (70B) parameters. According to Meta AI, Llama 2 Chat LLMs are optimized for dialogue use cases and outperform open-source chat models on most benchmarks they tested. Based on Meta’s assessment of human evaluations for helpfulness and safety, the company says Llama 2 may be “a suitable substitute for closed source models.”

Llama 2, like the original Llama model, is based on the transformer architecture introduced by Google, with improvements including RMSNorm pre-normalization, inspired by GPT-3; the SwiGLU activation function, inspired by Google’s PaLM; and rotary positional embeddings (RoPE), inspired by GPT-Neo. Llama 2 also increases the context length (4,096 vs. 2,048 tokens) and uses grouped-query attention (GQA) in place of standard multi-head attention in the two larger models (34B and 70B).
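
For illustration, here is a minimal PyTorch sketch of the RMSNorm pre-normalization mentioned above; this is a simplified, generic version of the technique, not Meta’s exact implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization, used for pre-normalization in Llama/Llama 2."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS over the last dimension; no mean-centering and no bias,
        # which is what distinguishes RMSNorm from standard LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```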

Llama 2’s training corpus includes a mix of data from publicly available sources which, Meta says, does not include data from Meta’s products or services. The corpus comprised two trillion tokens of training data.

Meta used its Research Super Cluster and some internal production clusters for pre-training, with Nvidia A100 GPUs. Pre-training time ranged from 184K GPU hours for the 7B-parameter model to 1.7M GPU hours for the 70B-parameter model.

Fine-tuning Llama 2 Chat took months and involved both supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). Meta used Ghost Attention (GAtt) to keep Llama 2 Chat from forgetting its system message (overall instruction) from turn to turn in a dialogue.

Putting Llama 2 to the Test

To gain a good understanding of Llama 2’s capabilities, our team conducted elaborate testing on custom use cases, using the 34B model. The testing ground was set on AWS Sagemaker, equipped with a 24GB A10 GPU and the evaluation process assessed the following criteria:

Quality of Output: To assess output quality, custom internal data with human-curated responses was used. The model was given tasks including information extraction, content generation, and information verification, and the accuracy of the output, its adherence to instructions, and any deviations were carefully measured.

Performance: Performance was gauged by measuring the time taken to generate output across five sets of varying token lengths. To benchmark cost, OpenAI served as the reference point: both the GPT-3.5 and GPT-4 APIs were used for comparison, with cost calculations based on tokens per second and straightforward arithmetic (a simple sketch of this bookkeeping appears after these criteria).

Cost: A comparative cost analysis was also performed to check the economic feasibility of Llama 2 relative to its predecessor.
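
As a rough illustration of the performance bookkeeping described above, here is a minimal Python sketch; the figures are placeholders, not our measurements, and only show how generation time and tokens per second can be tabulated per token-size set:

```python
# Minimal sketch: tabulate average generation time and tokens/second per token-size set.
# The numbers below are placeholders, not the article's measurements.
from statistics import mean

# results[token_size] = list of (output_tokens, generation_seconds), one entry per question
results = {
    100: [(45, 6.2), (50, 7.0), (40, 5.8)],
    250: [(110, 14.5), (95, 12.8), (120, 15.1)],
}

for token_size, runs in results.items():
    avg_time = mean(t for _, t in runs)
    tokens_per_sec = mean(n / t for n, t in runs)
    print(f"{token_size}-token set: avg {avg_time:.1f}s per answer, ~{tokens_per_sec:.1f} tokens/s")
```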

The Test Set

The evaluation set comprised five distinct sets, each containing different token sizes: 100, 250, 500, 750, and 1,000. Each set presented ten questions, providing publicly available banking product information text as context for each inquiry. In the 100-token set, product titles took center stage, prompting the model to offer its inference or judgment. The validation set played a vital role in comparing the model’s responses, with accuracy being the ultimate litmus test.

For the sets containing 500, 750, and 1,000 tokens, Llama 2 was tasked with extracting values for a list of attributes from the provided context. The answers were checked against the validation set, and a response counted as correct only if it included all the required attributes (see the sketch below).
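
To make the protocol concrete, here is an illustrative sketch of how one evaluation item and the all-or-nothing correctness check could be represented. The field names and values are hypothetical, since the underlying banking dataset and validation set are internal to the study:

```python
# Illustrative only: a hypothetical evaluation item and the all-or-nothing correctness check.
test_item = {
    "context": "Publicly available banking product description used as context...",
    "question": "Extract the interest rate, minimum balance, and monthly fee.",
    "expected": {"interest_rate": "4.5%", "minimum_balance": "$500", "monthly_fee": "$0"},
}

def answer_is_correct(predicted: dict, expected: dict) -> bool:
    """An answer counts as correct only if every required attribute is present and matches."""
    return all(predicted.get(key) == value for key, value in expected.items())

# A response that misses even one required attribute is scored as incorrect.
print(answer_is_correct({"interest_rate": "4.5%", "minimum_balance": "$500"},
                        test_item["expected"]))  # False
```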

Results

The following tables provide the results for 100, 250, 500, 750, and 1,000 tokens.

Fig. 1. Raw Test Results for performance of 10 questions using 100 tokens
Fig. 2. Raw Test Results for performance of 10 questions using 250 tokens
Fig. 3. Raw Test Results for performance of 10 questions using 500 tokens
Fig. 4. Raw Test Results for performance of 10 questions using 750 tokens
Fig. 5. Raw Test Results for performance of 10 questions using 1,000 tokens. Note: the Llama model crashed while running this test set

Analytical Insights

This was a significant phase of the study, as it revealed the key indicators behind Llama 2’s performance, quality, and cost.

Quality: Is it better?

During the evaluation, the original Llama consistently delivered subpar outputs for the majority of questions, irrespective of the token sizes used.

Fig. 6. Quality Score Comparison Between Llama and Llama 2

In contrast, as shown in Figure 6, the Llama 2 model drastically improved the quality and accuracy of responses, surpassing its predecessor. It achieved a perfect score of 100% on AWS when tested with 100 tokens, and it maintained that consistency on AWS across the 250, 500, 750, and 1,000-token sets.

Performance: Is it faster?

The performance evaluation, which examined response-generation time, provided further insight into Llama 2. On AWS SageMaker, equipped with a 24GB A10 GPU, generation times were notably high. Comparing Llama and Llama 2 on the 250, 500, 750, and 1,000-token sets, Llama took three to six times as long as Llama 2 to generate responses.

Fig. 7. Generation Time Comparison Between Llama and Llama 2, in Seconds
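
For reference, a minimal sketch of how per-response generation time can be measured against a SageMaker real-time endpoint is shown below. The endpoint name and request payload are hypothetical and depend on how the model is deployed; only the wall-clock measurement pattern is the point here:

```python
# Minimal timing sketch for a model deployed behind a SageMaker real-time endpoint.
# Endpoint name and request schema are hypothetical and depend on the serving container.
import json
import time
import boto3

runtime = boto3.client("sagemaker-runtime")

def timed_generation(endpoint_name: str, prompt: str):
    """Invoke the endpoint once and return (raw response body, wall-clock seconds)."""
    start = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
    )
    elapsed = time.perf_counter() - start
    return json.loads(response["Body"].read()), elapsed
```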

Cost: Is it cheaper?

The cost evaluation provided a glimpse into the economics of Llama 2. Considering GPU charges per second and the total generation time per response, Llama 2 proved to be significantly more economical to run on AWS than Llama (the arithmetic is sketched after Fig. 8).

Fig. 8. Cost Comparison Between Llama and Llama 2, in Dollars
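
The per-response cost arithmetic can be sketched as follows; the GPU and API prices below are assumptions for illustration, not the rates used in the study:

```python
# Back-of-the-envelope cost comparison under assumed prices (not the study's actual rates).
GPU_HOURLY_RATE_USD = 1.50            # assumed price for a single-A10 GPU instance
API_PRICE_PER_1K_TOKENS_USD = 0.002   # assumed hosted-API price per 1,000 tokens

def self_hosted_cost(generation_seconds: float) -> float:
    """Cost of one response when paying for GPU time by the second."""
    return (GPU_HOURLY_RATE_USD / 3600.0) * generation_seconds

def api_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Cost of one response when paying per token through a hosted API."""
    return (prompt_tokens + output_tokens) / 1000.0 * API_PRICE_PER_1K_TOKENS_USD

# Example: a 20-second Llama answer vs. an 8-second Llama 2 answer on the same instance.
print(self_hosted_cost(20.0), self_hosted_cost(8.0))
```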

Summary

We ran an extensive comparison between Llama and Llama 2 using publicly available banking product information. The tests used the 34B model on AWS SageMaker, equipped with a 24GB A10 GPU, with token sizes of 100, 250, 500, 750, and 1,000.

The tests were not a perfect 1:1 comparison: for some of the questions, Llama “broke” at the higher token counts, e.g. 750 and 1,000.

The overall results were as follows:

  • The generation time ratio of Llama:Llama 2 was ~3–6:1, i.e., Llama took approximately three to six times as long as Llama 2 to provide an answer
  • The cost ratio of Llama:Llama 2 was ~2–4:1, i.e., Llama cost approximately two to four times as much as Llama 2 to provide an answer
  • The accuracy ratio of Llama:Llama 2 was ~1:8, i.e., Llama 2 was eight times as accurate as Llama

That said, these results are not universal; they were significantly influenced by the larger token sizes, where for some questions Llama failed to provide answers even after a long wait.

However, we found this framework to be a useful one for comparing different models. It is a practical approach for a quick test when an enterprise, a bank, a Telco, or a healthcare provider, for example, is choosing models for specific applications.

