Generative language models are getting consistently faster. The biggest state-of-the-art models, often called “frontier models,” such as Anthropic’s Claude, Meta’s Llama, and OpenAI’s GPT-4, can generate entire paragraphs within a few seconds. The time it takes to get a complete response from one of these models, its “end-to-end inference latency,” often dictates which applications the model can be used for. For example, fast models can process information in near real-time, which makes them suitable for powering chatbots or virtual call center agents. Slower models may only be useful for data processing tasks that can happen in the background so that users don’t perceive long wait times.
When developing a new application, software developers must choose which model or sets of models to implement. This decision has significant consequences for the cost of running the application and the perceived speed of the application, which is directly tied to user experience (UX). To help them make their decision, developers can reference published benchmarks that compare the latency and cost of different language models. However, published metrics can be problematic, as companies publishing the benchmarks often have incentives to make their models appear faster, cheaper, and more accurate than their competitors.
At 11:59, we strive to use real data to drive these types of decisions. In a recent project where model latency was critical for the application, we weighed the benefits of leveraging AWS’s new Nova model family against the more well-established and popular Anthropic Claude model family. AWS published metrics suggesting that Nova was faster and cheaper than Claude, but we needed to confirm this ourselves to be confident in our decision to use these models in certain applications.
To compare model speed, we selected three models from the new AWS Nova model family (micro, lite, and pro) and three models from the Anthropic Claude family (haiku, sonnet, and sonnet v2). We accessed all models via AWS Bedrock in the same region. The simple prompt “Create a short description of AWS” was sent to each model 200 times, and we recorded the end-to-end inference latency and the number of generated characters. We then plotted the number of characters generated vs. the inference latency to compare differences in each model. We analyzed the data by calculating the 98% confidence interval (the area in which 98% of all the points were located) and drew ellipses to represent this region for each model.
From the chart, we can see that the models exhibit clear patterns. The Nova family is roughly twice as fast as the Claude family, as they have much lower end-to-end inference latencies. These models also cost approximately 10% of Claude's price, so they generally outperform Claude in speed and cost. The trends of each Nova model are more consistent and linear than those of the Claude models, which means Nova's latency is more closely correlated to the volume of generated characters than Claude's. This tells us that given the number of tokens Nova will generate, we can accurately predict how long it will take to generate them. This is not the case with Claude, as the models show much less correlation between the latency and the number of characters generated.
We can also see that Nova-micro and Nova-lite produce an average of ~15% more characters than Nova-pro when given the same prompt. This is interesting because Nova-pro is marketed as the Nova model with the best reasoning capabilities. The results here tell us that because of Nova-pro's high reasoning capabilities, it may be able to address our prompts with fewer tokens than the smaller Nova models, which means that it’s generally more concise.
Moving forward, we can be confident that the AWS Nova models offer the low cost and low latency that some of our projects demand. When we perform quantitative studies of new technologies, we gain trustworthy, valuable insights that are often hard to get when relying on published benchmark metrics. Ultimately, results like these allow our team to make data-driven decisions that reduce costs, improve performance, and provide overall better UX for our clients and their applications.