The evolution of artificial intelligence has seen the development of increasingly sophisticated models, with DeepSeek V3 671B standing out due to its massive parameter count and advanced architecture. Users and developers are keenly interested in understanding the performance metrics of such models, particularly tokens per second (TPS), when deployed on powerful consumer-grade GPUs like the NVIDIA RTX 4090. This analysis delves into the feasibility and expected performance of running DeepSeek V3 671B in a quantized state on an RTX 4090, synthesizing insights from multiple authoritative sources.
DeepSeek V3 671B is a state-of-the-art Mixture-of-Experts (MoE) model, boasting an impressive 671 billion parameters. Unlike dense models where all parameters are active simultaneously, MoE models activate only a subset of parameters per token. In the case of DeepSeek V3, approximately 37 billion parameters are activated for each token. This selective activation enhances efficiency but still presents substantial memory and computational demands.
Quantization is a pivotal technique in optimizing large models for deployment on hardware with limited resources. By reducing the precision of the model's parameters, quantization decreases memory usage and can enhance inference speed. DeepSeek V3 supports various quantization levels, such as 4-bit (INT4) and 8-bit (INT8) quantization. While 4-bit quantization offers significant memory savings, it may introduce slight degradations in model performance.
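As a rough illustration of why quantization matters at this scale, the weight-only memory footprint can be estimated directly from the parameter count and bit-width. This is a back-of-envelope sketch: real quantization formats add block-scale metadata and often keep some tensors at higher precision, so actual on-disk sizes differ from the raw arithmetic.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Weights only -- ignores KV cache, activations, and any
    quantization metadata such as per-block scales.
    """
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 671e9  # DeepSeek V3 parameter count

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):,.0f} GB")
```

Even at 4 bits per parameter, the weights alone occupy hundreds of gigabytes, an order of magnitude beyond a single consumer GPU's VRAM.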
The NVIDIA RTX 4090 is one of the most powerful consumer-grade GPUs available, equipped with 24 GB of VRAM and a memory bandwidth of just over 1 TB/s (1008 GB/s). These specifications make it a suitable candidate for deploying large-scale AI models, albeit with certain limitations. The RTX 4090's Ada Lovelace architecture supports advanced low-precision formats, including FP8, which can be leveraged to optimize model performance further.
In the realm of AI inference, the RTX 4090 demonstrates impressive capabilities for models with up to several billion parameters, especially when employing quantization and optimization frameworks like TensorRT or vLLM. However, the leap to handling a 671 billion-parameter model like DeepSeek V3 introduces challenges that exceed the GPU's native capacities.
The concept of tokens per second (TPS) is integral to evaluating the performance of language models. It measures how many tokens a model can process or generate each second during inference. For DeepSeek V3 671B running on an RTX 4090 with quantization, the TPS is influenced by multiple factors, including quantization level, model architecture, and hardware optimizations.
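In practice, decode TPS is measured by timing token generation. A minimal harness might look like the following sketch, where `fake_generate` is a stand-in for whatever generation call the inference framework actually exposes:

```python
import time

def measure_tps(generate_fn, prompt: str, max_new_tokens: int) -> float:
    """Time a generation call and return decode tokens per second.

    `generate_fn` is a placeholder: swap in the real inference call
    of whichever framework is being benchmarked.
    """
    start = time.perf_counter()
    produced = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(produced) / elapsed

def fake_generate(prompt, n):
    # Toy generator: pretends each token takes 10 ms to decode.
    tokens = []
    for _ in range(n):
        time.sleep(0.01)
        tokens.append("tok")
    return tokens

print(f"{measure_tps(fake_generate, 'hello', 20):.0f} tok/s")
```

Real benchmarks typically separate prefill (prompt processing) from decode (token-by-token generation), since the two phases stress the hardware differently.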
Analyzing the provided sources reveals a wide spread of estimated TPS for DeepSeek V3 671B on an RTX 4090, ranging from under 1 to roughly 15 tokens per second depending on the setup; these estimates are consolidated in the comparison table further below.
Several critical factors determine the achievable TPS for DeepSeek V3 671B on an RTX 4090:
| Factor | Impact on TPS |
|---|---|
| Quantization Level | Lower bit-widths (e.g., 4-bit) reduce memory usage and can enhance TPS, but may slightly degrade model accuracy. |
| Batch Size | A batch size of 1 minimizes per-request latency; larger batches raise aggregate throughput but increase memory pressure, a serious constraint within 24 GB of VRAM. |
| Model Optimization | Utilizing frameworks like TensorRT or vLLM can optimize inference processes, potentially improving TPS. |
| Hardware Constraints | The RTX 4090's 24 GB VRAM and 1 TB/s memory bandwidth are substantial but may still fall short for efficiently handling a 671B-parameter model even in quantized form. |
| Model Architecture | The MoE architecture activates 37B parameters per token, requiring significant computational resources per inference request. |
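A useful way to see how these factors interact is a bandwidth-bound back-of-envelope estimate: at batch size 1, every decoded token must stream the active expert weights through memory, so peak TPS is bounded by bandwidth divided by active bytes per token. The PCIe figure below assumes Gen4 x16 (~32 GB/s) for weights offloaded to system RAM; both numbers are illustrative ceilings, not measurements:

```python
ACTIVE_PARAMS = 37e9    # MoE-active parameters per decoded token
BITS = 4                # 4-bit quantized weights
bytes_per_token = ACTIVE_PARAMS * BITS / 8   # ~18.5 GB read per token

GDDR6X_BW = 1008e9      # RTX 4090 memory bandwidth, ~1 TB/s
PCIE4_BW = 32e9         # PCIe 4.0 x16, ~32 GB/s (assumed offload path)

print(f"VRAM-resident ceiling: {GDDR6X_BW / bytes_per_token:.1f} tok/s")
print(f"PCIe-offload ceiling:  {PCIE4_BW / bytes_per_token:.1f} tok/s")
```

Because the full model cannot fit in 24 GB of VRAM, the PCIe-bound figure of roughly 1.7 tokens per second is the more realistic ceiling for a single card, which is consistent with the low end of the reported estimates.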
One of the most significant hurdles in deploying DeepSeek V3 671B on a single RTX 4090 is the memory requirement. Even with aggressive quantization, the model demands approximately 312 GB of memory at Q4 quantization levels, vastly exceeding the RTX 4090's 24 GB VRAM. This discrepancy leads to frequent "Out of Memory" (OOM) errors, rendering single or even multiple RTX 4090s insufficient for practical deployments.
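A naive split of that ~312 GB Q4 footprint across 24 GB cards shows why even multi-GPU 4090 setups hit OOM once KV cache and activation overhead are included. This is a sketch; real tensor-parallel sharding carries additional per-GPU overhead for activations and communication buffers:

```python
import math

MODEL_Q4_GB = 312   # approximate Q4 weight footprint cited above
VRAM_GB = 24        # RTX 4090 VRAM

cards = math.ceil(MODEL_Q4_GB / VRAM_GB)
print(f"Minimum RTX 4090s for weights alone: {cards}")
```

Thirteen cards just to hold the weights, with zero headroom for KV cache, makes the repeated OOM reports unsurprising.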
Beyond memory constraints, the computational overhead associated with running such a large model is substantial. The RTX 4090, while powerful, is primarily optimized for models up to several billion parameters. Scaling up to 671 billion parameters introduces immense computational demands that the GPU struggles to meet, resulting in low TPS despite quantization efforts.
The MoE architecture employed by DeepSeek V3 activates a significant subset of the total parameters per token. Specifically, activating 37 billion parameters per token intensifies the memory and processing requirements. This complexity further exacerbates the challenges of deploying the model on hardware with limited VRAM and computational capacity.
Aggressive quantization, such as 4-bit, substantially reduces the memory footprint of DeepSeek V3 671B, making it marginally more feasible to run on an RTX 4090. Even with these optimizations, however, TPS remains limited by the model's sheer size.
Utilizing advanced inference frameworks like TensorRT-LLM or SGLang can optimize the model's performance on NVIDIA GPUs. These frameworks are designed to enhance the efficiency of large-scale models, potentially improving TPS. Nonetheless, the degree of improvement may not bridge the substantial performance gap posed by the RTX 4090’s hardware limitations.
To effectively run DeepSeek V3 671B, distributed computing across multiple high-end GPUs or specialized hardware setups is often necessary. Solutions like NVIDIA H200 GPU clusters are better suited for handling such immense models, offering the combined memory and computational power required to achieve meaningful TPS levels.
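As a rough sizing sketch for such a cluster, assuming the H200's 141 GB of HBM3e per GPU and counting weights only (KV cache and activation overhead push real deployments toward a full 8-GPU node):

```python
import math

H200_GB = 141         # HBM3e capacity per NVIDIA H200
TOTAL_PARAMS = 671e9  # DeepSeek V3 parameter count

for label, bits in [("FP8", 8), ("INT4", 4)]:
    weights_gb = TOTAL_PARAMS * bits / 8 / 1e9
    gpus = math.ceil(weights_gb / H200_GB)
    print(f"{label}: ~{weights_gb:.0f} GB of weights -> at least {gpus} H200s")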
To provide a clearer perspective on the expected TPS for DeepSeek V3 671B on an RTX 4090, the following table consolidates estimates from various sources:
| Source | TPS Estimate | Quantization Level | Additional Notes |
|---|---|---|---|
| Source A | 5–15 tokens/sec | 4-bit | Well-optimized setup on RTX 4090 |
| Source B | 1–5 tokens/sec | 4-bit | Heavily constrained by VRAM and compute limitations |
| Source C | N/A | 4-bit | Experiences OOM errors even with multiple GPUs |
| Source D | Up to 5.37 tokens/sec | 4-bit | Measured on an M4 Mac Mini cluster rather than an RTX 4090; included for comparison |
This table illustrates that while quantization offers some improvements in TPS, the overall performance remains constrained by the RTX 4090’s hardware limitations. The variance across sources underscores the complexity of deploying such a large model on consumer-grade GPUs.
Deploying DeepSeek V3 671B on an NVIDIA RTX 4090, even with quantization, presents significant challenges primarily due to the model's immense size and the GPU's memory and computational constraints. While quantization techniques can marginally improve TPS, the achievable performance typically ranges between 1 and 15 tokens per second, depending on optimization levels and specific implementation details.
For practical and efficient deployment of DeepSeek V3 671B, especially in applications demanding higher TPS, specialized hardware configurations such as GPU clusters are advisable. These setups can provide the necessary memory bandwidth and VRAM to handle the model's requirements effectively, ensuring both speed and reliability in inference tasks.