CUDA parallel processing cores: 2304
NVIDIA Tensor Cores: 288
NVIDIA RT Cores: 36
Memory: 8 GB GDDR6
RTX-OPS: 43T
Raycasting: 8 Giga Rays/Sec
Maximum single precision (FP32) performance: 7.1 TFLOPS
Maximum half precision (FP16) performance: 14.2 TFLOPS
Maximum integer (INT8) performance: 28.5 TOPS
Deep learning performance: 57.0 TFLOPS
Memory Interface: 256-bit
Memory Bandwidth: Up to 416 GB/s
Maximum power consumption: 160 W
Bus: PCI Express 3.0 x 16
Display Connector: DP 1.4 (3) + VirtualLink (1)
Form Factor: 4.4”H x 9.5”L
Weight: 479 g
Cooling scheme: Active
NVIDIA® 3D Vision® and 3D Vision Pro: Supported via 3-pin mini-DIN
Frame Lock Compatible (with Quadro Sync II)
NVLink Interconnect Technology: None
Auxiliary power connector: 8-pin PCIe
Performance characteristics
Turing GPU Architecture
The Quadro RTX 4000 GPU is fabricated on an advanced 12 nm FFN (FinFET NVIDIA) high-performance process customized for NVIDIA. With 2304 CUDA cores, it is the most powerful computing platform for HPC, AI, VR, and graphics workloads on the professional desktop. The Turing GPU architecture represents the biggest leap forward in real-time computer graphics since NVIDIA introduced programmable shaders in 2001. It packs 13.6 billion transistors into a 545 mm² die and delivers more than 7.1 TFLOPS of single-precision (FP32), 14.2 TFLOPS of half-precision (FP16), 28.5 TOPS of integer (INT8), and 57.0 TFLOPS of Tensor compute, making it well suited to a wide range of compute-intensive workloads.
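The 7.1 TFLOPS FP32 figure above follows from the CUDA core count and the clock. A quick sanity check, assuming a boost clock of roughly 1545 MHz (a typical published figure for this card, not stated in the spec table above):

```python
# Peak FP32 throughput = CUDA cores x 2 FLOPs per clock (one fused
# multiply-add counts as two operations) x clock frequency.
cuda_cores = 2304
boost_clock_hz = 1.545e9  # assumed boost clock, not in the table above
fp32_tflops = cuda_cores * 2 * boost_clock_hz / 1e12
print(round(fp32_tflops, 1))  # ~7.1, matching the spec table
```

The FP16 figure is simply double this, since each CUDA core can issue two half-precision operations per FP32 lane.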
RT core
New hardware ray-tracing technology enables the GPU, for the first time, to render film-quality photorealistic objects and environments in real time, including physically accurate shadows, reflections, and refractions. The real-time ray-tracing engine works with NVIDIA OptiX, Microsoft DXR, and the Vulkan API to deliver a level of realism far beyond what traditional rasterization techniques can achieve. The RT Cores accelerate Bounding Volume Hierarchy (BVH) traversal and ray/triangle intersection testing in hardware.
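To make the BVH traversal work concrete, here is a toy Python sketch of what an RT core accelerates in silicon: walking a tree of axis-aligned bounding boxes and testing each box against a ray with the classic slab method. This is an illustration of the general technique, not NVIDIA's implementation.

```python
def ray_hits_aabb(origin, inv_dir, lo, hi):
    """Slab test: the ray hits the box if the intervals where it is
    inside each axis slab overlap."""
    tmin, tmax = 0.0, float("inf")
    for o, d, l, h in zip(origin, inv_dir, lo, hi):
        t1, t2 = (l - o) * d, (h - o) * d
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

class BVHNode:
    def __init__(self, lo, hi, children=(), leaf=None):
        self.lo, self.hi = lo, hi    # bounding-box corners
        self.children = children     # inner-node children
        self.leaf = leaf             # leaf payload (e.g. a triangle id)

def traverse(node, origin, direction):
    """Collect leaf payloads whose bounding boxes the ray passes through."""
    inv = tuple(1.0 / d if d != 0 else float("inf") for d in direction)
    hits, stack = [], [node]
    while stack:
        n = stack.pop()
        if not ray_hits_aabb(origin, inv, n.lo, n.hi):
            continue  # prune this whole subtree
        if n.leaf is not None:
            hits.append(n.leaf)
        stack.extend(n.children)
    return hits

# Toy scene: two unit boxes under one root bounding box.
a = BVHNode((0, 0, 0), (1, 1, 1), leaf="A")
b = BVHNode((3, 0, 0), (4, 1, 1), leaf="B")
root = BVHNode((0, 0, 0), (4, 1, 1), children=(a, b))
print(traverse(root, (-1, 0.5, 0.5), (1, 0, 0)))  # ray along +x hits both
```

Pruning whole subtrees on a failed box test is what makes BVH traversal cheap; the RT core performs these box and triangle tests in fixed-function hardware instead of shader code.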
Enhanced Tensor Cores
The new mixed-precision cores are purpose-built for deep learning matrix arithmetic, delivering up to 8x the training TFLOPS of the previous generation. The Quadro RTX 4000 contains 288 Tensor Cores, each of which can perform 64 floating-point fused multiply-add (FMA) operations per clock, for a total of 1024 individual floating-point operations per SM per clock. In addition to FP16/FP32 matrix operations, the new Tensor Cores add INT8 (2048 integer operations per clock) and experimental INT4 and INT1 (binary) precision modes for matrix operations.
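The per-clock figures above tie directly to the 57.0 TFLOPS deep learning number in the spec table. Checking the arithmetic, again assuming a ~1545 MHz boost clock (an assumption, not stated in the table):

```python
tensor_cores = 288
fma_per_clock = 64        # per Tensor Core, from the text above
flops_per_fma = 2         # one multiply + one add
boost_clock_hz = 1.545e9  # assumed boost clock
tflops = tensor_cores * fma_per_clock * flops_per_fma * boost_clock_hz / 1e12
print(round(tflops, 1))   # ~57.0, matching the spec table

# Per-SM check: 288 Tensor Cores across 36 SMs = 8 per SM, and
# 8 x 64 FMA x 2 FLOPs = 1024 floating-point ops per SM per clock.
per_sm = (tensor_cores // 36) * fma_per_clock * flops_per_fma
print(per_sm)  # 1024
```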
Advanced Shading Technology
Mesh Shading: A compute-based geometry pipeline that speeds up geometry processing and culling for geometrically complex models and scenes. Mesh shading provides up to a 2x performance boost for workloads limited by geometry throughput.
Variable Rate Shading (VRS): Varies the shading rate based on scene content, gaze direction, and motion to improve rendering efficiency, delivering similar image quality while shading up to 50% fewer pixels.
Texture-Space Shading: Shades objects in object/material space and reuses the pre-shaded results, improving performance for pixel-shading-heavy workloads such as depth of field and motion blur, and raising throughput and realism in demanding VR workloads.
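The VRS savings are easy to account for: a region shaded at an NxM coarse rate invokes one shader per NxM pixel block instead of one per pixel. A toy tally, with an invented split of the frame between full-rate and 2x2 coarse regions:

```python
# Toy accounting for variable rate shading. The half/half region split
# below is invented for illustration; a real renderer assigns rates per
# tile based on content, motion, or gaze.
def shaded_samples(pixels, rate):
    w, h = rate
    return pixels // (w * h)  # one shader invocation per w x h block

frame = 1920 * 1080
full   = shaded_samples(frame // 2, (1, 1))  # half the frame at full rate
coarse = shaded_samples(frame // 2, (2, 2))  # half the frame at 2x2 coarse
total  = full + coarse
print(total / frame)  # 0.625 -> 37.5% fewer shader invocations
```

Shading the entire frame at 2x2 would reach the 50%-fewer-pixels figure quoted above and beyond; in practice only part of the frame can tolerate the coarser rate.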
High-performance GDDR6 memory
Featuring Turing's highly optimized 8 GB GDDR6 memory subsystem with the industry's fastest graphics memory (416 GB/s peak bandwidth), the Quadro RTX 4000 is an ideal platform for latency-sensitive applications that process large datasets. It offers 70% more memory bandwidth than the previous generation.
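The 416 GB/s peak figure follows from the 256-bit interface in the spec table and the GDDR6 per-pin data rate, assumed here to be 13 Gbps (a value consistent with the table but not stated in it):

```python
bus_width_bits = 256   # memory interface width from the spec table
data_rate_gbps = 13.0  # assumed GDDR6 per-pin data rate, not in the table
bandwidth_gb_s = bus_width_bits / 8 * data_rate_gbps  # bytes/clock x rate
print(bandwidth_gb_s)  # 416.0 GB/s, matching the spec table
```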
Single Instruction, Multiple Threads (SIMT)
The new independent thread scheduling feature shares resources among smaller jobs, enabling finer synchronization and cooperation between parallel threads.
Advanced Streaming Multiprocessor (SM) Architecture
Combines shared memory and the L1 cache into a single block, dramatically improving performance while simplifying programming and reducing the tuning needed to reach optimal application performance. Each SM contains 96 KB of unified L1/shared memory, which can be partitioned in various ways depending on the compute or graphics workload. For compute workloads, up to 64 KB can be allocated to L1 cache or shared memory; graphics workloads can allocate up to 48 KB to shared memory, with 32 KB for L1 and 16 KB for texture units. Combining the L1 cache with shared memory reduces latency and provides higher bandwidth.
Mixed-precision arithmetic
Uses 16-bit floating-point operations to double throughput and reduce storage requirements, enabling the training and deployment of larger neural networks. The Turing SM also has independent parallel integer and floating-point datapaths, making it more efficient on workloads that mix arithmetic with address calculations.
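The storage side of that trade-off can be shown with Python's standard library alone: the `struct` module supports IEEE 754 half precision via the `"e"` format. FP16 halves the bytes per value but gives up precision, which is exactly what mixed-precision training manages around:

```python
import struct

# FP16 uses 2 bytes per value versus 4 for FP32.
print(struct.calcsize("e"), struct.calcsize("f"))  # 2 4

# Round-tripping 0.1 through each format shows the precision cost:
value = 0.1
as_half   = struct.unpack("e", struct.pack("e", value))[0]
as_single = struct.unpack("f", struct.pack("f", value))[0]
print(abs(as_half - value) > abs(as_single - value))  # True: FP16 loses more
```

Mixed-precision training typically keeps a master copy of weights in FP32 while doing the bulk matrix math in FP16, recovering the throughput and storage win without accumulating this rounding error.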
Graphics preemption
Pixel-level preemption provides more granular control and better support for time-dependent work, such as VR motion tracking.
Computational preemption
Instruction-level preemption provides finer-grained control over computational work to avoid long-executing applications monopolizing system resources or timing out.
H.264 and HEVC encoding/decoding engines
Two dedicated H.264/HEVC encode engines and a decode engine, independent of the 3D/compute pipeline, deliver faster-than-real-time performance for transcoding, video editing, and other encoding applications.
NVIDIA GPU BOOST 4.0
Automatically maximizes application performance without exceeding the card's power and thermal envelope. Applications can stay at boost clocks for longer, remaining above the base clock at higher temperatures until a second, higher temperature threshold is reached. The feature is invoked through software and does not require a standalone program.
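The temperature behavior described above can be sketched as a simple clock policy. All numbers here (clocks and thresholds) are invented for illustration; the real GPU Boost 4.0 algorithm also tracks power draw and voltage and is proprietary:

```python
# Minimal sketch of boost-vs-temperature behavior, assuming hypothetical
# clocks and temperature thresholds -- not NVIDIA's actual curve.
BASE_MHZ, BOOST_MHZ = 1005, 1545      # hypothetical base/boost clocks
BOOST_LIMIT_C, BASE_LIMIT_C = 83, 89  # hypothetical temperature points

def clock_for(temp_c):
    if temp_c < BOOST_LIMIT_C:
        return BOOST_MHZ                    # full boost while cool enough
    if temp_c < BASE_LIMIT_C:               # intermediate step: hold a clock
        return (BASE_MHZ + BOOST_MHZ) // 2  # above base instead of dropping
    return BASE_MHZ                         # fall back to base clock

print([clock_for(t) for t in (70, 85, 92)])  # [1545, 1275, 1005]
```

The intermediate step is the point of the description above: rather than falling straight to the base clock at the first limit, the card holds an elevated clock until the second temperature threshold.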