Artificial Intelligence and deep learning are constantly in the headlines these days, whether it's ChatGPT giving out questionable advice, self-driving cars, artists being accused of using AI, or AI-dispensed medical advice. Most of these tools rely on powerful servers packed with hardware for training, but running the trained network via inference can be done on your PC, using its graphics card. So how fast are consumer GPUs at AI inference?
We’ve benchmarked Stable Diffusion, a popular AI image creator, on the latest Nvidia, AMD, and even Intel GPUs to see how they stack up. If you’ve by chance tried to get Stable Diffusion up and running on your own PC, you may have some inkling of how complex — or simple! — that can be. The short summary is that Nvidia’s GPUs rule the roost, with most software designed using CUDA and other Nvidia toolsets. But that doesn’t mean you can’t get Stable Diffusion running on the other GPUs.
We ended up using three different Stable Diffusion projects for our testing, mostly because no single package worked on every GPU. For Nvidia, we opted for Automatic 1111's webui version. AMD GPUs were tested using Nod.ai's Shark version, while for Intel's Arc GPUs we used Stable Diffusion OpenVINO. Disclaimers are in order. We didn't code any of these tools, but we did look for stuff that was easy to get running (under Windows) that also seemed to be reasonably optimized.
We're relatively confident that the Nvidia 30-series tests do a good job of extracting close to optimal performance, particularly when xformers is enabled, which provides an additional ~20% boost. The RTX 40-series results, meanwhile, are a bit lower than expected, perhaps due to a lack of optimizations for the new Ada Lovelace architecture.
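As an aside, if you're running Stable Diffusion from a script rather than the webui (which takes a --xformers launch flag), here's roughly what enabling the memory-efficient attention path looks like. This is a minimal sketch using Hugging Face's diffusers library, not the code path of any of the projects we benchmarked, so treat the model ID and calls as illustrative.

```python
# Minimal sketch: enabling xformers memory-efficient attention via diffusers
# (requires the xformers package to be installed alongside PyTorch).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # SD 1.4, matching the models we tested
    torch_dtype=torch.float16,        # FP16 to make use of the tensor cores
).to("cuda")

# Swaps the default attention for the xformers kernels; on RTX 30-series
# cards this was worth roughly 20% in our testing.
pipe.enable_xformers_memory_efficient_attention()
```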
The AMD results are also a bit of a mixed bag, but they're the reverse of the Nvidia situation: RDNA 3 GPUs perform quite well while the RDNA 2 GPUs seem rather mediocre. Finally, on Intel GPUs, even though the peak iteration rate seems to line up decently with the AMD options, in practice the time to render is substantially longer, probably because a lot of extra background work is happening that slows things down.
We're also using Stable Diffusion 1.4 models, rather than the newer SD 2.0 or 2.1, mostly because getting SD2.1 working on non-Nvidia hardware would have required a lot more work (i.e. learning to write code to enable support). However, if you have some inside knowledge of Stable Diffusion and want to recommend different open source projects that may run better than what we used, let us know in the comments (or just email Jarred).
Our testing parameters are the same for all GPUs, though there's no option for a negative prompt on the Intel version (at least, not one we could find). The above gallery was generated using the Nvidia version, with higher resolution outputs (that take much, much longer to complete). It uses the same prompts but targets 2048×1152 instead of the 512×512 we used for our benchmarks. Here are the pertinent settings:
Positive Prompt: postapocalyptic steampunk city, exploration, cinematic, realistic, hyper detailed, photorealistic maximum detail, volumetric light, (((focus))), wide-angle, (((brightly lit))), (((vegetation))), lightning, vines, destruction, devastation, wartorn, ruins
Negative Prompt: (((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, (((depth of field)))
Steps: 100
Classifier Free Guidance: 15.0
Sampling Algorithm: Some Euler variant (Ancestral, Discrete)
The sampling algorithm doesn’t appear to majorly affect performance, though it can affect the output. Automatic 1111 provides the most options, while the Intel OpenVINO build doesn’t give you any choice.
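For anyone who wants to reproduce those settings outside the three projects we tested, here's a rough scripted equivalent using Hugging Face's diffusers library. It isn't the code path of Automatic 1111, Shark, or the OpenVINO build; the model ID and scheduler are simply chosen to mirror the parameters above.

```python
# Rough equivalent of our benchmark settings, expressed via diffusers.
# Illustrative only; the actual tests used Automatic 1111's webui,
# Nod.ai's Shark, and Stable Diffusion OpenVINO.
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")
# Euler Ancestral, matching the "some Euler variant" sampler we selected.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="postapocalyptic steampunk city, exploration, cinematic, realistic, "
           "hyper detailed, photorealistic maximum detail, volumetric light, "
           "(((focus))), wide-angle, (((brightly lit))), (((vegetation))), "
           "lightning, vines, destruction, devastation, wartorn, ruins",
    negative_prompt="(((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, "
                    "(((depth of field)))",
    num_inference_steps=100,   # Steps: 100
    guidance_scale=15.0,       # Classifier Free Guidance: 15.0
    width=512,
    height=512,
).images[0]
image.save("steampunk_city.png")
```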
Here are the results from our testing of the AMD RX 7000/6000-series, Nvidia RTX 40/30-series, and Intel Arc A-series GPUs. Note that each Nvidia GPU has two results, one using the default computational model (slower and in black) and a second using the faster "xformers" library from Facebook (faster and in green).
As expected, Nvidia's GPUs deliver better performance, sometimes by massive margins, than anything from AMD or Intel. However, there are clearly some anomalies. The fastest GPU in our initial testing is the RTX 3090 Ti, topping out at nearly 20 iterations per second, or about five seconds per image using the configured parameters. Things fall off from there, but even the RTX 3080 basically ties AMD's new RX 7900 XTX, and the RTX 3050 bests the RX 6950 XT. But let's talk about the oddities.
First, we expected the RTX 4090 to crush the competition, and that clearly didn’t happen. In fact, it’s slower than AMD’s 7900 XT, and also slower than the RTX 3080. Similarly, the RTX 4080 lands between the 3070 and 3060 Ti, while the RTX 4070 Ti sits between the 3060 and 3060 Ti.
Proper optimizations could easily double the RTX 40-series cards' performance. Similarly, considering the significant performance gap between the RX 7900 XT and the RX 6950 XT, optimizations might be able to double the performance of the RDNA 2 GPUs as well. That's just a ballpark guess based on what we've seen in our GPU benchmarks hierarchy, but there's definitely something odd with these initial results.
Intel's Arc GPUs currently deliver very disappointing results, especially since they support XMX (matrix) operations that should deliver up to 4X the throughput of regular FP32 computations. We suspect the current Stable Diffusion OpenVINO project that we used also leaves a lot of room for improvement. Incidentally, if you want to try running SD on an Arc GPU, note that you have to edit the 'stable_diffusion_engine.py' file and change "CPU" to "GPU"; otherwise it won't use the graphics card for the calculations and will take substantially longer.
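To be clear on what that edit amounts to: the device is just a string passed to OpenVINO when the model gets compiled. The snippet below is a simplified sketch rather than the project's exact code, and the model path is hypothetical.

```python
# Simplified sketch of device selection in an OpenVINO-based Stable Diffusion
# engine. The real stable_diffusion_engine.py also wires up the text encoder
# and VAE, but the key change is the device string passed to compile_model.
from openvino.runtime import Core

core = Core()
model = core.read_model("unet.xml")               # hypothetical path to the converted UNet
compiled_unet = core.compile_model(model, "GPU")  # was "CPU"; "GPU" targets the Arc card
```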
Back to the results. Using the specified versions, Nvidia's RTX 30-series cards do great, AMD's RX 7000-series cards do great, the RTX 40-series underperforms, the RX 6000-series really underperforms, and Arc GPUs look generally poor. Things could change radically with updated software, and given the popularity of AI we expect it's only a matter of time before we see better tuning (or find the right project that's already tuned to deliver better performance).
Again, it’s not clear exactly how optimized any of these projects are, but it might be interesting to look at the maximum theoretical performance (TFLOPS) from the various GPUs. The following chart shows the theoretical FP16 performance for each GPU, using tensor/matrix cores where applicable.
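As a rough guide to where those theoretical numbers come from: FP32 shader throughput is two operations per FMA, times the shader count, times the clock speed, and the tensor or matrix units then multiply that rate for FP16. The back-of-envelope sketch below runs that arithmetic for the RTX 4090; the exact FP16 multiplier varies by architecture, so treat it as an illustration rather than vendor-quoted figures.

```python
# Back-of-envelope theoretical throughput, using the RTX 4090 as an example.
shaders = 16384          # CUDA cores on the RTX 4090
boost_clock_ghz = 2.52   # official boost clock

# Each shader can issue one FMA (2 floating point ops) per clock.
fp32_tflops = 2 * shaders * boost_clock_ghz / 1000
print(f"FP32: {fp32_tflops:.1f} TFLOPS")   # ~82.6 TFLOPS

# Dense FP16 tensor throughput on Ada is roughly 2x the FP32 shader rate
# (sparsity doubles it again); other architectures use different multipliers.
fp16_tensor_tflops = 2 * fp32_tflops
print(f"FP16 tensor (dense): {fp16_tensor_tflops:.1f} TFLOPS")  # ~165 TFLOPS
```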
Those Tensor cores on Nvidia clearly pack a punch, at least in theory, though obviously our Stable Diffusion testing doesn’t match up exactly with these figures. For example, on paper the RTX 4090 (using FP16) is up to 106% faster than the RTX 3090 Ti, while in our tests it was 35% slower. Note also that we’re assuming the Stable Diffusion project we used (Automatic 1111) doesn’t even attempt to leverage the new FP8 instructions on Ada Lovelace GPUs, which could potentially double the performance on RTX 40-series again.
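Put another way, taking those two percentages at face value, the gap between measured and theoretical scaling implies a lot of untapped headroom:

```python
# Naive headroom estimate for the RTX 4090, using only the figures quoted above.
measured_ratio = 1 - 0.35       # 4090 measured ~35% slower than the 3090 Ti
theoretical_ratio = 1 + 1.06    # 4090 on paper ~106% faster than the 3090 Ti
print(f"{theoretical_ratio / measured_ratio:.1f}x")  # ~3.2x potential uplift from tuning
```

That lines up with our earlier guess that better software could at least double the RTX 40-series results.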
Meanwhile, look at the Arc GPUs. Their matrix cores should provide similar performance to the RTX 3060 Ti and RX 7900 XTX, give or take, with the A380 down around the RX 6800. In practice, Arc GPUs are nowhere near those marks. The fastest A770 GPUs land between the RX 6600 and RX 6600 XT, the A750 falls just behind the RX 6600, and the A380 is about one fourth the speed of the A750. So they’re all about a quarter of the expected performance, which would make sense if the XMX cores aren’t being used.
The ratios on Arc do look about right, though. Theoretical compute performance on the A380 is about one-fourth that of the A750. Most likely, the Arc cards are using shaders for the computations, in full precision FP32 mode, and missing out on some additional optimizations.
The other thing to notice is that theoretical compute on AMD's RX 7900 XTX/XT improved a lot compared to the RX 6000-series, and that memory bandwidth isn't a critical factor: the 3080 10GB and 12GB models land relatively close together. So, perhaps the AMD results above aren't completely out of the question, as the 7900 XTX delivers nearly triple the raw compute of the 6950 XT. Except that the 7900 XT performs nearly as well as the XTX, even though raw compute should favor the XTX by about 19% rather than the 3% gap we measured.
Ultimately, this is more of a snapshot in time of Stable Diffusion performance on AMD, Intel, and Nvidia GPUs rather than a true statement of performance. With full optimizations, the performance should look more like the theoretical TFLOPS chart, and certainly newer RTX 40-series cards shouldn’t fall behind existing RTX 30-series parts.
Which brings us to one final chart, where we did some higher resolution testing. We didn’t test all the new GPUs yet, and we used Linux on the AMD RX 6000-series cards that we tested. But it looks like the more complex target resolution of 2048×1152 started to take better advantage of the RTX 4090 at least. We’ll see about revisiting this topic more in the coming year.