Sherief, FYI

Arkham Tarpit

Ever since Batman: Arkham Knight hit PC, its performance profile has been quite something - despite running well on consoles, it launched in an abysmal state (best covered in this TotalBiscuit YouTube video), and I found it legitimately fascinating to see a port with such a huge differential between its console and PC versions. At first I thought it was something that could be fixed by waiting a generation or two for hardware to brute-force through it, but that never really turned out to be the case. I was intrigued, and I dove into it, with the result being my Arkham Quixote patch / fix.

To this day, Arkham Knight’s performance is a hot discussion topic - many Digital Foundry videos have been made about it over the years, and Richard Leadbetter once called the game’s performance his Moby Dick. On DF Direct Weekly #75 one of the questions, asked by DF supporter James Post, was about Arkham Knight, and why going back to it with a much beefier system and more brute force didn’t seem to make the performance any better. I wanted to go over why exactly that’s the case, and what can (and cannot) be done about it in the future.

Let’s start by ruling out a common misconception: is Arkham Knight GPU bound, and would a faster GPU be able to brute-force its way through it? That doesn’t seem to be the case. The shaders / materials themselves run fine on last-gen consoles, and the level of visuals involved (while gorgeous to this day, I must say) isn’t something that should be bringing James’s 3070 to its knees. We can also make an educated guess about the bottleneck not being the GPU by looking at the graphics API - in this case it’s D3D11, but in general, across all graphics APIs, it’s usually difficult to create work that doesn’t scale with a larger GPU. You can definitely build pipelines that scale less than linearly or that run into various bottlenecks like ROPs, memory bandwidth, etc., but in general the GPU programming model is designed to generate scalable workloads. The API is even designed in such a way that it’s easier for you to corrupt data in a parallel, scalable fashion than it is for you to serialize the process of corrupting said data. And the game came out seven years ago; GPU horsepower has increased by leaps and bounds since then.

What’s actually holding back Arkham Knight’s performance on PC vs consoles in this case seems to be mostly down to two things: a subtle D3D11 implementation detail slash footgun that’s absent on Unified Memory Architecture (UMA) machines like consoles, and various issues and inefficiencies in the Windows display driver stack codepaths that have no equivalent on consoles.

The first issue, a D3D11 implementation detail, has to do with how the game streams the world to the GPU. On a console you have UMA and a fixed hardware target, so streaming a new chunk of the world is a matter of doing a copy (with optional decompression etc.) from the game’s assets to a region in memory and you’re done. On non-UMA systems, like any PC with a discrete graphics card, you have to copy from the game’s assets to something called a staging (or upload) buffer, which lives in memory that the CPU has write access to and the GPU has read access to, and then the data is copied from that staging area to the GPU’s video memory. Since that upload memory has to remain valid until the GPU finishes reading it, and since the GPU operates asynchronously from the CPU, the app has two options: either it manages the lifetime of the buffers itself, or it can use a D3D11 feature called “buffer renaming”. Buffer renaming works behind the scenes when the app asks for write access to a buffer that still has pending reads on the GPU, and only when a special flag is used: MAP_DISCARD (D3D11_MAP_WRITE_DISCARD) - in this case the D3D runtime is responsible for handing the app a chunk of memory with the same size, but actually different from the one with pending GPU read operations.
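To make that two-copy path concrete, here’s a minimal sketch of the “manage it yourself” route in D3D11 - the function and variable names are mine for illustration, error handling is omitted, and it assumes a device and immediate context already exist:

```cpp
// Sketch of the explicit two-copy upload path on a discrete GPU.
// Assumes gpuBuffer was created with D3D11_USAGE_DEFAULT and is the same
// size as the chunk; all names here are illustrative, not from the game.
#include <d3d11.h>
#include <cstring>

void UploadChunk(ID3D11Device* device, ID3D11DeviceContext* context,
                 ID3D11Buffer* gpuBuffer, const void* chunkData, UINT chunkSize)
{
    // 1) Create a CPU-writable staging buffer of the same size.
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = chunkSize;
    desc.Usage          = D3D11_USAGE_STAGING;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* staging = nullptr;
    device->CreateBuffer(&desc, nullptr, &staging);

    // 2) CPU copy: game assets -> staging memory.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    context->Map(staging, 0, D3D11_MAP_WRITE, 0, &mapped);
    std::memcpy(mapped.pData, chunkData, chunkSize);
    context->Unmap(staging, 0);

    // 3) GPU copy: staging -> video memory. This is queued, not immediate,
    //    so the staging memory must stay valid until the GPU is done with it.
    context->CopyResource(gpuBuffer, staging);

    // Releasing here is fine only because the D3D11 runtime keeps resources
    // referenced by pending GPU work alive until that work completes.
    staging->Release();
}
```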

Let’s consider an app that wants to stream two chunks of 32MB each - it can create two 32MB staging buffers and copy each chunk to one of them, or it can create one 32MB staging buffer, map it so it’s CPU visible, write to it, issue a GPU read command from it, then immediately map the same buffer again with the special MAP_DISCARD flag that instructs the runtime to “discard” the older contents from the app’s perspective - in which case the runtime checks to see if the original chunk still has pending work, and if it does, the runtime internally grabs another 32MB chunk from a reserve it keeps and presents that to the app.
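And here’s roughly what the renaming route looks like from the app’s side - again an illustrative sketch, assuming both buffers are the same size so CopyResource is legal:

```cpp
// Sketch of the renaming approach: one dynamic buffer, remapped with
// MAP_DISCARD for every chunk. dynamicBuffer is created once with
// D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE; names are illustrative.
#include <d3d11.h>
#include <cstring>

void StreamChunk(ID3D11DeviceContext* context,
                 ID3D11Buffer* dynamicBuffer,  // the single 32MB upload buffer
                 ID3D11Buffer* gpuBuffer,      // same-size destination in VRAM
                 const void* chunkData, UINT chunkSize)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};

    // D3D11_MAP_WRITE_DISCARD: if the previous contents are still being read
    // by the GPU, the runtime hands back a *different* allocation of the same
    // size behind the scenes - that's the renaming.
    context->Map(dynamicBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    std::memcpy(mapped.pData, chunkData, chunkSize);
    context->Unmap(dynamicBuffer, 0);

    // Queue the GPU-side read from the (possibly renamed) upload memory.
    context->CopyResource(gpuBuffer, dynamicBuffer);
}
```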

The subtle footgun here is that the D3D11 runtime has a limit of 128MB for that reserve memory, so if you have more than 128MB of data in flight and you issue another request to map memory, then even with MAP_DISCARD the runtime will wait for the GPU work reading earlier buffers to finish. This isn’t documented anywhere that I know of except in this one post on NVIDIA’s developer blog. As a result of this behavior, a map operation with MAP_DISCARD can finish anywhere from “instantly” to “a couple of frames later”, and the app doesn’t have much visibility into this. The only way to get predictability here is to rely on your own buffer management. Arkham Knight doesn’t do that, unfortunately, and can have MAP_DISCARD operations that last for quite a while - you can check out my instrumented ReShade fork to see the timings for yourself, or simply fork ReShade and add some QueryPerformanceCounter() calls around the map operations. This is one reason some people are having better luck using DXVK on Windows with Arkham Knight.
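If you want to roll that instrumentation yourself, the timing part is only a few lines - something along these lines, where the wrapper name and the one-millisecond threshold are mine, and the actual mechanics of intercepting ID3D11DeviceContext::Map are left to ReShade:

```cpp
// Timing a map operation with QueryPerformanceCounter. The wrapper name and
// the 1ms threshold are mine; hooking the real ID3D11DeviceContext::Map is
// left to ReShade (or whatever interception layer you prefer).
#include <d3d11.h>
#include <windows.h>
#include <cstdio>

HRESULT TimedMap(ID3D11DeviceContext* context, ID3D11Resource* resource,
                 UINT subresource, D3D11_MAP mapType, UINT mapFlags,
                 D3D11_MAPPED_SUBRESOURCE* mapped)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    const HRESULT hr = context->Map(resource, subresource, mapType, mapFlags, mapped);

    QueryPerformanceCounter(&end);
    const double ms = 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);

    // A WRITE_DISCARD map that should be near-free but takes milliseconds
    // means the runtime ran out of rename reserve and stalled on the GPU.
    if (mapType == D3D11_MAP_WRITE_DISCARD && ms > 1.0)
        std::printf("WRITE_DISCARD map stalled for %.3f ms\n", ms);

    return hr;
}
```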

The second issue is related to resource residency management on a general-purpose PC versus a console. In general, for a high-performance path, rendering with a resource like a texture or a vertex buffer requires that resource to be “resident” in GPU memory, as in actually, physically living in video RAM. On console, this is managed very simply: the console OS / UI gets a fixed chunk of video RAM, the game gets a fixed chunk of video RAM, and all the console OS has to do is make sure that while the game is running this one large chunk of its video memory allocation is actually living in video memory in its entirety. If the app is suspended it’s possible to evict this to disk, then restore it from disk when the app is un-suspended - I’d guess that’s how Xbox Quick Resume works, but I’m not familiar with the process since I’m neither an Xbox developer nor an Xbox owner. In short, residency management of GPU resources on consoles only has to handle a single GPU allocation and that’s that.

Enter the PC. PCs come in various configurations, PC apps come in varying levels of quality and craftsmanship, and it’s entirely possible and legal for a D3D11 application to launch, create a thousand textures, then only use two or three of them in its entire lifetime. Since video memory is a finite resource, residency management on PC is more complicated - only resources that are actually needed should be in video memory, and in cases of low video memory some command packets need to be split up: consider a command packet with six draw calls referencing 24 textures - if video memory can only fit 12 textures then the packet needs to be broken up into two parts with evict / make-resident steps in between. And all of this is system-wide across many apps, each with different sets of created versus actually used resources. In D3D12, the app must explicitly handle residency with the Evict() and MakeResident() APIs, but D3D11 and earlier apps rely on the OS to manage residency for them, and therein lies the issue: Arkham Knight doesn’t recycle or suballocate its resources (for more details see my Arkham Quixote post), so the allocation lists grow to a considerable size, and the system is hit with something it cannot brute-force its way through: quadratic algorithms.
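For reference, here’s what the explicit D3D12 side of that looks like - a minimal sketch that assumes the device and heap already exist and leaves the decision of when to call each function to the app’s streaming logic:

```cpp
// Minimal sketch of explicit residency in D3D12: the app, not the OS, decides
// when a heap occupies video memory. Error handling omitted; when to call
// these is entirely up to the app's streaming logic.
#include <d3d12.h>

void OnStreamingRegionNeeded(ID3D12Device* device, ID3D12Heap* heap)
{
    // Bring the heap (and everything placed in it) back into video memory
    // before executing any command list that references it.
    ID3D12Pageable* pageables[] = { heap };
    device->MakeResident(1, pageables);
}

void OnStreamingRegionUnneeded(ID3D12Device* device, ID3D12Heap* heap)
{
    // Tell the OS this memory can be paged out under pressure. The heap and
    // its contents still exist; they just may not be sitting in VRAM.
    ID3D12Pageable* pageables[] = { heap };
    device->Evict(1, pageables);
}
```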

Roughly, a quadratic algorithm is one whose execution time increases with the square of the input size, while a linear one is one whose execution time increases in direct proportion to the input size - so if a linear algorithm takes one second to process one item it would take ten seconds to process ten items, while a quadratic algorithm that takes one second to process one item will take one hundred seconds to process ten items. And Microsoft’s system code has so many quadratic algorithms in it that Bruce Dawson’s blog has an entire tag / section dedicated to them, and the driver layer is unfortunately no exception. You cannot brute-force your way out of algorithmic complexity, and I don’t think managing the residency list is something that can be parallelized, so you’d be bottlenecked on single-threaded performance in this case, which makes it even harder to brute force.
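To show the shape of the problem - and to be clear, this is a generic illustration, not the actual driver or kernel code - the classic way this happens is scanning an ever-growing list once per item processed, instead of doing a cheap lookup:

```cpp
// Generic illustration of how a quadratic pattern sneaks in - not the actual
// driver code, just the shape of the problem. Checking every referenced
// resource against an ever-growing residency list by linear scan is O(n * m).
#include <vector>
#include <unordered_set>

struct Resource { int id; };

// Quadratic-ish: for each referenced resource, scan the whole list.
bool AllResidentSlow(const std::vector<Resource>& referenced,
                     const std::vector<Resource>& residencyList)
{
    for (const Resource& r : referenced) {              // m iterations...
        bool found = false;
        for (const Resource& e : residencyList)         // ...of n steps each
            if (e.id == r.id) { found = true; break; }
        if (!found) return false;
    }
    return true;
}

// Linear-ish: build a hash set once, then each lookup is O(1) on average.
bool AllResidentFast(const std::vector<Resource>& referenced,
                     const std::vector<Resource>& residencyList)
{
    std::unordered_set<int> resident;
    for (const Resource& e : residencyList) resident.insert(e.id);
    for (const Resource& r : referenced)
        if (resident.count(r.id) == 0) return false;
    return true;
}
```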

Some things can be done about this; it all depends on how much effort you can dedicate to it. A ReShade-based wrapper like Arkham Quixote could take things one step further and allocate one big D3D11 buffer, then reroute all renaming to it, bypassing the MAP_DISCARD bottleneck. I think in theory you could even translate D3D11 calls into D3D12 and bypass the OS residency management for the most part, but that would be a non-trivial amount of effort.
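For what it’s worth, here’s the rough shape the “one big buffer” idea could take - very much a sketch rather than anything Arkham Quixote actually does today, with the GPU-progress tracking needed for safe wrap-around waved away, and keeping in mind that WRITE_NO_OVERWRITE on anything other than vertex / index buffers requires D3D 11.1:

```cpp
// Very rough sketch of the "one big buffer" idea: a single large dynamic
// buffer treated as a ring and written with WRITE_NO_OVERWRITE, so the
// runtime never needs its rename reserve. Not Arkham Quixote's actual code -
// the GPU-progress tracking required before wrapping is deliberately omitted.
#include <d3d11.h>
#include <cstring>

struct RingUploader {
    ID3D11Buffer* buffer = nullptr;  // e.g. 256MB, D3D11_USAGE_DYNAMIC
    UINT          capacity = 0;
    UINT          writeCursor = 0;   // next free byte

    // Returns the byte offset the caller should source GPU copies/draws from.
    // Real code must know the GPU is done with a range (e.g. via event
    // queries) before writeCursor is allowed to wrap back over it.
    UINT Write(ID3D11DeviceContext* context, const void* data, UINT size)
    {
        if (writeCursor + size > capacity)
            writeCursor = 0;  // wrap - only safe once the GPU is past this range

        D3D11_MAPPED_SUBRESOURCE mapped = {};
        // NO_OVERWRITE is a promise not to touch bytes the GPU may still be
        // reading, so the runtime neither renames nor stalls.
        context->Map(buffer, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &mapped);
        std::memcpy(static_cast<char*>(mapped.pData) + writeCursor, data, size);
        context->Unmap(buffer, 0);

        const UINT offset = writeCursor;
        writeCursor += size;
        return offset;
    }
};
```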

Maybe one day I’ll have the time to chase this great white whale, but for now I’m happy Arkham Quixote works well enough on my system.