Sherief, FYI

GPU Fault Telemetry

In the classic world of OpenGL and Direct3D <= 11, an app would create GPU resources (buffers, textures) and receive opaque handles to them which it later binds to specific slots before issuing draw or compute commands. While some forms of bindless work existed, for the most part the GPU driver would know which resources were being used for which draw / compute commands and the driver managed the GPU Virtual Addresses (VAs) of these resources and kept them hidden from the user. I’m aware of GL_NV_shader_buffer_load, which is like a wormhole to modern API functionality, but in the world of classic APIs it’s the exception rather than the rule.

In D3D12 / Vulkan, the app is directly exposed to GPU memory management and resource allocation, and the GPU VAs are explicit and exposed to the application. This grants application authors lots of power and optimization opportunities, but it also introduces them to some of the same hazards that come with CPU memory management via pointers: the GPU can now cause fatal page faults and access violations.

In CPU land, it’s possible for an app to intercept and log memory access violations via SIGSEGV handlers on Unix-like systems and SetUnhandledExceptionFilter() on Windows. In both cases you can get information about the memory address whose access caused the fault and whether the access was a read or a write - check your Unix platform’s man pages, or for Windows see the EXCEPTION_RECORD struct, namely its ExceptionCode and ExceptionInformation members.
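For the Unix side, here's a minimal sketch of what that looks like. It assumes Linux, where touching a PROT_NONE page raises SIGSEGV (other Unix-likes may deliver a different signal). The names probeFaultAddress, onSegv, demoGuardPageFault and the globals are made up for illustration. It only recovers the faulting address; telling a read from a write needs architecture-specific digging through the handler's third argument, which I've left out.

```cpp
#include <cassert>
#include <cstdint>
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

// Address the kernel reported for the most recent fault (0 = no fault seen).
static volatile std::uintptr_t g_faultAddress = 0;
// Where to resume after the handler has recorded the fault.
static sigjmp_buf g_recoverPoint;

// SA_SIGINFO-style handler: si_addr carries the faulting address.
static void onSegv(int, siginfo_t* info, void*)
{
    g_faultAddress = reinterpret_cast<std::uintptr_t>(info->si_addr);
    siglongjmp(g_recoverPoint, 1); // skip the faulting instruction entirely
}

// Reads one byte from `address` with a temporary SIGSEGV handler installed.
// Returns the address the kernel reported, or 0 if the read did not fault.
std::uintptr_t probeFaultAddress(volatile char* address)
{
    struct sigaction action = {};
    struct sigaction previous = {};
    action.sa_sigaction = onSegv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, &previous);

    g_faultAddress = 0;
    if (sigsetjmp(g_recoverPoint, 1) == 0)
    {
        char value = *address; // faults if the page is inaccessible
        (void)value;
    }

    sigaction(SIGSEGV, &previous, nullptr); // restore whatever was there before
    return g_faultAddress;
}

// Maps an inaccessible page, touches it, and checks the reported address.
bool demoGuardPageFault()
{
    const long pageSize = sysconf(_SC_PAGESIZE);
    void* page = mmap(nullptr, pageSize, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
    {
        return false;
    }
    volatile char* target = static_cast<volatile char*>(page) + 16;
    const std::uintptr_t reported = probeFaultAddress(target);
    munmap(page, pageSize);
    return reported == reinterpret_cast<std::uintptr_t>(target);
}
```

A real crash handler would log the address and bail out rather than jump back into the program; the siglongjmp here is only so the sketch can keep running after the fault.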

Unfortunately, there doesn’t seem to be a straightforward, vendor-neutral, negligible-or-zero-overhead method for retrieving page faults occurring under D3D12. NVIDIA Nsight Aftermath has been around for a few years now, there’s the recently launched AMD Radeon GPU Detective (RGD), and then there’s the closest thing to a vendor-neutral solution: D3D12 Device Removed Extended Data (DRED).

Aftermath has non-zero overhead. RGD requires a crash that you can reproduce at will while running under RGD - and it only supports RDNA2 and RDNA3, which is too short a list IMO. DRED, according to its authors, has a 2% to 5% overhead in AAA settings, which also precludes it from being always-on in a retail deployment.

None of these fit my needs. I’d like something that is as vendor-neutral as I can get, negligible-to-zero-overhead, and suitable for shipping to end users in retail. I had to roll my own, and it took me quite a bit of platform header spelunking to find good leads.

Let’s work backwards. The data I need can be found in the D3DKMT_DEVICEPAGEFAULT_STATE struct, especially the FaultedVirtualAddress. I can get it through the D3DKMTGetDeviceState function when providing D3DKMT_DEVICESTATE_PAGE_FAULT as the StateType. But the D3DKMT_GETDEVICESTATE struct passing the arguments to that function needs an hDevice, and the D3D12 user mode API does not expose that - it’s not the ID3D12Device* that you use to call device functions.

So far I can get the page fault info if I have the D3DKMT_HANDLE hDevice, but no amount of header spelunking led me to a mapping from ID3D12Device to hDevice (if you know one exists and I’ve missed it, please reach out to me).

I decided, out of necessity, to get creative. I grabbed minhook and added it to my project, and before I create a D3D12 device I’d hook D3DKMTCreateDevice:

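// Assumes MH_Initialize() has already returned MH_OK; MH_CreateHook() fails otherwise.
MH_STATUS s = MH_OK;
HMODULE modGdi32 = LoadLibraryA("gdi32.dll");
assert(modGdi32);
// GetProcAddress() returns a FARPROC, so cast explicitly to the void* minhook expects.
void* pfnD3DKMTCreateDevice = reinterpret_cast<void*>(GetProcAddress(modGdi32, "D3DKMTCreateDevice"));
assert(pfnD3DKMTCreateDevice);
// Route calls to our hook; minhook stores a trampoline to the real function in origD3DKMTCreateDevice.
s = MH_CreateHook(pfnD3DKMTCreateDevice, reinterpret_cast<void*>(&D3DKMTCreateDeviceHook), reinterpret_cast<void**>(&origD3DKMTCreateDevice));
if(s != MH_OK)
{
    panic("hooking D3DKMTCreateDevice() failed");
}
assert(origD3DKMTCreateDevice);
// Hooks are created disabled and must be enabled explicitly.
s = MH_EnableHook(pfnD3DKMTCreateDevice);
if(s != MH_OK)
{
    panic("hook enable failed");
}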

The hook itself is simple: it calls the original function and, if that succeeds, stashes the returned hDevice in a global. Right after my call to the user-mode D3D12CreateDevice() returns, I grab the hDevice from that global:

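// Trampoline to the original D3DKMTCreateDevice, filled in by MH_CreateHook().
decltype(D3DKMTCreateDevice)* origD3DKMTCreateDevice = nullptr;
// Kernel-mode handle of the most recently created device.
D3DKMT_HANDLE stashedDeviceHandle = 0;

NTSTATUS APIENTRY D3DKMTCreateDeviceHook(
    D3DKMT_CREATEDEVICE *params
)
{
    NTSTATUS r = origD3DKMTCreateDevice(params);
    if(r == 0) //STATUS_SUCCESS
    {
        // Every successful call overwrites the stash, so read it right after D3D12CreateDevice() returns.
        stashedDeviceHandle = params->hDevice;
    }
    return r;
}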

Et voila! This works in a retail setting, and has zero overhead until a page fault occurs. Once I see a device removed, I grab the page fault data and look at the fault’s GPU VA. With logging / tracking of the VAs for all allocated heaps and resources, searching for the closest interval [VAstart, VAend] to the fault address usually yields some pointers that help post-mortem investigation. A save game made at that point gets attached to the fault report and now I can investigate GPU page faults that happen in the wild with no overhead, across all vendors on supporting hardware, and get a save game and a finger-of-blame pointing at a resource. It’s such a boon for bindless on modern APIs and I wish the user-facing APIs exposed something like this.
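The closest-interval search can be sketched roughly like this. It's an illustration rather than my actual tracking code: VaTracker, TrackedRange and ClosestMatch are hypothetical names, and it assumes the tracked ranges don't overlap and have non-zero size (so you'd track heaps and the resources placed inside them in separate trackers).

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

struct TrackedRange
{
    std::uint64_t start; // first GPU VA of the allocation
    std::uint64_t size;  // size in bytes, assumed non-zero
    std::string name;    // debug name to put in the fault report
};

struct ClosestMatch
{
    const TrackedRange* range; // nullptr when nothing is tracked
    std::uint64_t distance;    // bytes to the nearest valid byte; 0 = inside
};

class VaTracker
{
public:
    void add(std::uint64_t start, std::uint64_t size, std::string name)
    {
        ranges_[start] = TrackedRange{start, size, std::move(name)};
    }

    // Finds the tracked range containing `va`, or failing that the one
    // whose nearest byte is closest to it. Assumes ranges do not overlap.
    ClosestMatch closest(std::uint64_t va) const
    {
        ClosestMatch best{nullptr, 0};
        auto after = ranges_.upper_bound(va); // first range starting past va

        if (after != ranges_.begin())
        {
            // Last range starting at or before va: it either contains va
            // or ends somewhere before it.
            const TrackedRange& before = std::prev(after)->second;
            const std::uint64_t last = before.start + before.size - 1;
            best.range = &before;
            best.distance = va > last ? va - last : 0;
        }
        if (after != ranges_.end())
        {
            const std::uint64_t gap = after->second.start - va;
            if (best.range == nullptr || gap < best.distance)
            {
                best.range = &after->second;
                best.distance = gap;
            }
        }
        return best;
    }

private:
    std::map<std::uint64_t, TrackedRange> ranges_; // keyed by start VA
};
```

A distance of zero means the fault landed inside a live allocation; a small non-zero distance hints at an out-of-bounds access next to that resource. Keeping recently freed ranges around in a second tracker would presumably help with use-after-free cases too.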

In the future I’d like to investigate whether this works for Vulkan devices too. I think it should, but you can’t be too sure till you try. Here’s hoping I eventually have the bandwidth.