Microsoft recently announced DirectStorage’s PC availability in a blog post, and I was immediately excited to try it out.
DirectStorage was developed for the Xbox Series consoles, which have a high-speed NVMe SSD, a hardware decompression engine, and a Unified Memory Architecture (UMA) where the CPU and GPU share the same memory pool. The API was designed to extract the highest throughput from that system, and on Xbox it succeeds at that goal.
But my computer is not an Xbox. On Xbox, moving data from compressed bytes on SSD storage to GPU-readable textures is simple: the compressed stream goes straight from the SSD to the hardware decompression engine, which writes the decompressed output into the unified memory pool, and the data is ready. On a PC, the GPU and CPU have separate memory pools and there is no hardware decompression engine, so how does DirectStorage (a performance-oriented API) deal with that?
In a performance-oriented scenario the data is compressed on disk, so a decompression step is required. On PC there are two paths: CPU decompression or GPU decompression. The latter isn't currently implemented, but I won't hold that against DirectStorage: GPU decompression has been announced as in the works, so I'll explore how both paths work assuming a future version that includes it, to give DirectStorage the best possible footing in this comparison.
In the CPU decompression case, DirectStorage moves compressed chunks into system memory. On non-UMA devices, which covers the majority of high-performance GPUs out there, it then runs CPU decompression code that writes into a GPU-visible upload/staging heap, and finally fires off a copy command list that uses the GPU's copy engine to transfer the data into discrete GPU memory. There are multiple problems with that approach. First, the upload heap DirectStorage creates is implicit and hidden from the user, and for any high-performance code implicit hidden allocations are A Bad Thing. Second, the size of this heap is specified per DirectStorage factory, not per queue; it eats into system memory and is a crucial value to tune based on the title's requirements, graphics detail level, and the amount of memory installed on the client's device. Third, DirectStorage creates a hidden, implicit copy command queue on your Direct3D 12 device, and if you care about performance the last thing you want is someone else creating queues on your Direct3D 12 device and scheduling work on them. Yes, Windows is a multitasking OS and the GPU is multiplexed anyway, but when you're the foreground app you get a scheduling boost, and DirectStorage will be interfering with that in a hidden manner you cannot even submit timer queries to.
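To make the per-factory scoping concrete, here is a minimal sketch of factory and queue setup using the DirectStorage 1.x headers. Note how the staging buffer size is set on the factory, before any queue exists, so every queue created from that factory shares one budget; the size used here is purely illustrative, not a recommendation.

```cpp
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void CreateLoadQueue(ID3D12Device* device)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    // Per-factory, not per-queue: this one value eats into system memory
    // for the lifetime of the factory. A shipping title would want to tune
    // it against installed RAM and the selected graphics detail level.
    factory->SetStagingBufferSize(64 * 1024 * 1024); // 64 MiB, illustrative

    DSTORAGE_QUEUE_DESC desc{};
    desc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    desc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    desc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    desc.Device     = device; // the hidden copy queue lands on this device

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&desc, IID_PPV_ARGS(&queue));
}
```

This requires Windows, `dstorage.lib`, and a D3D12 device, so treat it as a shape sketch rather than a drop-in snippet.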
In the ideal GPU decompression case, DirectStorage will copy compressed data directly to video memory (yay!) and then run its own shader on a (probably) hidden, implicit compute command queue to decompress it. If you didn't like a hidden, implicit copy queue on your D3D12 device, you'll absolutely hate a surprise compute command queue being scheduled alongside your rendering work. In general, for high-performance system-level APIs, things like allocating memory and running compute work must absolutely be under the control of the caller; look at how almost all APIs used in realtime situations like games support custom allocator interfaces. Abstractions aren't bad, but they don't belong in lower-level APIs like this. It's kind of like D3D11 versus D3D12: the average D3D11 application will easily beat the performance of the average D3D12 application, but the best possible D3D12 application will absolutely wipe the floor with the best possible D3D11 application. We need both in our ecosystem.
Let’s say I wanted to make the best out of this situation: I’ll manage my own upload heap, use my own CPU decompression, and even write my own compute-based GPU decompression shaders for my own GPU-friendly compression format. In this case I just want DirectStorage to move compressed data from the SSD to one of two destinations. The first is system memory, where I run my decompression code, write the output to an upload heap, and am responsible for submitting the copy command lists that do the actual upload. The second is a video memory UAV, where once again I’m responsible for scheduling my own decompression work. The latter case works, and you can wait on a GPU fence for the DirectStorage IO to complete (the direct NVMe-to-video-memory path is not currently implemented AFAIK, but you can code today as if it were and it will automagically work once the implementation is live, so zero issues there). The DirectStorage-to-system-memory copies, however, have no way of indicating completion whatsoever. Reiterating, since this surprised and shocked me: there is absolutely no way to find out whether or not a DirectStorage SSD-to-memory IO request has completed. No polling, no waiting, nothing. This is the first IO API I have ever seen in my entire life that has you initiate an IO request and provides no way whatsoever to know whether that request has completed. I am not sure how an API like this can see the light of day, but Microsoft seems to be working on it.
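The workable video-memory path looks roughly like the following sketch (DirectStorage 1.x headers assumed; the file path, sizes, and resource are illustrative). DirectStorage only moves bytes from the file into a GPU buffer we own, signals a fence we own, and everything after that, including decompression, stays under our control.

```cpp
#include <dstorage.h>
#include <wrl/client.h>

void LoadCompressedBlob(IDStorageFactory* factory,
                        IDStorageQueue*   queue,
                        ID3D12Device*     device,
                        ID3D12Resource*   gpuBuffer, // our UAV-capable buffer
                        uint32_t          sizeOnDisk)
{
    Microsoft::WRL::ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets/blob.bin", IID_PPV_ARGS(&file)); // path illustrative

    DSTORAGE_REQUEST req{};
    req.Options.SourceType          = DSTORAGE_REQUEST_SOURCE_FILE;
    req.Options.DestinationType     = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    req.Source.File.Source          = file.Get();
    req.Source.File.Offset          = 0;
    req.Source.File.Size            = sizeOnDisk;
    req.Destination.Buffer.Resource = gpuBuffer;
    req.Destination.Buffer.Offset   = 0;
    req.Destination.Buffer.Size     = sizeOnDisk;
    queue->EnqueueRequest(&req);

    // Completion is explicit here: DirectStorage signals a fence we created,
    // and our own compute queue can schedule the decompression dispatch
    // against that fence instead of the CPU wait shown below.
    Microsoft::WRL::ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    queue->EnqueueSignal(fence.Get(), 1);
    queue->Submit();

    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(1, evt);
    WaitForSingleObject(evt, INFINITE);
    CloseHandle(evt);
}
```

Contrast this with the system-memory destination, where there is no equivalent of `EnqueueSignal` you can observe: you enqueue the request and simply never learn when it lands.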
DirectStorage on PC is a minimal effort to get the code that compiles on Xbox Series consoles to compile and run as-is or with the slightest changes on a PC. However, there are fundamental architectural differences between a PC and an Xbox Series console and so far this effort is akin to filing the corners of a square peg to get it to fit, awkwardly, in a round hole. The abstraction is leaking around the edges.
If you really care about IO performance on Windows, ignore DirectStorage for the time being and either use I/O Completion Ports or look into the (barely documented) I/O rings.
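For reference, here is a minimal overlapped-read sketch using an I/O completion port (error handling trimmed; file name and sizes illustrative). Unlike the DirectStorage system-memory path described above, completion is explicit: `GetQueuedCompletionStatus` tells you exactly when, and with how many bytes, the read landed.

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Associate the file with a fresh completion port.
    HANDLE iocp = CreateIoCompletionPort(file, nullptr, /*key*/ 0, 0);

    alignas(4096) static char buffer[64 * 1024]; // NO_BUFFERING needs alignment
    OVERLAPPED ov{}; // Offset/OffsetHigh select the file position

    // Returns FALSE with ERROR_IO_PENDING; the read proceeds asynchronously.
    ReadFile(file, buffer, sizeof(buffer), nullptr, &ov);

    DWORD bytes = 0; ULONG_PTR key = 0; OVERLAPPED* done = nullptr;
    GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE);
    std::printf("read %lu bytes\n", bytes);

    CloseHandle(iocp);
    CloseHandle(file);
    return 0;
}
```

A real loader would keep many reads in flight and drain the port from a worker thread; the point is simply that the OS has shipped a complete, observable async IO model for decades.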