I'm not sure if you're aware of it, but there is a new-but-old kid on the hashing scene: BLAKE3 (https://github.com/BLAKE3-team/BLAKE3). BLAKE and BLAKE2 were already impressive and have been adopted by many projects and apps, but the BLAKE team has outdone itself this time. BLAKE3 has built-in multi-threading and is extremely fast. Apart from BLAKE2, only a few obscure algorithms offer built-in multi-threading.
I can confirm their self-published benchmarks, based on my ongoing multi-threaded hashing development. Single-threaded, BLAKE3 beats SHA1, currently the fastest of DOpus's internal algorithms, by leaps and bounds (2.8 GB/s vs 300-310 MB/s!); it's not even funny. Only if I parallelize SHA1 with DOpus, cheat, and optimize everything perfectly in SHA1's favor does SHA1 beat BLAKE3, and then by a much smaller margin (4.8 GB/s max vs 2.8 GB/s) than the other way around. SHA1 wins only because I cheat in its favor, in an extremely unlikely situation; with multi-threaded SHA1 I typically reach 1.5-1.6 GB/s at best. And of course SHA1 is built into DOpus, whereas my script has to spawn CMD.exe processes for BLAKE3.
Leo said in this thread that there's a bug with files > 512 MB in the SHA256 implementation. I wonder if you would consider including BLAKE3 when you revisit that part of your codebase, or in a future DO release. Their binary release is apparently compiled from Rust, but the GitHub page has all the C source code and docs.
Don't let it bother you.
The C implementation is not multi-threaded the way the Rust one is, but thanks to SIMD instructions and the algorithm's design (it is based on a Merkle tree), the hashing is parallelized inside the CPU. Note that the speed benchmark is single-threaded, and on my machine I see very little difference in speed even if I disable multi-threading with "--no-mmap".
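To illustrate why a Merkle-tree layout parallelizes so well, here is a rough Python sketch. It is not BLAKE3 and will not produce BLAKE3 digests: it uses `hashlib.blake2b` from the standard library as a stand-in, and the helper names (`leaf_hash`, `parent_hash`, `tree_hash`) and the prefix bytes are made up for this example. The only point is that each chunk is hashed independently of the others, so the leaves can be computed in any order (across SIMD lanes or threads) and the result stays the same.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024  # BLAKE3 also works on 1024-byte chunks; the rest is simplified


def leaf_hash(chunk: bytes) -> bytes:
    # Stand-in leaf function (blake2b, NOT the real BLAKE3 compression function).
    # The 0x00 prefix keeps leaf inputs distinct from parent inputs.
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    # Combine two child digests into one parent digest.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def tree_hash(data: bytes, workers: int = 4) -> bytes:
    # Split the input into fixed-size chunks (a single empty chunk for empty input).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]

    # Leaves do not depend on each other, so they can be hashed concurrently.
    # The thread pool here only demonstrates that independence, not real speed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(leaf_hash, chunks))

    # Reduce pairwise until a single root digest remains.
    while len(level) > 1:
        nxt = [parent_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd node is carried up unchanged
        level = nxt
    return level[0]


if __name__ == "__main__":
    payload = bytes(range(256)) * 40  # 10240 bytes = 10 chunks
    print(tree_hash(payload, workers=1).hex())
    print(tree_hash(payload, workers=8).hex())  # same digest, different worker count
```

A classic chained hash like SHA1 cannot do this, because each block's state depends on the previous block.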
As the BLAKE3 team's results in the image in the OP show, single-threaded BLAKE3 is far faster than SHA1 and only marginally slower than multi-threaded BLAKE3. My script can beat it only if I cheat and use one identical file per CPU thread, so that no single thread takes much longer than the rest.
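The reason the "cheat" matters can be shown with a bit of arithmetic. The sketch below assumes one file per thread, that every thread sustains the same hashing rate regardless of the others (no disk bottleneck), and a hypothetical 16-thread CPU; the function name and the file sizes are invented for illustration. The 300 MB/s figure is the single-threaded SHA1 rate mentioned earlier.

```python
def aggregate_throughput(file_sizes_mb, per_thread_mb_s):
    """Overall MB/s when each file is hashed on its own thread.

    Assumption: all threads start together and run at the same rate,
    so the wall-clock time is set by the thread with the largest file.
    """
    wall_clock_s = max(size / per_thread_mb_s for size in file_sizes_mb)
    return sum(file_sizes_mb) / wall_clock_s


if __name__ == "__main__":
    rate = 300  # MB/s, single-threaded SHA1 as quoted in the thread

    # The "cheat": 16 identical files, every thread finishes at the same time.
    identical = [1000] * 16
    print(aggregate_throughput(identical, rate))

    # A more realistic mix: one big file keeps a single thread busy
    # long after the other 15 are done, and the overall rate collapses.
    uneven = [16000] + [1000] * 15
    print(aggregate_throughput(uneven, rate))
```

With identical files the total scales with the thread count; with one oversized file most threads sit idle while the last one finishes, which is why the best-case number is so hard to reach in practice.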