David Bakin’s programming blog.

ReFS disk scrubbing doesn’t play nice with other work on the same disks

[Update: found the way to schedule ReFS disk scrubbing, see the end of the post.]

ReFS has great data integrity features, especially when running it on top of a Windows Storage Spaces resilient volume (e.g., when mirrored).  You can set it to do full file content integrity, which it does by keeping checksums of everything that’s written and then periodically scrubbing the disk and comparing the actual contents read to the expected checksum.  If one mirror reads bad data and the other mirror is correct then ReFS will fix up the bad copy.  This is all great stuff!  Except when it isn’t, of course … (why can’t I have my cake and eat it too?)

Today I was experiencing extremely sucky disk performance and couldn’t figure out why.  (If you must know, μTorrent kept reporting “disk overloaded” even when download/upload speeds were fairly low.)  I remembered that in the past I would have a day or two where I’d have extremely sucky disk performance but it would go away.  This time, it was annoying enough that I wanted to find out what was wrong.

TL;DR version: Periodic disk scrubbing had kicked in and was running full throttle on the disk.

I investigated this way:  First I looked at the Resource Monitor for Disk usage.  It showed that a volume I wasn’t using was having continuous high traffic.  In fact, it showed that System (PID 4) was reading a 70 GB tar file in a directory of 400 GB of tar files that I never ever touch.  (It is an enormous repository of Java/C++ sources I acquired for a project I started and haven’t actually worked on in a long time.)  I then checked Windows Defender:  It wasn’t scanning, and MsMpEng.exe wasn’t using any CPU either.  I don’t know what sparked my thought process, but I finally googled for “ReFS disk integrity” and found a suggestion to check the Event Log Microsoft/Windows/DataIntegrityScan (under Applications and Services Logs).
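If you'd rather check that log from a script than click through Event Viewer, here is a minimal sketch that shells out to `wevtutil`. The channel name `Microsoft-Windows-DataIntegrityScan/Admin` is an assumption about how the Event Viewer path above is spelled on the command line; `wevtutil el` lists the real channel names on your machine.

```python
import subprocess

# ASSUMPTION: this is the command-line name of the Event Viewer log
# "Microsoft/Windows/DataIntegrityScan". Verify it with `wevtutil el`.
CHANNEL = "Microsoft-Windows-DataIntegrityScan/Admin"


def recent_events_command(channel=CHANNEL, count=20):
    """Build a wevtutil command that prints the newest `count` events as text."""
    return [
        "wevtutil", "qe", channel,
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # reverse direction: newest events first
        "/f:text",       # plain text instead of XML
    ]


def show_recent_events(channel=CHANNEL, count=20):
    """Run the query (Windows only) and return whatever wevtutil printed."""
    result = subprocess.run(
        recent_events_command(channel, count),
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr
```

Calling `print(show_recent_events())` should dump the most recent scan start and finish events, which is enough to see whether a scrub is in progress and how long the previous one took.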

Sure enough, it showed a scan of my ReFS volume had commenced in the early afternoon and was still going at 9 PM.  Looking back in the log just a short way I found the last such scan ran 3 weeks ago and took 40 hours to complete!  (It’s a 4.3 TB volume, striped as well as mirrored; I typically get sustained read speeds of ~170-180 MB/s and Resource Monitor was showing System reading this tar file at around 110 MB/s.)

I also discovered, in the logs, events showing that if the scrub is interrupted by rebooting it continues after boot.  I did reboot a couple of times today in order to fix an issue with my Logitech mouse device driver.  (Don’t ask.)  I don’t know if it restarts the scan or continues from where it got interrupted; I presume the latter.
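Those numbers are worth a quick sanity check. The sketch below assumes the scrubber reads the full 4.3 TB exactly once, in decimal units; the post doesn't say how full the volume is or whether both mirror copies get read, so treat the results as rough.

```python
def scan_hours(volume_tb, rate_mb_per_s):
    """Hours needed to read `volume_tb` terabytes at `rate_mb_per_s` MB/s."""
    total_mb = volume_tb * 1_000_000      # decimal units: 1 TB = 1,000,000 MB
    return total_mb / rate_mb_per_s / 3600


def average_rate_mb_per_s(volume_tb, hours):
    """Average MB/s implied by reading `volume_tb` terabytes in `hours` hours."""
    return volume_tb * 1_000_000 / (hours * 3600)


# ASSUMPTION: one full pass over 4.3 TB, nothing skipped, nothing read twice.
print(round(scan_hours(4.3, 110), 1))            # 10.9 hours at the observed 110 MB/s
print(round(average_rate_mb_per_s(4.3, 40), 1))  # 29.9 MB/s averaged over a 40-hour scan
```

Under that assumption a 40-hour scan averages only about 30 MB/s, far below the 110 MB/s seen on one big tar file. That would be consistent with the scrubber going much slower over the rest of the volume, though it doesn't prove it.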

To be perfectly precise here, my problem may be that I have two volumes, the ReFS volume and an NTFS volume, running on the same underlying Windows Storage Pool, that is, on the same disks.  My μTorrent traffic is directed at the NTFS volume (so I can use smaller 4 KB disk clusters, which play better with μTorrent).  It is possible that the scrubber would behave better if the ReFS volume were the only user of the underlying Storage Pool.  (But if that’s the issue, it is rather lame for an otherwise very well implemented feature.)

I can’t find any documentation or blog posts anywhere on the net that explain how to either schedule these scrubs or make the scrubber throttle itself.

Update: The ReFS disk scrubber runs on a task schedule – see the Task Scheduler under Task Scheduler Library/Microsoft/Windows/Data Integrity Scan. Change the schedule to a time that won’t impact you, or disable it altogether … but remember to run it manually before you get into trouble! I haven’t found a way to throttle it so that it can run slowly and steadily without impacting other work on the box.
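The same changes can be made from a script with `schtasks`. This is a minimal sketch, not a tested recipe: the folder comes from the Task Scheduler path above, but the task name inside it (`Data Integrity Scan`) is an assumption, so check it in Task Scheduler first. Changing system tasks generally needs an elevated prompt.

```python
import subprocess

# Folder taken from the Task Scheduler path above.
# ASSUMPTION: the task inside that folder is also named "Data Integrity Scan".
TASK = r"\Microsoft\Windows\Data Integrity Scan\Data Integrity Scan"


def schtasks_command(action, task=TASK):
    """Build the schtasks command line for one of a few simple actions."""
    actions = {
        "show": ["/Query", "/TN", task, "/V", "/FO", "LIST"],  # triggers, last run, status
        "disable": ["/Change", "/TN", task, "/DISABLE"],
        "enable": ["/Change", "/TN", task, "/ENABLE"],
        "run": ["/Run", "/TN", task],                          # start a scrub right now
    }
    return ["schtasks"] + actions[action]


def apply(action, task=TASK):
    """Run the command (Windows only) and return schtasks' output."""
    result = subprocess.run(schtasks_command(action, task),
                            capture_output=True, text=True)
    return result.stdout or result.stderr
```

For example, `apply("disable")` turns the scheduled scrub off, and `apply("run")` kicks one off by hand at a time when you won't miss the disk bandwidth.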

And after I did this I still had problems with unresponsiveness … and much more often! I tracked that down to Regular Maintenance. I can’t tell everything that goes on during Regular Maintenance but at least part of it is the defragger, which, on a large volume, is terribly slow. I had to go to Task Scheduler Library/Microsoft/Windows/TaskScheduler and disable Idle Maintenance, because even though I configured it to stop when the computer was no longer idle it just kept going and going and going. Also I changed the schedule on Regular Maintenance so it happens a lot less often (like, every other weekend). And finally, I disabled Maintenance Configurator because if you let that run it automagically resets your changes to the other maintenance tasks. (I forget where I read about that necessary fix; I wish I remembered so I could thank the guy.) I wish I knew if there was any “maintenance” I’ve turned off that I’ll miss later …
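Since something can quietly reset these tasks, it may be worth checking now and then that they are still disabled. Here is a rough sketch that parses the output of `schtasks /Query /FO CSV /NH`. It assumes the CSV columns are task name, next run time, status, and that the status reads `Disabled` on an English-language Windows; both are assumptions to verify on your own system.

```python
import csv
import io

# Task folder and names as they appear in the post.
FOLDER = "\\Microsoft\\Windows\\TaskScheduler\\"
IDLE = FOLDER + "Idle Maintenance"
CONFIGURATOR = FOLDER + "Maintenance Configurator"


def query_command(task):
    """schtasks command that prints one CSV row (no header) for `task`."""
    return ["schtasks", "/Query", "/TN", task, "/FO", "CSV", "/NH"]


def parse_status(csv_text):
    """Map task name -> status.

    ASSUMPTION: each row is (task name, next run time, status).
    """
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[2] for row in rows if len(row) >= 3}


def re_enabled(statuses, expected_disabled):
    """Return the tasks that were reported and are no longer 'Disabled'."""
    return [t for t in expected_disabled
            if t in statuses and statuses[t] != "Disabled"]
```

On Windows you would feed `parse_status` the combined output of running `query_command` for each task, then call `re_enabled(statuses, [IDLE, CONFIGURATOR])`. An empty list means the changes have survived.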