Code Change Request

# 27224


Christopher
Technical Support
StableBit CloudDrive
1.0.0.744
Windows 10 (64 bit)
Public
Alex

* [Issue #27224] Fixed uploads getting stuck in WholeChunkIoImplementation.
* [Issue #27224] Dynamically scale the upload / download buffer size depending on the transfer speed. This prevents transfer timeouts 
                 on slower connections and lowers the transfer overhead for faster connections.
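
A minimal sketch of what speed-based buffer scaling could look like. The changelog does not describe the actual algorithm, so everything here is an assumption: the function name `scale_buffer_size`, the target of roughly two seconds of data per buffer, and the minimum and maximum sizes are all hypothetical.

```python
def scale_buffer_size(bytes_per_second, target_seconds=2.0,
                      min_bytes=64 * 1024, max_bytes=20 * 1024 * 1024):
    """Pick a transfer buffer size from the measured transfer speed.

    Hypothetical policy: size each buffer to hold about `target_seconds`
    worth of data, clamped to [min_bytes, max_bytes]. A slow link gets a
    small buffer (each request finishes before it can time out); a fast
    link gets a large buffer (fewer requests, less per-request overhead).
    """
    wanted = int(bytes_per_second * target_seconds)
    return max(min_bytes, min(max_bytes, wanted))


slow = scale_buffer_size(10_000)         # ~10 KB/s link, clamped up to the minimum
medium = scale_buffer_size(1_000_000)    # ~1 MB/s link, two seconds of data
fast = scale_buffer_size(100_000_000)    # ~100 MB/s link, clamped down to the maximum
```

With these assumed defaults, the slow link is held at the 64 KB floor and the fast link at the 20 MB ceiling, while speeds in between scale linearly.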
Public
Alex

I have an appointment for this on Monday, but for the past week or so I've been working on a lockup issue of my own that became readily reproducible. It's an extremely puzzling case because it was only happening on one particular cloud drive, on one particular machine. Creating an identical cloud drive with identical data on the same machine would not cause the lockup.

For me, the problem was that the entire system would lock up daily whenever CloudDrive was installed. When I investigated with a debugger, it seemed that the VSS driver (volsnap.sys) was responsible for locking up the C: drive, and everything would snowball from there. Without CloudDrive installed, there were no lockups at all. I had the hardest time trying to reproduce it on cue, but after spending a lot of time with it I figured out how to do it.

The lockup is triggered by having VSS enabled on the cloud drive. Simply disabling VSS on the cloud drive and rebooting stopped the lockups completely. You can disable VSS from the Essentials Dashboard (Storage -> Hard Drives tab), or from Explorer's disk properties pane on other server OSes.

The way to reproduce the lockup itself was like this (and this is bizarre):
  1. Disable VSS on the cloud drive (to clear all VSS images).
  2. Enable VSS on the cloud drive.
  3. Take 1 VSS snapshot of the cloud drive.
  4. Perform "chkdsk /scan s:" 3 times in a row (on the cloud drive).
On the 3rd run, it would lock up. It's very odd and I have no idea why it was doing that. As I said before, creating an identical cloud drive with identical data on the same system and following the above steps did not reproduce the problem. It only affected that one drive, but on that drive it was reproducible at will by following the steps above.

I attempted a number of things to work around the problem, and finally made some progress yesterday. I've solidified the workaround in build .755, and it should effectively handle this and similar lockups in the future.

The reason why the system was locking up is as follows:
  • When performing a chkdsk /scan on a drive, the first thing chkdsk does is take a VSS snapshot of the drive.
  • In the process of preparing to take the snapshot, the VSS driver (volsnap.sys) will block I/O to the cloud drive while it performs some maintenance tasks on the "Diff" area. These tasks are performed on a separate worker thread, and involve reading and writing various data to and from the cloud drive.
  • While it is performing these tasks, volsnap, which sits right below NTFS, will also actively block all write I/O to all the volumes in the system.
  • This is a huge problem for CloudDrive (and technically any virtual disk solution that writes data to other volumes).
    • In order for CloudDrive to service a write (from the VSS snapshot preparation step above), it needs to issue the write to the temporary cache file on the cache disk.
    • But since volsnap is preparing to take a snapshot and is blocking write I/O to all volumes, that cache write can't complete, and the VSS preparation step never finishes.
    • This results in a system-wide deadlock, and everything grinds to a halt.
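
The circular wait described in the bullets above can be modeled as a small wait-for graph. This is only an illustrative sketch: the node names paraphrase the bullets, and `find_wait_cycle` and `WAITS_ON` are hypothetical names, not anything taken from volsnap or CloudDrive.

```python
def find_wait_cycle(waits_on, start):
    """Follow 'X waits on Y' edges from `start`.

    Returns the list of nodes forming a cycle (first node repeated at the
    end), or None if the chain ends at something that is not waiting.
    """
    path = []
    position = {}  # node -> index in path
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = waits_on.get(node)
    if node is None:
        return None  # chain ended: no circular wait
    return path[position[node]:] + [node]


# Each key cannot make progress until its value does.
WAITS_ON = {
    "volsnap snapshot prep": "cloud drive write",   # prep writes to the cloud drive
    "cloud drive write": "cache file write",        # CloudDrive must write to its cache
    "cache file write": "volsnap snapshot prep",    # volsnap blocks writes until prep is done
}

cycle = find_wait_cycle(WAITS_ON, "volsnap snapshot prep")

# If writes to other volumes were not blocked, the last edge would not exist.
unblocked = {k: v for k, v in WAITS_ON.items() if k != "cache file write"}
no_cycle = find_wait_cycle(unblocked, "volsnap snapshot prep")
```

In this model the deadlock exists only because of the third edge: once writes to the cache volume are no longer held up by the snapshot preparation, the chain terminates and no cycle is found.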
Now the strangest part of all of this is that it doesn't happen on every cloud drive. It was only happening on that one specific drive. I can only assume that it has something to do with the on-disk layout that was causing volsnap to take a different path and to lock up. I've tried to disassemble volsnap, but it's a complete mess of code. All I can tell is that it has something to do with double buffering writes.

The million dollar question is, why does volsnap block I/O to other volumes while taking a snapshot of the cloud drive? It really should never do that, as that can lead to deadlocks (as it did here).

I guess the moral of the story is this: if a filter driver in the storage stack blocks write I/O to other volumes while it performs I/O on a particular drive, that can cause a deadlock with CloudDrive. Build .755 should work around all of these kinds of issues. In addition, if this kind of problem is encountered in build .755, the user feedback system will now show a warning saying that "Cloud drive X is having trouble writing to cache drive Y." (hopefully without locking up the whole system).
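
One way such a warning could be produced is with a watchdog that gives up waiting on a stalled cache write and reports it instead of blocking forever. The post does not say how build .755 detects the condition, so this is a sketch under assumptions: the timeout approach, the function name `write_with_watchdog`, and its parameters are all hypothetical. Only the warning text comes from the post.

```python
import threading


def write_with_watchdog(write_fn, cloud_drive, cache_drive, timeout_s=5.0):
    """Run `write_fn` on a worker thread and wait at most `timeout_s` seconds.

    Returns None if the write finished in time, otherwise a warning string.
    The caller is never blocked indefinitely, even if the write itself is.
    """
    done = threading.Event()

    def worker():
        write_fn()
        done.set()

    # Daemon thread so a permanently stuck write cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    if done.wait(timeout_s):
        return None
    return (f"Cloud drive {cloud_drive} is having trouble writing "
            f"to cache drive {cache_drive}.")


# Simulate a cache write that is blocked by something else in the system.
blocked = threading.Event()
warning = write_with_watchdog(blocked.wait, "X", "Y", timeout_s=0.1)
blocked.set()  # release the simulated write

# A write that completes normally produces no warning.
ok = write_with_watchdog(lambda: None, "X", "Y", timeout_s=1.0)
```

Note that a write that raises an exception would also be reported as stalled after the timeout in this sketch; a real implementation would need to distinguish the two cases.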