Thursday, October 27, 2011

TargetInstance Redirection Problems for FastIO on WinXP

Frankly, I don't expect many people to run into this problem, but I'd like to show some of the debugging that led me to figure it out and what the implications are. I've already explained how using TargetInstance might help filters, and some of the issues associated with it, in my post on File IO Redirection Between Volumes Using FltMgr. I also have a post on Handling IRP_MJ_NETWORK_QUERY_OPEN in a Minifilter, and I encourage you to revisit those posts if you need a refresher.
In the post on handling IRP_MJ_NETWORK_QUERY_OPEN my suggestion was to return STATUS_FLT_DISALLOW_FAST_IO if you don't want to deal with all the weird semantics it introduces. However, there is a small performance overhead associated with failing IRP_MJ_NETWORK_QUERY_OPEN in this manner, so while I was chasing down some performance issues I decided to actually implement this path.
The filter I was working on was a pretty classic design, returning STATUS_REPARSE to redirect IRP_MJ_CREATEs to a different volume. Also, let's use the simplifying assumption that the file name was exactly the same between the two volumes. This meant that in IRP_MJ_NETWORK_QUERY_OPEN I should be able to redirect the request to a different volume simply by changing the TargetInstance to the instance associated with the other volume; the request would then follow that path and get the attributes for the file on the other volume. And since no handle is opened as a result of this operation, I didn't have to worry about subsequent operations on it.
I'll post some pseudocode because there is just too much infrastructure to set things up properly in the passthrough sample. There is an instance context that I use to figure out if we need to redirect requests and where to redirect them (if there is no context attached then I don't redirect anything):
typedef struct _MY_INSTANCE_CONTEXT {

    PFLT_INSTANCE InstanceToRedirectTo;

} MY_INSTANCE_CONTEXT, *PMY_INSTANCE_CONTEXT;
And the following piece of code I've added to PtPreOperationPassThrough:
        if (!NT_SUCCESS(status)) {

            PT_DBG_PRINT( PTDBG_TRACE_OPERATION_STATUS,
                          ("PassThrough!PtPreOperationPassThrough: FltRequestOperationStatusCallback Failed, status=%08x\n",
                           status) );
        }
    }

    if (Data->Iopb->MajorFunction == IRP_MJ_NETWORK_QUERY_OPEN) {

        status = FltGetInstanceContext( FltObjects->Instance, &instanceContext );

        if (NT_SUCCESS(status)) {

            //
            // send this request to the instance we want it to go to and we must
            // mark the FLT_CALLBACK_DATA dirty.
            //

            Data->Iopb->TargetInstance = instanceContext->InstanceToRedirectTo;

            FltSetCallbackDataDirty( Data );

            //
            // we'll release this in the postOp callback.
            //
            
            *CompletionContext = (PVOID)instanceContext;
        } else {

            if (status == STATUS_NOT_FOUND) {

                //
                // this isn't an instance for which we want to redirect this 
                // operation, send the request down and don't care about
                // the postOp Callback.
                //

                return FLT_PREOP_SUCCESS_NO_CALLBACK;

            } else {

                //
                // some other error. we can either fail the request here or
                // we can just disallow FastIO (STATUS_FLT_DISALLOW_FAST_IO)
                // and we'll get another shot at it on the IRP_MJ_CREATE path.
                // note that a preOp callback returns an
                // FLT_PREOP_CALLBACK_STATUS, not an NTSTATUS.
                //

                return FLT_PREOP_DISALLOW_FASTIO;
            }
        }
    }

    return FLT_PREOP_SUCCESS_WITH_CALLBACK;
}
Again, this is very simplified to only show how to set the TargetInstance, but there are a couple of things I'd like to point out. Because of how FastIO works (each driver calls the next driver, passing parameters on the stack), FastIO doesn't have the problem described in my post on File IO Redirection Between Volumes Using FltMgr, because there is no IRP and there are no IO_STACK_LOCATIONs (it is possible to run out of thread stack, though that can also be worked around). Also, in terms of referencing, please note that I'm keeping a reference to the instance context from the preOp to the postOp callback, which in turn keeps the instance pointed to by instanceContext->InstanceToRedirectTo around (though of course there are multiple different ways to achieve the same result).
So anyway, the code I have works fine in Win7, after I disabled LUAFV, because LUAFV always fails IRP_MJ_NETWORK_QUERY_OPEN with STATUS_FLT_DISALLOW_FAST_IO. If you're wondering why I went through all the trouble when LUAFV will be running on all Vista and Win7 machines anyway, let me remind you that server SKUs don't have LUAFV running, so there are machines out there running the Win7 kernel without LUAFV in the picture and my code might actually help them (also, as you'd expect, performance is a much bigger concern for servers). However, on WinXP SP3 I kept getting STATUS_OBJECT_NAME_NOT_FOUND (and the other statuses that indicate that the file isn't there) even though the file was definitely present. Having tested that Win7 worked, I started to wonder whether there was something different in WinXP that I needed to worry about. So I decided to see whether the request makes it to the right volume after all:
1: kd> kn L5 
 # ChildEBP RetAddr  
00 f53f293c f8477888 myfilter!PreNetworkQueryOpen // this is my preOp callback
01 f53f299c f84791a7 fltMgr!FltpPerformPreCallbacks+0x2d4 // this calls the preOp callbacks
02 f53f29b4 f8485c7a fltMgr!FltpPassThroughFastIo+0x3b // this is FltMgr's function to process FastIO operations
03 f53f29f8 f83d6f70 fltMgr!FltpFastIoQueryOpen+0xf4 // FltMgr's FastIO callback for this operation
04 f53f2a18 805830fe sr!SrFastIoQueryOpen+0x40 // SR is issuing the request
1: kd> ?? Data // we need the address of the FLT_CALLBACK_DATA
struct _FLT_CALLBACK_DATA * 0x81b49684
   +0x000 Flags            : 2
...
1: kd> dt 0x81b49684 fltmgr!_FLT_CALLBACK_DATA Iopb->TargetInstance // See what is the instance the request was originally going to 
   +0x008 Iopb                 : 
      +0x00c TargetInstance       : 0x820d3008 _FLT_INSTANCE
1: kd> dt 0x820d3008 fltmgr!_FLT_INSTANCE Volume // get the volume from the instance 
   +0x018 Volume : 0x8237e5c0 _FLT_VOLUME
1: kd> dt  0x8237e5c0 fltmgr!_FLT_VOLUME DeviceObject // get FltMgr's DEVICE_OBJECT from the volume
   +0x01c DeviceObject : 0x823dac70 _DEVICE_OBJECT
1: kd> !devstack 0x823dac70 // see what's the bottom DEVICE_OBJECT for this volume. 
  !DevObj   !DrvObj            !DevExt   ObjectName
  823637a8  \FileSystem\sr     82363860  
> 823dac70  \FileSystem\FltMgr 823dad28  
  822fe020  \FileSystem\Ntfs   822fe0d8  // so we have NTFS on the bottom
1: kd> bp /t @$thread f8477888 // ok, now let's step out of my preOp callback on this thread and see what we change the instance to 
1: kd> bl
 0 e f8477888     0001 (0001) fltMgr!FltpPerformPreCallbacks+0x2d4
     Match thread data 81a3cbe8

1: kd> g
Breakpoint 0 hit
fltMgr!FltpPerformPreCallbacks+0x2d4:
f8477888 83f802          cmp     eax,2
1: kd> bc 0
1: kd> dt 0x81b49684 fltmgr!_FLT_CALLBACK_DATA Iopb->TargetInstance // it's the same FLT_CALLBACK_DATA but the instance should be different
   +0x008 Iopb                 : 
      +0x00c TargetInstance       : 0x820d9008 _FLT_INSTANCE
1: kd> dt  0x820d9008 fltmgr!_FLT_INSTANCE Volume // get the volume for the new instance
   +0x018 Volume : 0x820ebae0 _FLT_VOLUME
1: kd> dt 0x820ebae0 fltmgr!_FLT_VOLUME DeviceObject // get the DEVICE_OBJECT for the volume
   +0x01c DeviceObject : 0x820ebee8 _DEVICE_OBJECT
1: kd> !devstack 0x820ebee8 // see what's the bottom DEVICE_OBJECT… again, NTFS… 
  !DevObj   !DrvObj            !DevExt   ObjectName
  820eb020  \FileSystem\sr     820eb0d8  
> 820ebee8  \FileSystem\FltMgr 820ebfa0  
  820ea020  \FileSystem\Ntfs   820ea0d8  
1: kd> bp /t @$thread Ntfs!NtfsNetworkOpenCreate // ok, put a break on NTFS's function that processes this FastIO on this thread 
1: kd> g
Breakpoint 0 hit
Ntfs!NtfsNetworkOpenCreate:
f834ffb8 6878010000      push    178h
1: kd> bc 0
1: kd> kb L5 // show us the stack with parameters so we can see which device the request was actually sent to.
ChildEBP RetAddr  Args to Child              
f53f2968 f84790e8 81a6c380 f53f2c00 822fe020 Ntfs!NtfsNetworkOpenCreate // what do you know, it's the original DEVICE_OBJECT: 822fe020
f53f2988 f84791e4 000000f2 00000000 81b496c0 fltMgr!FltpPerformFastIoCall+0x300
f53f29b4 f8485c7a 003f29d8 823637a8 81a6c510 fltMgr!FltpPassThroughFastIo+0x78
f53f29f8 f83d6f70 81a6c380 f53f2c00 823dac70 fltMgr!FltpFastIoQueryOpen+0xf4
f53f2a18 805830fe 81a6c380 f53f2c00 823637a8 sr!SrFastIoQueryOpen+0x40
So what I did was to get the FLT_CALLBACK_DATA at the beginning of my callback and from that extract the file system's DEVICE_OBJECT on which the original request was sent. Then I let my callback run and I checked what the new stack instance was and got the file system's DEVICE_OBJECT on that stack. Then I simply let the request go until it hit the file system (NTFS on both volumes in this case) and then on the stack I can see which DEVICE_OBJECT the request was actually sent to. And, as I suspected, the request was sent on the original DEVICE_OBJECT and not the DEVICE_OBJECT for the instance I switched to. But why ? What should I have changed to make the request go where I wanted ? With some stepping through the code and reading a bunch of assembly I got to this part:
1: kd> u fltMgr!FltpPassThroughFastIo+0x55 L0xE
fltMgr!FltpPassThroughFastIo+0x55:
f84791c1 8b0f            mov     ecx,dword ptr [edi] // what is EDI
f84791c3 8b4664          mov     eax,dword ptr [esi+64h] // what is ESI ?
f84791c6 8d5e68          lea     ebx,[esi+68h]
f84791c9 53              push    ebx
f84791ca ff711c          push    dword ptr [ecx+1Ch]
f84791cd 8d4810          lea     ecx,[eax+10h]
f84791d0 ff7640          push    dword ptr [esi+40h]
f84791d3 51              push    ecx
f84791d4 33c9            xor     ecx,ecx
f84791d6 8a4805          mov     cl,byte ptr [eax+5]
f84791d9 0fb64004        movzx   eax,byte ptr [eax+4]
f84791dd 51              push    ecx
f84791de 50              push    eax
f84791df e804fcffff      call    fltMgr!FltpPerformFastIoCall (f8478de8)
1: kd> !pool @esi 2
Pool page 81b49628 region is Nonpaged pool
*81b49620 size:  108 previous size:   18  (Allocated) *FMic
  Pooltag FMic : IRP_CTRL structure, Binary : fltmgr.sys
1: kd> r @edi
edi=f53f29d8 // this is an address on the current stack
1: kd> dp f53f29d8
f53f29d8  8237e5c0 00000000 81b49628 ffffffff // so this structure has a pointer to the FLT_VOLUME and IRP_CTRL.. Must be the IRP_CALL_CTRL
f53f29e8  00000000 00000000 000001b4 0000493e
1: kd> dt @edi fltmgr!_IRP_CALL_CTRL
   +0x000 Volume           : 0x8237e5c0 _FLT_VOLUME
   +0x004 Irp              : (null) 
   +0x008 IrpCtrl          : 0x81b49628 _IRP_CTRL
   +0x00c StartingCallbackNode : 0xffffffff _CALLBACK_NODE
   +0x010 OperationStatusCallbackListHead : _SINGLE_LIST_ENTRY
   +0x014 Flags            : 0 (No matching name)
1: kd> dt fltmgr!_FLT_VOLUME
   +0x000 Base             : _FLT_OBJECT
   +0x014 Flags            : _FLT_VOLUME_FLAGS
   +0x018 FileSystemType   : _FLT_FILESYSTEM_TYPE
   +0x01c DeviceObject     : Ptr32 _DEVICE_OBJECT
….
So as you can see it looks like FltMgr picks the DEVICE_OBJECT from the IRP_CALL_CTRL->Volume structure. Let's see what happens in Win7:
0: kd> u fltmgr!FltpPassThroughFastIo+0x5a L0xD
fltmgr!FltpPassThroughFastIo+0x5a:
96019198 8b4668          mov     eax,dword ptr [esi+68h] // offset 0x68 where we had 0x64 in XP
9601919b 8d5e6c          lea     ebx,[esi+6Ch] // offset 0x6C where we had 0x68 in XP… did the IRP_CTRL change ?
9601919e 832300          and     dword ptr [ebx],0
960191a1 53              push    ebx
960191a2 ff763c          push    dword ptr [esi+3Ch] // and then there is a push for IRP_CTRL+0x3c instead of IRP_CALL_CTRL->Volume+0x1c..
960191a5 8d4810          lea     ecx,[eax+10h]
960191a8 ff7640          push    dword ptr [esi+40h]
960191ab 51              push    ecx
960191ac 0fb64805        movzx   ecx,byte ptr [eax+5]
960191b0 0fb64004        movzx   eax,byte ptr [eax+4]
960191b4 51              push    ecx
960191b5 50              push    eax
960191b6 e803fcffff      call    fltmgr!FltpPerformFastIoCall (96018dbe)
1: kd> dt fltmgr!_IRP_CTRL
   +0x000 Type             : _FLT_TYPE
   +0x004 Flags            : _IRP_CTRL_FLAGS
   +0x008 MajorFunction    : UChar
   +0x009 Reserved0        : UChar
   +0x00a CompletionStackLength : UChar
   +0x00b NextCompletion   : UChar
   +0x00c CompletionStack  : Ptr32 _COMPLETION_NODE
   +0x010 SyncEvent        : _KEVENT
   +0x020 Irp              : Ptr32 _IRP
   +0x020 FsFilterData     : Ptr32 _FS_FILTER_CALLBACK_DATA
   +0x024 AsyncCompletionRoutine : Ptr32     void 
   +0x028 AsyncCompletionContext : Ptr32 Void
   +0x02c InitiatingInstance : Ptr32 _FLT_INSTANCE
   +0x030 PendingCallbackNode : Ptr32 _CALLBACK_NODE
   +0x030 StartingCallbackNode : Ptr32 _CALLBACK_NODE
   +0x034 preOp            : __unnamed
   +0x034 postOp           : __unnamed
   +0x038 PostCompletionRoutine : Ptr32     void 
   +0x03c DeviceObject     : Ptr32 _DEVICE_OBJECT // so we get the DEVICE_OBJECT from the IRP_CTRL
...
Ok, so what's going on is that in Win7 it looks like the DEVICE_OBJECT is taken from the IRP_CTRL (which is the internal FltMgr structure that hosts the FLT_CALLBACK_DATA), which makes sense since we change the TargetInstance in the FLT_CALLBACK_DATA. In XP the DEVICE_OBJECT is taken from the IRP_CALL_CTRL->Volume and I haven't been able to find any code path that updates the IRP_CALL_CTRL. So based on this I've decided that this is an XP bug and that I can't really work around it for WinXP (since there is no way to update the FLT_VOLUME inside the IRP_CALL_CTRL which are both undocumented btw..). So I've updated my code so that in WinXP it always returns STATUS_FLT_DISALLOW_FAST_IO.
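To make the difference between the two lookups concrete, here is a small user-mode model of them. None of this is FltMgr code: the MODEL_ types and the Model functions are hypothetical stand-ins that keep only the fields discussed above, and the step where the Win7 model brings IRP_CTRL's DeviceObject in line with the new TargetInstance is my assumption about how that field ends up pointing at the right stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for FltMgr's internal structures. */
typedef struct MODEL_DEVICE_OBJECT { const char *Name; } MODEL_DEVICE_OBJECT;
typedef struct MODEL_VOLUME   { MODEL_DEVICE_OBJECT *DeviceObject; } MODEL_VOLUME;
typedef struct MODEL_INSTANCE { MODEL_VOLUME *Volume; } MODEL_INSTANCE;

typedef struct MODEL_IRP_CTRL {
    MODEL_DEVICE_OBJECT *DeviceObject;   /* what the Win7 code path pushes      */
    MODEL_INSTANCE      *TargetInstance; /* what the minifilter changes         */
} MODEL_IRP_CTRL;

typedef struct MODEL_IRP_CALL_CTRL {
    MODEL_VOLUME   *Volume;              /* captured before preOp callbacks run */
    MODEL_IRP_CTRL *IrpCtrl;
} MODEL_IRP_CALL_CTRL;

/* The minifilter's preOp: retarget the request at another instance. */
static void ModelPreOpRedirect(MODEL_IRP_CTRL *irpCtrl, MODEL_INSTANCE *newTarget)
{
    assert(irpCtrl != NULL && newTarget != NULL);
    irpCtrl->TargetInstance = newTarget;
}

/* ASSUMPTION: after the preOp callbacks, Win7 refreshes IRP_CTRL's
   DeviceObject from the (dirty) TargetInstance. */
static void ModelWin7SyncAfterPreOp(MODEL_IRP_CTRL *irpCtrl)
{
    irpCtrl->DeviceObject = irpCtrl->TargetInstance->Volume->DeviceObject;
}

/* XP-style lookup: IRP_CALL_CTRL->Volume->DeviceObject, which nothing updates. */
static MODEL_DEVICE_OBJECT *ModelPickDeviceXp(const MODEL_IRP_CALL_CTRL *callCtrl)
{
    return callCtrl->Volume->DeviceObject;
}

/* Win7-style lookup: IRP_CTRL->DeviceObject. */
static MODEL_DEVICE_OBJECT *ModelPickDeviceWin7(const MODEL_IRP_CALL_CTRL *callCtrl)
{
    return callCtrl->IrpCtrl->DeviceObject;
}
```

In this model, calling ModelPreOpRedirect and then asking both lookups for a device gives the original volume's device on the XP path and the redirected volume's device on the Win7 path, which matches what the debugger showed.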
Finally, there is one more aspect to discuss. It looks like FltpPassThroughFastIo is a generic handler for all FastIO routines and as such this problem might actually be affecting all FastIO in WinXP and not only IRP_MJ_NETWORK_QUERY_OPEN, so if you see that TargetInstance redirection isn't working then it might be this issue.

Thursday, October 20, 2011

Testing a Minifilter on More Filesystems: UDF and ExFAT

In this post I want to show a neat trick that allows testing of filters on other file systems with very little overhead. This is important because file system filters very often end up depending on the semantics of one specific file system, simply because it's easiest to test on just that one. For example, most filters are tested with NTFS and occasionally FAT. However, the world of Windows file systems is larger, and in most cases just a simple test is enough to expose bigger issues.

I'd also like to mention the little known fact that Alternate Data Streams (ADS) and hardlinks are not only available on NTFS but on UDF as well, which makes UDF quite useful for testing these features if a filter uses them. Though UDF is meant for optical media I'll show you how you can set up a local virtual volume very easily.

I'm going to use the VHD support in the OS, which means this works on Win7 (and newer OSes). I've found virtual disks to be very useful for testing, and they're easy to set up and automate. What follows is a diskpart session that creates a dynamic VHD:

C:\Users\Me>diskpart
Microsoft DiskPart version 6.1.7600
Copyright (C) 1999-2008 Microsoft Corporation.
On computer: xxxxxxx

DISKPART> create vdisk file=c:\testUDFS.vhd maximum=10000 type=expandable

  100 percent completed

DiskPart successfully created the virtual disk file.

DISKPART> attach vdisk

  100 percent completed

DiskPart successfully attached the virtual disk file.

Now that we have a VHD, the next step is to partition and format it:

DISKPART> create partition primary

DiskPart succeeded in creating the specified partition.

DISKPART> format fs=UDF label="UDFSVol" quick

  100 percent completed

DiskPart successfully formatted the volume.

DISKPART> assign

DiskPart successfully assigned the drive letter or mount point.

DISKPART> exit

Leaving DiskPart...

And that's all it takes. You now have a 10GB writable UDFS volume that can be used for testing filters in unusual setups:

C:\Users\Me>fsutil fsinfo volumeinfo I:
Volume Name : UDFSVol
Volume Serial Number : 0x96b7dca5
Max Component Length : 254
File System Name : UDF
Supports Case-sensitive filenames
Preserves Case of filenames
Supports Unicode in filenames
Supports Named Streams
Supports Hard Links

Finally, exactly the same approach can be used to create an NTFS volume or an ExFAT volume; all that needs to change is the line that formats the disk:

DISKPART> format fs=UDF label="UDFSVol" quick
or
DISKPART> format fs=ExFAT label="ExFATVol" quick
or
DISKPART> format fs=NTFS label="NTFSVol" quick

Thursday, October 13, 2011

The Delete Minifilter Sample

Update (3/9/2012): This is where you get the sample from: http://code.msdn.microsoft.com/windowshardware/Delete-File-System-b904651d.

As I mentioned in my previous post there is a new file system filter sample in the Win8 WDK, Delete. Unfortunately I don't think I can reproduce the sample code here (since it belongs to Microsoft) so you'll need a Win8 WDK to be able to follow along. This also makes this post rather dry and I'm sorry about that.
In my experience I've seen mainly three types of filters that need to interface with delete:
  • Layered FSD - these filters take complete control of some files or parts of the namespace and that means they need to implement deletion for those files as well. However, in that respect they are more similar to a file system than a filter (because they maintain the state of the file and thus are the authority for the state of the file) and so I think that the FastFat source code in the WDK is a better example for those.
  • Undelete Filters - these are filters that want to be able to "undelete" a file, either by moving it instead of deleting it or by preventing the user from deleting it in the first place under certain circumstances. For example, if the "Recycle Bin" feature were implemented in a filter then it would be in this category. One typical problem these filters run into is the inability to know whether a file is going to be deleted by querying the file system about the status of the file (by looking at the FILE_STANDARD_INFORMATION->DeletePending flag), and the inability to actually reset the DeletePending flag for files that were opened with FILE_DELETE_ON_CLOSE. As I mentioned in my previous post, this can be worked around by removing the FILE_DELETE_ON_CLOSE flag from the create and then sending a FILE_DISPOSITION_INFORMATION from postCreate to set the delete disposition, which enables the filter to query and reset the disposition at a later time.
  • Filters that need to know when a certain stream disappears from the system, for example filters that keep metadata about some streams on a file system (like encryption keys or sizes or the time they were last modified and so on). Such filters might want to change their state or remove some metadata when a stream they are tracking disappears from the file system. For these it might not matter how the user tries to delete a file and in some cases it might not even matter what the file name is (so they would still track files across renames or deletes to Recycle Bin), what they need to know is when a certain stream is gone from the underlying file system without the possibility for it to come back. This is what the Delete sample is trying to show.
So, to recap, the goal of the delete sample is to detect when files and streams disappear from the file system. Once this happens the filter simply prints a notification with the file (or stream) name on the debugger output.
One aspect to note is that the filter prints the name of a file after the file has been deleted. This means it can't query the file system for the name (since it doesn't make any sense to ask a file system for the name of a file that doesn't exist), so how is the filter supposed to know the name of the file? It might be tempting to try to implement a mechanism to be involved in all the name operations and thus to be able to know what the last name for the file was after the file has been deleted, but this is definitely not a trivial task. However, as I mentioned in my previous post about using names in file system filters, if a name is only needed for logging then it doesn't really matter whether the name is exactly in sync with the file system or not, since the name will be consumed later (usually much later) after the event happened, so the name will likely be out of sync with the file system anyway. This is especially true for delete operations since the file isn't even on the file system anymore. So the delete sample takes the approach of printing a name for the file or stream without trying too hard to make sure that the name is exactly the last name the file had in the file system (though it will be right in the vast majority of cases).
It's interesting to look at the actual implementation of how the name is generated and stored. If a stream is interesting (in other words, if there is a possibility that the stream will be deleted) then the name of the stream to be deleted is stored in the stream context as a referenced pointer to the FLT_FILE_NAME_INFORMATION structure that is populated during preCleanup. If the stream has been opened multiple times there will be multiple IRP_MJ_CLEANUPs that the filter receives and the code simply calls FltGetFileNameInformation every time and updates the FLT_FILE_NAME_INFORMATION structure so that the name that is stored is the latest name in the file system right before the last IRP_MJ_CLEANUP for the file. Another thing to note is that the name that is generated is the OPENED_NAME. As I've said before, getting a NORMALIZED_NAME is expensive and pretty much only really necessary when the name is to be compared with other names (or parts of name). In this case the name is intended to be "consumed" by someone looking at the debugger log and so the normalized name is not necessary.
One interesting issue that filters face when trying to keep track of the state in the file system is that the IO stack in NT is asynchronous and as such the order in which a minifilter sees requests is not necessarily the order in which the file system sees them. Let's use the example of two IRP_MJ_SET_INFORMATION calls that are racing down the IO stack, both trying to set a FILE_DISPOSITION_INFORMATION, one of them with DeleteFile set to TRUE and the other one with DeleteFile set to FALSE. Moreover, they are racing in a way that the filter sees both preOp callbacks before it sees the postOp callback for either of them (in other words both requests are being processed by layers below the filter at the same time). When a filter sees these requests it might see the one that sets it to TRUE and then the one that sets it to FALSE and assume that the delete disposition was set and then reset and so the file won't be deleted. However, it's very possible that the file system will receive the request that sets the delete disposition to FALSE before the one that sets it to TRUE, and so it will delete the file. This is clearly not a frequent case but it can happen. What the Delete filter does in this case is to keep a counter of the number of in-flight FileDispositionInformation operations it has seen. If there is only ever one operation then the filter can know for sure what the state is in the file system and so it registers a postOp callback where it checks if the operation was successful and, if it was, updates the information it keeps in the stream context with the disposition. If the filter ever processes more than one FileDispositionInformation operation then it gives up on trying to figure out what the state of the flag is in the file system and it falls back to its default behavior where it tries to figure out if the file was deleted from the file system.
This is a perfect example of how a minifilter can optimize the common case (where there is only one FileDispositionInformation operation issued at a time) but when it detects that it can't do that it must use other ways that are possibly less efficient.
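Since I can't reproduce the sample here, the following is only a rough user-mode model of that bookkeeping, not the sample's code. MODEL_STREAM_CONTEXT and the three Model functions are hypothetical names I made up for illustration, and a real filter would have to update the counter with interlocked operations (or under a lock) because these callbacks can run concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-stream state; not the sample's actual stream context. */
typedef struct MODEL_STREAM_CONTEXT {
    int  InFlightSetDisp;  /* disposition requests currently below the filter */
    bool RaceSeen;         /* set once two such requests have overlapped      */
    bool DeleteOnClose;    /* the last disposition the filter can trust       */
} MODEL_STREAM_CONTEXT;

/* preOp: count the request. Returns true while the filter can still track
   the disposition by itself (there was never more than one in flight). */
static bool ModelPreSetDisposition(MODEL_STREAM_CONTEXT *ctx)
{
    ctx->InFlightSetDisp++;
    if (ctx->InFlightSetDisp > 1) {
        ctx->RaceSeen = true;  /* give up on tracking, for good */
    }
    return !ctx->RaceSeen;
}

/* postOp: record the disposition only if the request succeeded and no race
   was ever observed on this stream. */
static void ModelPostSetDisposition(MODEL_STREAM_CONTEXT *ctx,
                                    bool succeeded, bool deleteFile)
{
    assert(ctx->InFlightSetDisp > 0);
    if (succeeded && !ctx->RaceSeen) {
        ctx->DeleteOnClose = deleteFile;
    }
    ctx->InFlightSetDisp--;
}

/* At cleanup time: can the tracked flag be trusted, or does the filter have
   to fall back to asking the file system whether the stream is gone? */
static bool ModelMustQueryFileSystem(const MODEL_STREAM_CONTEXT *ctx)
{
    return ctx->RaceSeen;
}
```

With a single request (pre, then post), ModelMustQueryFileSystem stays false and DeleteOnClose reflects the request. As soon as two preOp calls overlap, the model stops trusting its own flag and reports that the slower fallback is needed.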
This brings us to discussing how the minifilter can tell whether a file or stream was actually deleted. The minifilter relies on a couple of implementation details in NTFS, where NTFS knows that the file or stream has been deleted and it answers an IO request in a slightly different way to indicate that. For example, querying for FileStandardInformation after the file system has processed IRP_MJ_CLEANUP for a stream will fail with STATUS_FILE_DELETED if the stream was actually deleted. This requires one additional IO request to the file system (so it does have a performance hit), which makes it less optimal than the case where the filter can know for sure what the delete disposition is. In addition to this, if the delete was for an alternate data stream (ADS) it is possible (but not guaranteed) that the whole file will be deleted. So once the filter figures out the stream was actually deleted it must find out whether the whole file was deleted as well. Again, the filter relies on some undocumented behavior (or at least I'm not aware of it being documented anywhere): trying to get the OBJECT_ID for a file in postCleanup will return STATUS_FILE_DELETED as well if the whole file was deleted (please note that the call might still fail if there isn't an OBJECT_ID for the file, but it won't fail with STATUS_FILE_DELETED if the file hasn't been deleted). This doesn't work for transacted files, so in that case the filter tries to open the file by ID, which fails with STATUS_INVALID_PARAMETER if the file has been deleted inside a transaction.
Since I mentioned transactions it's interesting to see what the filter does in case of deletes inside transactions. It can use the same mechanism to detect delete operations like it would for non-transacted handles, but the additional complication is that a transaction can be rolled-back, undoing all the delete operations. So whenever the filter detects a file being deleted inside a transaction it will add that file to a list and when the transaction is finalized the filter checks to see if the transaction was rolled-back and if so it will notify that the streams that were previously deleted in that transaction have now come back.
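The shape of that bookkeeping can be sketched in a few lines of user-mode C. This is a toy model under my own assumptions, not the sample's implementation: MODEL_TRANSACTION, the fixed-size array, and both functions are hypothetical, and the real filter keeps this state in a transaction context and learns the outcome from the transaction notifications it gets from FltMgr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_MAX_DELETES 8

/* Hypothetical per-transaction state: names of the streams the filter saw
   being deleted inside this transaction. */
typedef struct MODEL_TRANSACTION {
    const char *Deleted[MODEL_MAX_DELETES];
    size_t      Count;
} MODEL_TRANSACTION;

/* Remember a delete that happened inside the transaction.
   Returns 0 on success, -1 if the (toy) list is full. */
static int ModelRecordDelete(MODEL_TRANSACTION *txn, const char *name)
{
    assert(txn != NULL && name != NULL);
    if (txn->Count == MODEL_MAX_DELETES) {
        return -1;
    }
    txn->Deleted[txn->Count++] = name;
    return 0;
}

/* Called when the transaction is finalized. On rollback every recorded
   delete is undone, so report each stream as having come back. Returns the
   number of notifications printed; the list is emptied either way. */
static size_t ModelFinishTransaction(MODEL_TRANSACTION *txn, int rolledBack)
{
    size_t notified = 0;

    assert(txn != NULL);
    if (rolledBack) {
        for (size_t i = 0; i < txn->Count; i++) {
            printf("delete undone by rollback: %s\n", txn->Deleted[i]);
            notified++;
        }
    }
    txn->Count = 0;
    return notified;
}
```

A committed transaction produces no extra notifications because the deletes it contained are final, while a rolled-back one reports every stream that was on the list.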
In fact, even just the fact that the minifilter supports transactions makes it an interesting sample, since other than the minispy sample there is no example of how a transaction-aware minifilter needs to be implemented, and the minispy sample simply lists the requests it sees and doesn't try to do anything beyond that.
Finally, another thing worth mentioning is how contexts are allocated during preCreate and set during postCreate (when necessary), which is different from the Ctx sample that tries to allocate the context during postCreate. This method has a couple of advantages. First, it allows the filter to fail an operation before the operation is seen by the file system if the filter can't get a context (not enough memory for example) instead of trying to deal with the failure after the operation happened, which is sometimes impossible because some operations can't be undone. Also, for operations where the postOp callback can be at DPC this allows the filter to allocate and set the context during preOp when the code isn't running at DPC and pass it in through the CompletionContext to the postOp callback, which can update it and then just call FltReleaseContext(), which is supported at DPC (please note that if the context needs to be accessible at DPC then it must be allocated from non-paged pool).
Unfortunately the delete sample doesn't show how to deal with more complicated types of deletes (overwriting renames for example), but it's still a welcome addition to the set of filter samples anyway.

Thursday, October 6, 2011

File Deletion

I've finally managed to install the Win8 WDK preview and I had a look at what's new in the WDK. There are some new things for file system filter developers and I plan to write about them in the coming weeks.

One of the most obvious new things is a new sample, "Delete". The purpose of the sample is to output a message to the debugger when a file has been deleted from the file system. This sounds like an easy task but it really isn't. One of the reasons for that is that delete semantics are rather different at the file system level than what one expects. This is one of those cases in file system filter development where everyone has previous experience with the concept but the concept is very different from the implementation. For example, just consider the Recycle Bin. As you might expect, the Recycle Bin is not a file system concept but rather a Windows OS concept which is implemented on top of the file system. What actually happens when a file is deleted to the Recycle Bin is that the file is renamed into a special hidden folder on the volume so that the user can't see the file anymore. However, as far as the file system is concerned, the file is not deleted at all but instead it has a new name.

Things are even more interesting when looking at how file systems handle deletes. From the user perspective things are not very complicated: there is a Win32 DeleteFile() function that can be used to delete a file by name. However, as we've discussed before on this blog, there is quite a lot involved in resolving a file name to an actual file in the file system, and as such it makes sense to keep all that complexity in one place, and that place is in the functions that deal with IRP_MJ_CREATE. This is why there is no "IRP_MJ_DELETE" request that takes a file path and deletes that file. Instead, the caller opens the file by name using an IRP_MJ_CREATE. Once that is done the user has a handle to the file and they can delete it. This can be done by using the IRP_MJ_SET_INFORMATION request with the FileDispositionInformation information class, which has a corresponding structure that describes what information should be set:

typedef struct _FILE_DISPOSITION_INFORMATION {
  BOOLEAN DeleteFile;
} FILE_DISPOSITION_INFORMATION, *PFILE_DISPOSITION_INFORMATION;

According to the documentation, once a user has set this flag "the only legal operation by such a caller is to close the open file handle". However, it's possible that there are more handles open to the file when this request is sent to the file system, and whoever has those handles open is not aware of the file having been deleted and so they will likely continue doing whatever it is they are doing. According to the documentation "A file marked for deletion is not actually deleted until all open handles for the file object have been closed and the link count for the file is zero". The documentation doesn't mention this, but most (if not all) of the operations that happen on the other handles will work just fine, and once the last handle is closed the file is deleted from the file system (the space that was occupied by the file is reclaimed and the file data is lost). However, it is possible for any of the other handles to detect that the disposition has been set by issuing an IRP_MJ_QUERY_INFORMATION request with the FileStandardInformation information class, which returns the FILE_STANDARD_INFORMATION structure in which the DeletePending member will be set to TRUE if this or another handle has issued a delete request:

typedef struct _FILE_STANDARD_INFORMATION {
  LARGE_INTEGER AllocationSize;
  LARGE_INTEGER EndOfFile;
  ULONG         NumberOfLinks;
  BOOLEAN       DeletePending;
  BOOLEAN       Directory;
} FILE_STANDARD_INFORMATION, *PFILE_STANDARD_INFORMATION;

But there is an interesting twist to this. The "DeleteFile" member of the FILE_DISPOSITION_INFORMATION structure is a BOOLEAN, which suggests it could also be set to FALSE. If the delete disposition could only be set to TRUE then why even have a member in that structure? Simply issuing the request would be enough to tell the file system that the file needs to be deleted. As it happens, the delete disposition can also be reset by sending the same IRP_MJ_SET_INFORMATION request with the FileDispositionInformation information class and the DeleteFile member set to FALSE. This means that the file will not be deleted from the file system once the final handle is closed, cancelling the previous request to delete the file. This call (to set DeleteFile to FALSE) will be successful regardless of whether the file had a delete disposition set or not. In fact, one can set and reset the disposition many times, and whoever called last to set the disposition to either TRUE or FALSE wins.
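To make the "last one wins" behavior concrete, here is a toy model in plain, portable C. It is not driver code and it does not come from any real file system: ToyStream, ToyHandle and the toy_* functions are names I made up for illustration, and the model ignores everything (sharing modes, access checks, link counts) except the one shared flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-stream state shared by every handle to the same file. */
typedef struct ToyStream {
    bool delete_pending;  /* models the delete disposition */
    int  open_handles;    /* handles that are still open */
    bool exists;          /* becomes false once the file is removed */
} ToyStream;

/* Hypothetical handle: it only remembers which stream it refers to. */
typedef struct ToyHandle {
    ToyStream *stream;
} ToyHandle;

ToyHandle toy_open(ToyStream *s)
{
    assert(s->exists);
    s->open_handles++;
    ToyHandle h = { s };
    return h;
}

/* Models a FileDispositionInformation request: it overwrites the shared
 * flag, so setting FALSE succeeds even if the flag was never set. */
void toy_set_disposition(ToyHandle h, bool delete_file)
{
    h.stream->delete_pending = delete_file;
}

/* Models reading DeletePending through FileStandardInformation. */
bool toy_query_delete_pending(ToyHandle h)
{
    return h.stream->delete_pending;
}

/* Only the last close looks at the flag and removes the file. */
void toy_close(ToyHandle h)
{
    ToyStream *s = h.stream;
    assert(s->open_handles > 0);
    s->open_handles--;
    if (s->open_handles == 0 && s->delete_pending) {
        s->exists = false;
    }
}
```

With two handles open on the same ToyStream, setting the disposition through one handle is visible through toy_query_delete_pending() on the other, clearing it through either handle cancels the delete, and the file only goes away if the flag is still set when the last handle is closed.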

So now let's talk about how this is implemented inside the file system. As you might have gathered from the above behavior, it looks like the delete disposition is a flag (which is why one can set it and clear it as many times as they want and the last one to change it wins). Also, since it's possible to set the delete disposition on one handle and read it on another handle, it must be a per-stream flag. And indeed, if we look at the FastFat sample in the WDK we can see that the function FatSetDispositionInfo() (under \src\filesys\fastfat\Win7\fileinfo.c) performs a bunch of checks to make sure it can delete the file, and if they all pass it does this (which confirms this is a flag in the FCB):

        SetFlag( Fcb->FcbState, FCB_STATE_DELETE_ON_CLOSE );
        FileObject->DeletePending = TRUE;

By looking at the FastFat source code we can see where FCB_STATE_DELETE_ON_CLOSE is used and get a pretty clear picture of all the places where the fact that a file is about to be deleted matters. However, let's just look at what happens during IRP_MJ_CLEANUP processing (the FatCommonCleanup() function in \src\filesys\fastfat\Win7\cleanup.c). The flag is largely ignored except during the last IRP_MJ_CLEANUP for the FCB (the UncleanCount == 1 check), when the file will be deleted if possible by calling FatDeleteDirent(), which removes the corresponding directory entry.

There is yet another way to delete a file. IRP_MJ_CREATE allows a caller to open a file and specify that the file should be deleted once the handle is closed. This is achieved through the FILE_DELETE_ON_CLOSE option. Looking at the create path in our FastFat source (starting from FatCommonCreate() in \src\filesys\fastfat\Win7\create.c) we can see that the option is translated into a CCB flag, CCB_FLAG_DELETE_ON_CLOSE. The CCB is unique per FILE_OBJECT, so basically the FILE_OBJECT remembers that it was opened with the FILE_DELETE_ON_CLOSE option. The question is, where is the CCB_FLAG_DELETE_ON_CLOSE flag converted into FCB_STATE_DELETE_ON_CLOSE? A quick search shows that this happens in the IRP_MJ_CLEANUP path. This has a set of interesting implications. For example, since the FCB flag isn't set until then, an IRP_MJ_QUERY_INFORMATION request with the FileStandardInformation information class will not report DeletePending as TRUE while the handle is still open. Also, trying to set the DeleteFile flag to FALSE will have no effect, since the FILE_DISPOSITION_INFORMATION structure only affects the FCB_STATE_DELETE_ON_CLOSE flag and not the CCB one.
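The interaction between the two flags is easier to follow in a small simulation. The sketch below is my own simplification in portable C, not the FastFat code: ToyFcb, ToyCcb and the toy_* functions are hypothetical names, and the cleanup logic only mirrors the two steps discussed above (promote the CCB flag, then act on the FCB flag during the last cleanup).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-stream state, standing in for the FCB. */
typedef struct ToyFcb {
    bool delete_on_close;  /* models FCB_STATE_DELETE_ON_CLOSE */
    int  unclean_count;    /* opens that have not seen cleanup yet */
    bool dirent_present;   /* false once the directory entry is removed */
} ToyFcb;

/* Hypothetical per-open state, standing in for the CCB. */
typedef struct ToyCcb {
    ToyFcb *fcb;
    bool    delete_on_close;  /* models CCB_FLAG_DELETE_ON_CLOSE */
} ToyCcb;

/* Models IRP_MJ_CREATE; the option is remembered only in the CCB. */
ToyCcb toy_create(ToyFcb *fcb, bool file_delete_on_close)
{
    assert(fcb->dirent_present);
    fcb->unclean_count++;
    ToyCcb ccb = { fcb, file_delete_on_close };
    return ccb;
}

/* Models FileDispositionInformation; it touches only the FCB flag. */
void toy_set_disposition(ToyCcb *ccb, bool delete_file)
{
    ccb->fcb->delete_on_close = delete_file;
}

/* Models DeletePending; it reports only the FCB flag. */
bool toy_query_delete_pending(const ToyCcb *ccb)
{
    return ccb->fcb->delete_on_close;
}

/* Models IRP_MJ_CLEANUP. */
void toy_cleanup(ToyCcb *ccb)
{
    ToyFcb *fcb = ccb->fcb;
    assert(fcb->unclean_count > 0);

    /* Promote the per-open flag to the per-stream flag. */
    if (ccb->delete_on_close) {
        fcb->delete_on_close = true;
    }

    /* Only the last cleanup actually removes the directory entry. */
    if (fcb->unclean_count == 1 && fcb->delete_on_close) {
        fcb->dirent_present = false;
    }
    fcb->unclean_count--;
}
```

In this model a handle created with file_delete_on_close set to true never shows up in toy_query_delete_pending() while it is open, and calling toy_set_disposition() with false on that handle does not save the file if it is the last one to be cleaned up. If a second handle is still open when the first one goes through toy_cleanup(), the flag is promoted to the ToyFcb and the second handle can then see it and clear it.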

So before this goes on for too long, I'll try to wrap this up by going over what this behavior means to filters. Please note that this behavior is the same for the NTFS file system, though according to this post on NTFSD this was not always the case. Anyway, here are the conclusions:

  • A file can be deleted by opening a handle and sending a FileDispositionInformation request with DeleteFile = TRUE.
  • Alternatively, a file can be marked for deletion when the handle is opened (atomically with the IRP_MJ_CREATE operation) by setting the FILE_DELETE_ON_CLOSE create option.
  • Anyone that has a handle to a file can check whether the delete disposition is set by querying for FILE_STANDARD_INFORMATION and checking the DeletePending flag.
  • If a file is opened with the FILE_DELETE_ON_CLOSE option then there is nothing a filter driver can do to undo that and clear it from the FILE_OBJECT. Moreover, if the filter driver didn't see the IRP_MJ_CREATE request then it cannot determine whether the request had the FILE_DELETE_ON_CLOSE option, and so it cannot know whether the file will be deleted when the handle is closed. However, a filter that sees the IRP_MJ_CREATE can remove the FILE_DELETE_ON_CLOSE option before sending the request down to the file system and then call FltSetInformationFile to set the delete disposition, which achieves behavior similar to what the user probably expects. Please note though that this is not identical to letting FILE_DELETE_ON_CLOSE through to the file system, and a filter implementing this approach might break some things that rely on that specific NTFS behavior (though no well-written code should rely on this particular implementation detail since it's not documented by Microsoft and so it could change in the future).
  • If a file is opened with FILE_DELETE_ON_CLOSE and there is another handle open for the same stream when it is closed, then the CCB flag will be promoted to the FCB flag during IRP_MJ_CLEANUP but the file will not be deleted yet. This means that in this case the filter (or whoever has the other handle open) can clear the flag if it wants to, thus preventing the file from actually being deleted.
  • An open that overwrites a file (IRP_MJ_CREATE with the create disposition FILE_SUPERSEDE, FILE_OVERWRITE or FILE_OVERWRITE_IF) can be considered a delete operation since the original file contents are lost, even though there will still be an entry in the file system with the same name.
  • A file can also be deleted by a rename operation (or a create hardlink operation), if the ReplaceIfExists member of the FILE_RENAME_INFORMATION (or FILE_LINK_INFORMATION) structure is TRUE. In this case the existing file at the target name will be removed without even being opened by the caller, similar in a way to the overwriting open case.
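Two of the conclusions above (FILE_DELETE_ON_CLOSE and the overwriting dispositions) come down to decoding the create parameters. Here is a small sketch of that decoding in portable C, written so it can be compiled outside the WDK. In a minifilter the packed value would come from Data->Iopb->Parameters.Create.Options, where the high 8 bits hold the create disposition and the low 24 bits hold the create options. The constants are redefined locally with the values I remember from the WDK headers, so please verify them against your headers; CreateDeleteInfo, classify_create() and strip_delete_on_close() are hypothetical helpers, not FltMgr APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as I recall them from the WDK headers; treat them as assumptions. */
#define FILE_SUPERSEDE        0x00000000u
#define FILE_OPEN             0x00000001u
#define FILE_CREATE           0x00000002u
#define FILE_OPEN_IF          0x00000003u
#define FILE_OVERWRITE        0x00000004u
#define FILE_OVERWRITE_IF     0x00000005u
#define FILE_DELETE_ON_CLOSE  0x00001000u

/* Hypothetical summary of the delete-related aspects of one create. */
typedef struct CreateDeleteInfo {
    bool delete_on_close;  /* FILE_DELETE_ON_CLOSE was requested */
    bool may_overwrite;    /* existing contents may be discarded */
} CreateDeleteInfo;

/* Splits the packed value into disposition and options and classifies it. */
CreateDeleteInfo classify_create(uint32_t options)
{
    uint32_t disposition    = (options >> 24) & 0xFFu;
    uint32_t create_options = options & 0x00FFFFFFu;

    CreateDeleteInfo info;
    info.delete_on_close = (create_options & FILE_DELETE_ON_CLOSE) != 0;
    info.may_overwrite   = disposition == FILE_SUPERSEDE ||
                           disposition == FILE_OVERWRITE ||
                           disposition == FILE_OVERWRITE_IF;
    return info;
}

/* Returns the packed value with FILE_DELETE_ON_CLOSE removed. */
uint32_t strip_delete_on_close(uint32_t options)
{
    uint32_t stripped = options & ~FILE_DELETE_ON_CLOSE;
    /* The disposition bits must be left untouched. */
    assert((stripped >> 24) == (options >> 24));
    return stripped;
}
```

A filter taking the approach from the fourth bullet would record the result of classify_create() in pre-create, send down the stripped value, and set the delete disposition itself later with FltSetInformationFile. A real filter that modifies the parameters also has to tell FltMgr about the change, which this sketch leaves out.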

NTFS makes things even more interesting because it has a larger feature set. In particular, hardlinks change the discussion a bit: if a file has multiple hardlinks then a delete removes only one of the links, and the file still exists on the volume under the other links. NTFS also supports Alternate Data Streams, and streams can be deleted independently from the whole file, but if the main data stream is deleted then the whole file is deleted along with all the other streams. And finally, NTFS supports transactions. This means that even if a file is deleted in a transaction and the last IRP_MJ_CLEANUP finds the delete disposition set and deletes the file (or a stream is deleted, or a file is overwritten by a rename, and so on), the transaction might roll back and the file will need to be put back the way it was before the transaction started. We'll see how the Delete filter handles these cases in the next post.