Thursday, December 30, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part III

An interesting topic when discussing creates is the context (thread and process context) in which the create happens. This isn't really interesting from the OS perspective (since the OS always receives the request in the context of the requestor) but from a filtering perspective. In the previous post we discussed how the OS takes the request and eventually sends an IRP to the file system. There are some things to note:

  1. CREATE operations must be synchronized by the OS. I think this is true for any stateful protocol (and stateless protocols don't really have a CREATE operation anyway). The CREATE operation simply means "hey everyone, there will be some requests in this context for this object so you'd better set up your contexts so you know what we're talking about when you get the next request". So the requestor can't really do anything until the request is complete since they don't even have a handle. This means that the IO manager will pretty much execute in a single thread and when it needs to wait for some other service (like the FS) it will send a request (the IRP_MJ_CREATE IRP) and wait for it to come back.
  2. The FS stack however is layered. The implication of this is that while the user can treat the CREATE operation as synchronous, the layers involved in processing that create can't. For file system filters (legacy and minifilters), there are 3 distinct steps:
    1. Before the request makes to the minifilter (before the preCreate callback is called)
    2. After the request is seen by the minifilter, but before the minifilter knows the request has been completed by the lower layers (after the preCreate callback but before the postCreate callback)
    3. After the minifilter knows the request has completed, but before the IO manager knows about it (after the postCreate callback)
    This is important to understand because there are certain limitations, depending on what each layer of the OS knows about the request. For example, during a preCreate callback, the IO manager knows someone wants to open a file but the FS doesn't yet know about that file. So even though the minifilter has a FILE_OBJECT structure (which comes from the IO manager), trying to use it to request something from the FS (like reading or writing or even queries) cannot work since the FS has not yet seen the request and has no idea what the FILE_OBJECT is supposed to represent (the information about which stream on disk the FILE_OBJECT will represent is stored in the create IRP and not in the FILE_OBJECT). In a similar fashion, during the postCreate callback the filter knows how the FS handled the request (whether it was a successful request or not) but the IO manager doesn't, so trying to call a function that involves the IO manager for that FILE_OBJECT (for example ObOpenObjectByPointer, which will create a HANDLE given an OBJECT) will fail.
  3. FltMgr will also synchronize IRP_MJ_CREATE requests for a couple of reasons. From a minifilter perspective, this is beneficial because it simplifies the model quite a bit. In general synchronized operations are somewhat simpler to handle in the postOp callback but synchronizing every operation will have a negative impact on the system. So FltMgr won't synchronize by default any operation except CREATE, where there is no negative impact because the IO manager synchronizes it already. While this is guaranteed by documentation, minifilters should still always return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK for IRP_MJ_CREATE just so this behavior is made obvious.
  4. This brings us to the most important point. FltMgr documentation mentions in a bunch of different places that the postCreate callback will be called in the same context as the preCreate callback. In some cases I've this statement being interpreted as "FltMgr guarantees that the postCreate will be called in the same thread where the user request was issued". However, this is not the case. FltMgr makes no guarantees about what thread the preCreate callback will be called on, just that it will call postCreate on the same thread. What can happen is that a filter (legacy or minifilter) can return STATUS_PENDING for an IRP_MJ_CREATE and the continue the request on a different thread, in a different process altogether. This is a legal option and what happens is that the filter below the filter that returned pending will have its preCreate callback called on the new thread, in the new process context. This is a brief example of what happens in this case (let's say the FS will return STATUS_REPARSE):
    1. The IO manager receives the CREATE request on Thread1 and issues an IRP_MJ_CREATE on the same thread.
    2. FilterA (let's say it's a legacy filter) sees IRP_MJ_CREATE request on Thread1 and pends it and then sends it down on a different thread, Thread2 .
    3. MinifilterB (below FilterA) sees the IRP_MJ_CREATE request (i.e. minifilter B's preCreate callback is called) on Thread2, where it queues the request and returns FLT_PREOP_PENDING.
    4. MinifilterB then dequeues the request on a different thread (Thread3) and it sends it down (calls FltCompletePendedPreOperation with FLT_PREOP_SYNCHRONIZE for example)
    5. The FS receives the IRP_MJ_CREATE on Thread3, processes and discovers it is a reparse point and so it returns STATUS_REPARSE.
    6. FltMgr's completion routine gets called on Thread3 and since FltMgr knows the operation is synchronized, it simply signals Event2.
    7. FltMgr resumes the operation on Thread2 where it was waiting for the event and calls the postCreate callback for minifilterB.
    8. Minifilter B does whatever processing it does for STATUS_REPARSE and returns FLT_POSTOP_FINISHED_PROCESSING.
    9. FltMgr completes the request (we're still on Thread2).
    10. FilterA's IoCompletion routine gets called on Thread2 and FilterA performs whatever processing it needs before completing the IRP.
    11. the IO manager's IoCompletion routine gets called (still on Thread2), but the IO manager is synchronizing the operation so it signals Event1.
    12. IO manager's wait on Thread1 returns so the IO manager can inspect the result of the call. Since the FS returned STATUS_PENDING, it might return back to OB and restart parsing from there… This in turn might come down the same path and issue a new IRP_MJ_CREATE on Thread1 and so on...
    Here is a picture of what this would look like.
As you can see, it is impossible for a filter to guarantee that its preCreate callback will be called on the thread of the original request. So what can a file system filter (or a file system) do ? Well, there are largely three reasons why a file system (or filter) might care about the context of a certain operation:
  • The operation refers to some buffer and the VA is only valid in the process context of the originator.
  • The operation refers to some other variable that is process specific (for example , a handle), like IRP_MJ_SET_INFORMATION with FileRenameInformation or FileLinkInformation, where the parameters contain a handle.
  • The operation needs to evaluate security so it needs to know who is the requestor for the operation.
IRP_MJ_CREATE doesn't care about user buffers or other process dependent variables (they are all captured before getting to the IO manager) so file systems and filters don't need to worry about that. However, security is a really big part of IRP_MJ_CREATE processing so filters often need to know who is requesting the operation. However, as I mentioned in the previous post in this series, the security context is captured in nt!ObOpenObjectByName and sent in the IRP parameters (Parameters.Create.SecurityContext) and so the file system and the filters can simply use the context there to decide who is requesting the operation.
In conclusion, the fact that a filter can't guarantee that it will be called in the context of the thread where the original request was issued doesn't matter much.

Thursday, December 23, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part II

Since we've discussed the concepts last time we can finally start looking at the debugger. Because we're mostly interested in the create operation from a filter perspective, I put a breakpoint on fltmgr!FltpCreate so that we can see exactly what the stack looks like when the request reaches a filter. Let's say we're trying open the file "C:\Foo\Bar.txt". Here is what the stack looks like.

00 9b5c5a70 828484bc fltmgr!FltpCreate
01 9b5c5a88 82a4c6ad nt!IofCallDriver+0x63
02 9b5c5b60 82a2d26b nt!IopParseDevice+0xed7
03 9b5c5bdc 82a532d9 nt!ObpLookupObjectName+0x4fa
04 9b5c5c38 82a4b62b nt!ObOpenObjectByName+0x165
05 9b5c5cb4 82a56f42 nt!IopCreateFile+0x673
06 9b5c5d00 8284f44a nt!NtCreateFile+0x34

In order to discuss the flow of the IO through the OS we're going to look at what each of these functions does.
nt!NtCreateFile
This is how the OS receives a request to open a file or a device (at this level there is no distinction between the two yet). NtCreateFile doesn't really do much, it's just a wrapper over an internal OS function (IopCreateFile). The file name here is something like "\??\C:\Foo\Bar.txt".
nt!IopCreateFile
This is the function to open a device (or a file) at the IO manager level. This is an internal function where most requests to open a file or a device end up (NtOpenFile, IoCreateFile and friends and so on). This is what happens here:
  1. The parameters for the operation are validated and checked to see if they make sense. Here is where STATUS_INVALID_PARAMETER is returned if you do something like ask for DELETE_ON_CLOSE but not ask for DELETE access… There are a lot of checks to validate the parameters, but no actual security or sharing checks.
  2. A very important structure is allocated, the OPEN_PACKET. This is an internal structure to the IO manager and it is the context that the IO manager has for this create. The create parameters are copied in initially. This is a structure that's available in the debugger:
    1: kd> dt nt!_OPEN_PACKET
        +0x000 Type             : Int2B
        +0x002 Size             : Int2B
        +0x004 FileObject       : Ptr32 _FILE_OBJECT
        +0x008 FinalStatus      : Int4B
        +0x00c Information      : Uint4B
        +0x010 ParseCheck       : Uint4B
        +0x014 RelatedFileObject : Ptr32 _FILE_OBJECT
        +0x018 OriginalAttributes : Ptr32 _OBJECT_ATTRIBUTES
        +0x020 AllocationSize   : _LARGE_INTEGER
        +0x028 CreateOptions    : Uint4B
        +0x02c FileAttributes   : Uint2B
        +0x02e ShareAccess      : Uint2B
        +0x030 EaBuffer         : Ptr32 Void
        +0x034 EaLength         : Uint4B
        +0x038 Options          : Uint4B
        +0x03c Disposition      : Uint4B
        +0x040 BasicInformation : Ptr32 _FILE_BASIC_INFORMATION
        +0x044 NetworkInformation : Ptr32 _FILE_NETWORK_OPEN_INFORMATION
        +0x048 CreateFileType   : _CREATE_FILE_TYPE
        +0x04c MailslotOrPipeParameters : Ptr32 Void
        +0x050 Override         : UChar
        +0x051 QueryOnly        : UChar
        +0x052 DeleteOnly       : UChar
        +0x053 FullAttributes   : UChar
        +0x054 LocalFileObject  : Ptr32 _DUMMY_FILE_OBJECT
        +0x058 InternalFlags    : Uint4B
        +0x05c DriverCreateContext : _IO_DRIVER_CREATE_CONTEXT
     
    This structure is pretty important to the flow of the IO operation but there is no way to access it as a developer so it's going to be just an important concept to remember later on.
  3. Finally, since we've copied all internal parameters and all the IO manager has at this point is an OB manager path (in the ObjectAttributes paramater to the call), it must call the OB manager to open the device (ObOpenObjectByName, see below).
  4. After ObOpenObjectByName returns this function cleans up and returns.
nt!ObOpenObjectByName
This the call to have the OB manager create a handle for object when we know the name. This isn't a public interface since 3rd party drivers only need to open objects that have their own create or open APIs (for example ZwCreateFile, ZwOpenKey, ZwOpenSection, ZwCreateSection, ZwOpenProcess and so on). Another thing to note about the OB APIs is that they fall largely into two classes:
  1. Functions that reference objects (that just operate on the reference count of objects), like ObReferenceObject, ObReferenceObjectByName and ObReferenceObjectByPointer.
  2. Function that create handles to object in addition to referencing them (which is called an "open"), like ObOpenObjectByName and ObOpenObjectByPointer.
Anyway, this is roughly what goes on in here:
  1. Capture the security context for this open, so that whoever needs to open the actual object can perform access checks. This also means that the file system itself doesn't rely on the thread context being the same and instead uses the context captured here. So minifilters should to the same when they care about the security context of a create.
  2. Call the actual function that looks up the path in the namespace (ObpLookupObjectName, see below)
  3. If ObpLookupObjectName was able to find an object then a handle is created for that object (since this was an open type function).
nt!ObpLookupObjectName
This is the function where the OB manager actually looks in the namespace for the path it needs to open (which at this point is still "\??\C:\Foo\Bar.txt"). One thing to note is that the OB namespace has a hierarchical structure, with DIRECTORY_OBJECT types of objects that hold other objects. The root of the namespace ("\") is such a DIRECTORY_OBJECT.
Anyway this is what happens in this function. The parsing starts at the root at the namespace, "\". This is a loop until we find the final object to return to the user or find that there is no object by that name (and therefore fail the request):
  1. If the current directory is the root directory then check if the name starts with "\??\" and make it point to the \GLOBAL?? directory. This is a hardcoded hack in IO manager (which is why calling "!object \" in WinDbg doesn't show a "??" folder). (so our name becomes "\GLOBAL??\C:\Foo\Bar.txt")
  2. Find the first component in the path (which is GLOBAL??) in the current directory.
  3. If the component found is a DIRECTORY_OBJECT, open it and continue parsing from that point using the rest of the name (in our case, "C:\Foo\Bar.txt" is the remaining name). Continue the loop with remaining path.
  4. if the object has a parse procedure, call that parse procedure and give it the rest of the path. if the parse procedure returns STATUS_REPARSE (and it hasn't reparsed too many times already), start again at the root of the namespace with the new name returned by the parse procedude. Otherwise the parse procedure should either return STATUS_SUCCESS and return an object or a failure status.
Some notable things are:
  • OB will do a case sensitive or a case insensitive search of the OB namespace, depending on the OBJ_CASE_INSENSITIVE flag that is passed into the OBJECT_ATTRIBUTES, which is why it's important to set this correctly when calling FltCreateFile in a filter (specifically from a NormalizeNameComponent callback) since if it's not correctly set the request might not make it down the IO stack at all
  • the OB namespace uses symlinks quite a lot. OB symlinks are a special type of object that has a string member that points to a different point in the namespace, and a parse procedure:
    0: kd> dt _OBJECT_SYMBOLIC_LINK
     nt!_OBJECT_SYMBOLIC_LINK
        +0x000 CreationTime     : _LARGE_INTEGER
        +0x008 LinkTarget       : _UNICODE_STRING
        +0x010 DosDeviceDriveIndex : Uint4B
     
    So in our example, when OB gets to "\GLOBAL??\C:" it discovers it is a symlink and it calls the parse procedure with the rest of the remaining name ("\Foo\Bar.txt"). In The symlink for "\GLOBAL??\C:" points to "\Device\HarddiskVolume2" and the symlink's parse procedure concatenates that name with the remaining path that it got and so the new name after the symlink is "\Device\HarddiskVolume2\Foo\Bar". See this:
    0: kd> !object \GLOBAL??\C:
     Object: 96f7f188  Type: (922b7f78) SymbolicLink
         ObjectHeader: 96f7f170 (new version)
         HandleCount: 0  PointerCount: 1
         Directory Object: 96e08f38  Name: C:
         Target String is '\Device\HarddiskVolume2'
         Drive Letter Index is 3 (C:)
     
    The parse procedure of a symlink always returns STATUS_REPARSE.
  • Once we get to the "\Device\HarddiskVolume2\Foo\Bar.txt" path, while parsing OB will find that "\Device\HarddiskVolume2" is a DEVICE_OBJECT type of object and that it has a parse procedure. The parse procedure for a DEVICE_OBJECT is IopParseDevice, so that function gets called.
  • Another thing to note that there is a limit to the number of times OB will reparse and each time it sees a STATUS_REPARSE counts against that limit (so it doesn't matter whether it was a reparse from a symlink or a DEVICE_OBJECT, everything counts). So it is possible to reparse to the point where OB won't reparse anymore.
nt!IopParseDevice
The name here is just "\Foo\Bar.txt" and the parse procedure gets a reference to the device where the path should be searched. This is where the difference between a file and a device becomes relevant. If there is no remaining path, this is treated as an open to the device. If there is a path, then this is assumed to be a file (or directory) open. This is a pretty involved function with many special cases. However, there are only a couple of steps that we're going to talk about:
  1. Get the context for this create, which is the OPEN_PACKET structure from before. This works because the OPEN_PACKET is IO manager's structure passed from IopCreateFile to IopParseDevice. This is important because this is a nice way to have context across calls through other subsystems (OB manager) and still keep context that is opaque to those subsystems. This isn't always the case unfortunately and whenever two subsystems share the same structure the architecture gets complicated.
  2. Check to see if a file system is mounted on this device and if not then mount it.
  3. Process the device hint if there was any.
  4. Allocate the IRP_MJ_CREATE irp
  5. Allocate the FILE_OBJECT that will represent the open file.
  6. Call the FastIoQueryOpen function (which minifilters see as the IRP_MJ_NETWORK_QUERY_OPEN). The IRP parameter to this call is the IRP that was just allocated.
  7. If the FastIoQueryOpen didn't work, send the full Irp to the file system stack by calling IoCallDriver.
  8. Wait for IRP to complete (i.e. the IRP is synchronized by the IO manager).
  9. If the request was a STATUS_REPARSE, then first check if it is a directory junction or a symlink and do some additional processing for those. Anyway, copy the new name to open from the FILE_OBJECT (the actual name to open is passed in and out this function through a parameter).
  10. If the status from the Irp was not a success status or it was a STATUS_REPARSE, cleanup the FILE_OBJECT and release the references associated with it. The irp is always released anyway.
  11. Return the status. If this was successful, the FILE_OBJECT will be the one used to represent the file.

This is a pretty high level view of the process but it should explain why some of the things we're going to talk in future posts work the way they do.

Thursday, December 16, 2010

About IRP_MJ_CREATE and minifilter design considerations - Part I

This is the first in a series of posts where I'll try to address various common questions about IRP_MJ_CREATE. My plan is to address the following topics:

  • What exactly is it that IRP_MJ_CREATE creates ? (a bit of rambling on one of my favorite topics, operating systems design)
  • Why is there no IRP_MJ_OPEN ? Surely MS could afford one more IRP :)...
  • Flow of a file open request through the OS.
  • What is the difference between a stream and a file from an FS perspective
  • What does STATUS_REPARSE do ?
  • What is name tunneling ? How does it affect creates ?
  • How to open the same stream as an existing FILE_OBJECT in a name-safe way.
  • What are stream file objects and why are they necessary ?
  • Various strategies to redirect a file open to a different file.
  • How to track a create when reparsing ?

In order to address this properly, I'd like to explain some things about operating systems. This is a rather dry topic but in my opinion the things I'm going to talk about are fundamental for understanding not only how IRP_MJ_CREATE works, but also why it works the way it does.

There are many ways to define an operating system but for this topic I think that a very useful way to describe it is as a hardware abstraction layer. It is a library of functions combined with a machine abstraction. As such, OS code is pretty much dedicated to either "abstract stuff that people use a lot" (allocate memory, create a window, draw strings and so on) or "hardware interaction code" (talk to the disk, talk to the memory controller hardware, talk to the graphics hardware). As such it should come as no surprise that the kernel part of OS is designed around interaction with hardware (as opposed to the user mode part which in general implements more abstract services).

File systems (and the whole file system stack including legacy filters and minifilters) are "higher level drivers" (since they don't usually talk to hardware directly). However, they must fit into the OS model which is built around hardware. This is why file system still create device objects and when calling FltGetFileNameInformation the name it returns starts with "\Device\....".

One other very important concept that plays into why IRP_MJ_CREATE functions the way it does is that the OS itself is implemented as a set of "services". Each service has its own protocol, usually described by an API set (the memory manager has it's own command set, the object manager has its own set and so does the IO manager). Most (if not all) of these protocols are stateful. The caller issues an "initialize" command (ExAllocatePool, ZwCreateFile, FltRegisterFilter) and they get back a more or less opaque handle (for ExAllocatePool, the pointer serves as the handle; ZwCreateFile -> an actual handle; FltRegisterFilters -> a PFLT_FILTER pointer and so on) and they can then issue additional commands that require that handle to be passed in (ExFreePool, ZwReadFile, FltStartFiltering). For stateful protocols the service (or server) has a blob of data that describes the internal state of each object and based on that data it knows how to satisfy each request. The opaque handle is a key that helps the service find that data. For example, for ExAllocatePool the internal data blob is the nt!_POOL_HEADER, for ZwCreateFile the context is pretty much a set of granted access rights for that handle and a reference to the FILE_OBJECT and for FltStartFiltering the FLT_FILTER structure. From this point on I'll call that blob of data a context (as in MM's context, IO manager's context, FltMgr's filter context). For services that already provide support for caller defined contexts (like FltMgr) I'll use the terms "internal context" and "user's context" to differentiate the two. The conclusion here is that any stateful protocol must have some context in the service (or server) side that the service can use to keep track of the state of communication with the client.

The important thing I wanted to get to is that sometimes some operations require multiple OS components to work together to satisfy a user request and as such multiple contexts might need to be created by each component. For example, for a ZwCreateFile call there might need to be created some of the following contexts: a handle, a FILE_OBJECT, a FltMgr internal context, some minifilter contexts, one or more file system contexts and a couple of MM contexts (where all the other contexts will be stored).

So with all these things in place, we can start talking about IRP_MJ_CREATE. As I said above, the OS has an abstract interface which consists mainly of OBJECTs for various things. When someone needs to talk to a device (physical or a virtual device, like a file system; anything that can be represented internally by a DEVICE_OBJECT), the OS context is a FILE_OBJECT. So in other terms, the FILE_OBJECT simply represents the state associated with the OS communicating to a DEVICE_OBJECT. The "create" word in ZwCreateFile and IRP_MJ_CREATE simply refers to FILE_OBJECT itself. There is no IRP_MJ_OPEN because there is no way to open an existing FILE_OBJECT. In order to get a FILE_OBJECT one must either create it or already have a reference to it (pointer or handle) and must call either ObReferenceObject or ObReferenceObjectByHandle to get another reference to that FILE_OBJECT.

The next topic, which is the flow of a create operation through the OS is pretty long so I'll save for next week. In the mean time please fell free to let me know what other topics related to the IRP_MJ_CREATE path you have that you'd like to address.

Thursday, December 9, 2010

More on IRPs and IRP_CTRLs

Sometimes I see posts on discussion lists about how a callback is not being called for some operation that a minifilter registered for. In most (possibly all) cases it turns out that that's not what the problem is and that the callback is in fact called, it's just that the poster can't tell it happened. It's happened to me a couple of times, but since I have a lot of confidence in FltMgr (having worked on it and all) I start of with the assumption that it must be something I'm doing wrong.


However, I've been wondering why people seem so keen on assuming that they don't get to see the callback for minifilters. And then I've realized that it might have something to do with the fact that minifilters use a callback model whereas the NT IO model is call-through. I'll talk a bit the call-through model and the limitations it has. I'll start with a brief refresh of the NT IO model and then explain the limitations and how the minifilter model tries to address them. Then I'll explain some of the downsides and how to work around them.


When an IO request (open a file, read or write and so on) reaches the IO manager, the information about the request is put in an IO request packet (IRP). Then the IO manager calls the driver that should process that IRP by calling IoCallDriver. There may be multiple drivers needed in order to complete a single operation, for example when the user opens a remote file so the IO request goes to a file system which then needs to send something to the network, so now there are at least two drivers involved in this. One could design the OS so that the drivers could go back to the IO manager and let it dispatch the request to the appropriate driver again or let the two drivers communicate directly. NT was designed to let the drivers communicate directly. Moreover, in many cases it one request may pass through many drivers that make up an IO stack (like the file system stack or the storage stack or the network stack), where each driver performs a specific role. So the IRP is potentially modified by each driver and sent to the next driver by calling IoCallDriver.


This is a call-through model. In the debugger it can sometimes look like this (please note that the IRP model allows the request to be completely decoupled from the thread but in practice you still see a lot of cases where a lot of drivers simply call the next driver in the same thread):


1: kd> kn
 # ChildEBP RetAddr  
00 a204bb10 828734bc volmgr!VmReadWrite
01 a204bb28 963bc475 nt!IofCallDriver+0x63
02 a204bb34 963bc548 fvevol!FveRequestPassThrough+0x31
03 a204bb50 963bc759 fvevol!FveReadWrite+0x4e
04 a204bb80 963bc7a9 fvevol!FveFilterRundownReadWrite+0x197
05 a204bb90 828734bc fvevol!FveFilterRundownWrite+0x33
06 a204bba8 9639a76e nt!IofCallDriver+0x63
07 a204bc88 9639a8a5 rdyboost!SmdProcessReadWrite+0xa14
08 a204bca8 828734bc rdyboost!SmdDispatchReadWrite+0xcb
09 a204bcc0 965a0fd9 nt!IofCallDriver+0x63
0a a204bce8 965a12fd volsnap!VolsnapWriteFilter+0x265
0b a204bcf8 828734bc volsnap!VolSnapWrite+0x21
0c a204bd10 960b091c nt!IofCallDriver+0x63
0d a204bd1c 828a711e Ntfs!NtfsStorageDriverCallout+0x14
0e a204bd1c 828a7215 nt!KiSwapKernelStackAndExit+0x15a
0f 981c964c 828c711d nt!KiSwitchKernelStackAndCallout+0x31
10 981c96c0 960af939 nt!KeExpandKernelStackAndCalloutEx+0x29d
11 981c96ec 960b05a6 Ntfs!NtfsCallStorageDriver+0x2d
12 981c9730 960af0a0 Ntfs!NtfsMultipleAsync+0x4d
13 981c9860 960ae0a6 Ntfs!NtfsNonCachedIo+0x413
14 981c9978 960af85f Ntfs!NtfsCommonWrite+0x1ebd
15 981c99f0 828734bc Ntfs!NtfsFsdWrite+0x2e1
16 981c9a08 9605f20c nt!IofCallDriver+0x63
17 981c9a2c 9605f3cb fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x2aa
18 981c9a64 828734bc fltmgr!FltpDispatch+0xc5
19 981c9a7c 82a74f6e nt!IofCallDriver+0x63
1a 981c9a9c 82a75822 nt!IopSynchronousServiceTail+0x1f8
1b 981c9b38 8287a44a nt!NtWriteFile+0x6e8
1c 981c9b38 828798b5 nt!KiFastCallEntry+0x12a
1d 981c9bd4 82a266a8 nt!ZwWriteFile+0x11

So here we can see how a write (ZwWriteFile) goes through FltMgr, NTFS, volsnap, rdyboost, fvevol and volmgr (where I set my breakpoint for this blog post).


One big problem with this approach is that the size of the kernel stack in NT is pretty small (depends on the architecture and so on but it's something like 12K or 20K..) and so if there are enough drivers, each of them using some stack space then it is possible to run out of stack. This in fact happens in some cases (AV filters were notorious for using a lot of stack) and the outcome is a bugcheck. Please note that in the example above, most filters were just letting the request pass through them, without necessarily doing anything to it. So they still use stack space even if they don't care about the operation at all…


Another problem with this approach is that it is almost impossible to unload a driver because very often each driver remembers which driver they need to send the IRP to next, so they are either referencing it (so it will never go away) or just using it without referencing it and so immediately after it goes away there is a bugcheck.


FltMgr's main goal was designed to increase system reliability (yeah, making file system filters development easier was just a secondary objective) and it tried to address this issue by making the minifilter model a callback model. This addresses both problems. Unloading a minifilter works because now each filter doesn't need to know which is the next filter to call and so the only component that must reference a minifilter is FltMgr, which then allows a minifilter to go away by informing only FltMgr about it.


The way this takes care of stack usage is a bit more interesting. When the minifilter callback is done it returns to FltMgr a status that instructs FltMgr whether they want to be notified when the request completes or not (or a couple of other statuses) but that's it. The stack space associated with the call to the minifilter's callback (the stack frame) is released and can be reused. This is why in the stack above, the IRP simply goes from IO manager to FltMgr and then to the filesystem. It doesn't matter how many minifilters were attached to the volume, they all use no stack space at all at this time.


Now, let's look in more detail at filter manager's stack frame. There are no minifilters functions on the frame because they all returned nicely to FltMgr and no longer use any stack space. This is the most confusing thing about this, that the minifilters cannot be seen on the stack so it looks like they have never been called at all… However, now that we know that FltMgr must have called some minifilters, is there a way to see which minifilters were called and so on ? In a previous post I explained that FltMgr has an internal structure that wraps the IRP called the IRP_CTRL (also known as a CALLBACK_DATA), and all the information about the request is stored in there. FltMgr clearly must remember the IRP_CTRL associated with this IRP someplace, but where ?


1: kd> kbn
 # ChildEBP RetAddr  Args to Child              
...
16 981c9a08 9605f20c 93460958 94301bf8 00000000 nt!IofCallDriver+0x63
17 981c9a2c 9605f3cb 981c9a4c 93460958 00000000 fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x2aa
18 981c9a64 828734bc 93460958 94301bf8 94301bf8 fltmgr!FltpDispatch+0xc5
19 981c9a7c 82a74f6e 93715f80 94301bf8 94301dac nt!IofCallDriver+0x63
...


Well, it turns out that there is another very useful structure called the IRP_CALL_CTRL, which is a structure that associates an IRP and an IRP_CTRL and other context that FltMgr keeps for the operation:


1: kd> dt 981c9a4c fltmgr!_IRP_CALL_CTRL
   +0x000 Volume           : 0x932f1008 _FLT_VOLUME
   +0x004 Irp              : 0x94301bf8 _IRP
   +0x008 IrpCtrl          : 0x93591de0 _IRP_CTRL
   +0x00c StartingCallbackNode : 0xffffffff _CALLBACK_NODE
   +0x010 OperationStatusCallbackListHead : _SINGLE_LIST_ENTRY
   +0x014 Flags            : 0x204 (No matching name)

From here we can see the IRP_CTRL pointer and call my favorite extension, !fltkd (I get a complaint on my current symbols about how the PVOID type is not defined, which I've edited out):



1: kd> !fltkd.irpctrl 0x93591de0

IRP_CTRL: 93591de0  WRITE (4) [00000001] Irp
Flags                    : [10000004] DontCopyParms FixedAlloc
Irp                      : 94301bf8 
DeviceObject             : 93460958 "\Device\HarddiskVolume2"
FileObject               : 93715f80 
CompletionNodeStack      : 93591e98   Size=5  Next=1
SyncEvent                : (93591df0)
InitiatingInstance       : 00000000 
Icc                      : 981c9a4c 
PendingCallbackNode      : ffffffff 
PendingCallbackContext   : 00000000 
PendingStatus            : 0x00000000 
CallbackData             : (93591e40)
 Flags                    : [00000001] Irp
 Thread                   : 93006020 
 Iopb                     : 93591e6c 
 RequestorMode            : [00] KernelMode
 IoStatus.Status          : 0x00000000 
 IoStatus.Information     : 00000000 
 TagData                  : 00000000 
 FilterContext[0]         : 00000000 
 FilterContext[1]         : 00000000 
 FilterContext[2]         : 00000000 
 FilterContext[3]         : 00000000 

   Cmd     IrpFl   OpFl  CmpFl  Instance FileObjt Completion-Context  Node Adr
--------- -------- ----- -----  -------- -------- ------------------  --------
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591fb8
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591f70
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591f28
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [0,0]    00000000  00   0000   00000000 00000000 00000000-00000000   93591ee0
     Args: 00000000 00000000 00000000 00000000 00000000 0000000000000000
 [4,0]    00060a01  00   0002   9341d918 93715f80 9608e55e-2662d614   93591e98
            ("FileInfo","FileInfo")  fileinfo!FIPostReadWriteCallback 
     Args: 00020000 00000000 003a0000 00000000 92fc6000 0000000000000000
Working IOPB:
>[4,0]    00060a01  00          9341d918 93715f80                     93591e6c
            ("FileInfo","FileInfo")  
     Args: 00020000 00000000 003a0000 00000000 92fc6000 0000000000000000

Here we can see what the minifilter stack looks like and that the FileInfo minifilter wanted a postOp callback for this operation. Another thing we can do is this (using the FLT_VOLUME pointer from the IRP_CALL_CTRL):



1: kd>  !fltkd.volume 0x932f1008

FLT_VOLUME: 932f1008 "\Device\HarddiskVolume2"
   FLT_OBJECT: 932f1008  [04000000] Volume
      RundownRef               : 0x00000074 (58)
      PointerCount             : 0x00000001 
      PrimaryLink              : [9334f404-932ad9b4] 
   Frame                    : 930adcc0 "Frame 0" 
   Flags                    : [00000064] SetupNotifyCalled EnableNameCaching FilterAttached
   FileSystemType           : [00000002] FLT_FSTYPE_NTFS
   VolumeLink               : [9334f404-932ad9b4] 
   DeviceObject             : 93460958 
   DiskDeviceObject         : 932b2320 
   FrameZeroVolume          : 932f1008 
   VolumeInNextFrame        : 00000000 
   Guid                     : "" 
   CDODeviceName            : "\Ntfs" 
   CDODriverName            : "\FileSystem\Ntfs" 
   TargetedOpenCount        : 55 
   Callbacks                : (932f109c)
   ContextLock              : (932f12f4)
   VolumeContexts           : (932f12f8)  Count=0
   StreamListCtrls          : (932f12fc)  rCount=2630 
   FileListCtrls            : (932f1340)  rCount=0 
   NameCacheCtrl            : (932f1388)
   InstanceList             : (932f1058)
      FLT_INSTANCE: 94114498 "luafv" "135000"
      FLT_INSTANCE: 9341d918 "FileInfo" "45000"


From here we can tell that there are in fact two minifilters attached to this frame , luafv and fileinfo. We knew about fileinfo from the IRP_CTRL, but what about luafv ? Did it even get called ? Well, unfortunately the only thing we can know for sure is that luafv was registered with fltmgr and attached to this volume. They might not have a callback registered for WRITEs or that callback was called but they returned FLT_PREOP_SUCCESS_NO_CALLBACK, so fltmgr didn't use a completion node for it so there is no record of it… We can look at the filter and see the registered callbacks, but we might not be able to find a record of whether the callback was actually called..

Thursday, December 2, 2010

More thoughts on FltDoCompletionProcessingWhenSafe and minifilter completion in general

I promised in the last post that I'd talk about how FltDoCompletionProcessingWhenSafe can deadlock. I've never actually seen such a deadlock so I've spent some time thinking about it and I went over various scenarios but in the end I couldn't find anything specific to FltDoCompletionProcessingWhenSafe.

However, thinking about deadlocks in the completion path there is a way a deadlock can happen anyway, so I'll write about that instead and explain how I think this works with FltDoCompletionProcessingWhenSafe :).

There are some drivers that take the approach of queuing up requests and then using one or more threads to dequeue the requests and process them. In theory this can happen anywhere, in a minifilter, in the file system and in the storage stack. In fact the ramdisk sample in the WDK is implemented using such a queue (at least, as far as I can tell, WDF is not my forte). Anyway, the point to remember is that this is a fairly common design strategy, possibly even more so with storage drivers.

This will be easier to explain with an example, so I'll describe a possible architecture for a storage driver. This driver mark all requests as pending, it queues them to an internal queue, it then releases a semaphore (or some similar mechanism) after which it returns pending to the caller. The driver also has one thread that waits on the semaphore and then when it is signaled it dequeues one request and processes it synchronously (it waits for it to complete), after which it calls IoCompleteRequest and goes back to waiting. Pretty simple, right ? For this discussion I'll simplify things by making the storage driver never actually call IoCompleteRequest at DPC so that is not an issue..

Now, here is where a minifilter enters the picture. Let's say I need a minifilter that performs some sort of logging and after each successful operation (or unsuccessful, it doesn't matter I'm just trying to find something plausible a minifilter would do) it writes a record to a log file. So their postOp routine does something like this:

If (NT_SUCCESS(status)) FltWriteFile(..., logEntry, ..);

Now, let's say that because the minifilter write expects that there are multiple threads writing at the same time, it is easier to open the log file for synchronous IO and not worry about maintaing the current byte offset and so on. Which means that it will issue a synchronous write (if no CallbackRoutine is provided when calling FltWriteFile then the write will be a synchronous one).

If these implementations happen to meet on a machine, here's how a deadlock might happen:

User Thread (issuing a read operation for example) :
1. Minifilter gets called and it wants to log the operation and so it returns FLT_PREOP_STATUS_SUCCESS_WITH_CALLBACK
2. The file system receives the operation and doesn't do much (let's say it's a small non-cached read) and sends it down to the storage device.
3. The storage device pends the IRP_MJ_READ and adds it to the queue.

Storage Driver Thread
1. Get notification about the pended IRP_MJ_READ and dequeue it
2. Perform the operations associated with the request (read from an internal buffer, queue a DMA transfer or do whatever it is that storage drivers do when they need to read data :)).
3. Call IoCompleteRequest on the IRP_MJ_READ
4. The file systems' IoCompletionRoutine gets called, which doesn't do much and returns STATUS_SUCCESS
5. The minifilter's postOp callback gets called
6. The minifilter calls FltWriteFile(…logEntry….)
7. FltMgr sends an IRP_MJ_WRITE to the file system.
8. The Storage Driver gets an IRP_MJ_WRITE and it queues it and returns STATUS_PENDING.
9. FltMgr gets the STATUS_PENDING and since the caller wanted a synchronous write, it waits for the IRP to complete.. However, since this is the Storage Driver Thread already, it will never dequeue the request and it will deadlock.

Now, this might look like a pretty forced scenario (which it is :)), but it's to describe what the problem looks like. So now let's discuss how a more "real-world" scenario would look like and how some different design decisions might affect this outcome:

  • What if the storage driver had multiple threads (can we blame the writer of the storage driver)? Clearly this would help the scenario. But then even when there are multiple threads, there are some operations that likely need to be synchronized. For example, maybe the storage driver can perform multiple reads but only one write at one time.. This would solve the issue because the minifilter would issue the request from one of the reader threads and it would wait for the writer thread.. But what if the minifilter did the same thing for IRP_MJ_WRITEs ? The problem is still there. 
  • What if the driver supports multiple threads for both reads and writes ? Well, there is likely some operation that requires synchronization. For example, for a VHD storage driver (a dynamic VHD extends in blocks so when a new block is needed, metadata operations need to happen so some synchronization is required) might have multiple threads for IRP_MJ_READs and IRP_MJ_WRITEs but if the IRP_MJ_WRITE is an extending one (i.e. when a new block must be allocated), it might still queue the IRP_MJ_WRITE to a single "extending write" processing thread. So now the deadlock would happen only when the user's write would require the VHD to extend and when the minifilter's log write is also an extending one.. 
  • And even if there are multiple threads that are completely independent, if there are enough simultaneous requests or if there are enough minifilters blocking those threads, this might still happen.
  • What if the minifilter issued an asynchronous request and just waited for it for complete ? Well, this is largely equivalent to issuing a synchronous request so the issue is still there.

It might seem that this scenario simply can't work and issuing a write from a completion routine is always deadlock prone but there some things that could fix this problem, so let's talk about them as well:

  • The minifilter could issue a completely asynchronous request and NOT WAIT for it. This can work for logging since it might not matter when the logging happens, so the minifilter doesn't actually need to wait. But what if the minifilter is not just logging but is doing something that simply must complete before the original request completes ? Then the minifilter can simply issue the asynchronous request and return FLT_POSTOP_MORE_PROCESSING_REQUIRED and then in the CompletionRoutine complete the request. This would work because when FLT_POSTOP_MORE_PROCESSING_REQUIRED is returned, control is returned to where IoCompleteRequest was called, which was right the Storage Driver Thread called IoCompleteRequest. So now the Storage Driver Thread is no longer blocked and can go back to processing more IO (this is very similar to what FltDoCompletionProcessingWhenSafe does).
  • What if the minifilter doesn't want to issue an asynchronous request, since synchronous requests are much easier to handle ? Then the minifilter could queue the synchronous request to a worker thread and return FLT_POSTOP_MORE_PROCESSING_REQUIRED and have the worker thread complete the user's request after the synchronous request it issued completes.
  • And yet another approach a minfilter can take is to return FLT_PREOP_SYNCHRONIZE instead of FLT_PREOP_SUCCESS_WITH_CALLBACK. This means that once the request is completed in the storage driver, FltMgr will simply acknowledge that completion and not block that thread at all. This has the added benefit of executing in the context of the original request, which is usually a much better idea for minifilters that need to do complicated things in their postOp routines.

Now, the reason this is not specific to FltDoCompletionProcessingWhenSafe is because I already asserted that the storage driver never completes a request at DPC so calling FltDoCompletionProcessingWhenSafe is unnecessary. However, even if the storage driver could call IoCompleteRequest at DPC, FltDoCompletionProcessingWhenSafe would simply return STATUS_MORE_PROCESSING_REQUIRED so the thread where IoCompleteRequest would not be blocked. Besides, that thread would likely be an arbitrary thread anyway (since completion at DPC usually happens in whatever thread happened to be running when the request was completed by the hardware.. ) . Anyway, there are other more complicated reasons why this in fact simply can't happen when the thread actually completes at DPC (or at least I don't think so) but I won't go into that now.

However, one thing to keep in mind is that if completion doesn't actually happen at DPC, FltDoCompletionProcessingWhenSafe doesn't do anything more than call the user's completion function inline so the deadlock I described above can still happen.

So I guess the bottom line is that the warning that provoked this post should in fact be something more like :
Caution   To avoid deadlocks, minifilters should not perform synchronous requests from a postOp callback and should instead either:

  • queue the operation and return FLT_POSTOP_MORE_PROCESSING_REQUIRED from the postOp callback or
  • return FLT_PREOP_SYNCHRONIZE from the preOp

I hope this makes sense. Please feel free to comment on anything I might have missed (since this is a pretty complicated scenario and I haven't in fact ever seen this in practice so it's all hypothetical :) ).

Friday, November 12, 2010

Some thoughts on FltDoCompletionProcessingWhenSafe

I've been meaning to talk about this for a while. There is a warning in the MSDN page for FltDoCompletionProcessingWhenSafe which is pretty interesting:

Caution   To avoid deadlocks, FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack, such as the following:
• IRP_MJ_READ
• IRP_MJ_WRITE
• IRP_MJ_FLUSH_BUFFERS

Let's start by looking a bit at how file systems handle requests. There are multiple ways in which file systems can complete user requests, but largely they fall into a few cases. I'd like to point out that I'm simplifying things here, there are many ways in which file systems might handle operations and the same goes for storage devices… What I'm describing is not an exhaustive list of how things happen in a file system and storage stack, but rather a plausible way in which they can happen in some file systems in some cases:
• Synchronous - when all the data is readily available then the file system doesn't need to do any additional steps and can just perform the operation and return to the caller. For example, when setting the delete disposition on a file, the file system only needs to access the FCB and set the flag (because the delete disposition is a flag on the FCB). If the file system can acquire the FCB immediately it can just set the flag to whatever disposition the caller wanted, release the FCB and call IoCompleteRequest. When this happens the completion routines (and the postOp callbacks for minifilters) are actually called in the same thread as the original operation, at the same IRQL (which is very likely at PASSIVE_LEVEL)...
• Queued (asynchronous) - this happens when the file system realizes it can't complete the operation immediately and it needs to pend the request and complete it when some condition occurs. There are a lot of cases when this happens, for example when the file system needs to acquire some resource and it doesn't want to wait for it inline. Another case where this is pretty much the only course of action is when the caller registers notifications for something (oplocks, directory changes and such) and the IRP gets pended. In these cases, the postOp callbacks will be called generally in the context of the thread that released the resource or that did something to trigger the notification (acknowledge an oplock break, rename a file and so on). This is usually a different thread from the original thread the request came in, and usually the IRQL is <= APC_LEVEL.
• Forwarded - this can happen when the file system needs to get some data from the storage device and it simply forwards the request the underlying device. For example let's say that a user wants to read some aligned data from a file. The file system might simply calculate where the data begins on disk (by consulting its allocation maps which we'll assume are cached so no reading from the device is necessary), change the offset in the IRP_MJ_READ parameters to the right sector where the data is located, then lock the buffer in memory and then call IoCallDriver. When this request will be satisfied by the storage stack, it will call IoCompleteRequest and the file system will pretty much not do anything (or free some resources or some such) and then let the request go up. In this case, the thread in which the postOp callback gets called is the thread that was running when the disk IO was completed by the device (the IO will be completed in an interrupt, which will likely queue a DPC, which will then execute in whatever thread context the CPU happened to be running when the interrupt triggered) and at DPC_LEVEL.

Now, in a lot of cases the file system will need to perform a bunch of things in response to one single user request. For example, a request to write something might mean the file system will need to do at least the following (please ignore the order of the operations here):
• Write the data
• Update the last access time
• Update the file size
All these changes need to be saved to different places on disk (usually, it really depends on the filesystem) so the request might be pended by the file system while it issues a bunch of different IO requests to the storage device and when all of them complete it can complete the request. So in most cases operations are a combination of queued and forwarded operations.

The reason I went into all of this was because I wanted to make this point: in most cases, the postOp callback will be called at DPC only if the operation required one or more IOs to be sent to the storage device and the filesystem didn't need to synchronize the operation back to some internal thread and instead simply had a passthrough completion routine (see FatSingleAsyncCompletionRoutine in the FASTFAT sample ). The file system will not usually complete an operation at DPC in other cases (again, different file system do things differently so it MIGHT still happen).

Now, this means that the either warning or the function are useless, because the only reason the FltDoCompletionProcessingWhenSafe exists is to enable minifilters to write completion routines that use functions that require being <= APC_LEVEL and not worry about whether the postOp callback is called at DPC. So if according to the warning, "FltDoCompletionProcessingWhenSafe cannot be called for I/O operations that can be directly completed by a driver in the storage stack", then this is like saying that FltDoCompletionProcessingWhenSafe cannot be called for operations that might be completed at DPC_LEVEL, which is the only case where it is useful.

I'll talk about the actual deadlocks in a post next week.

Thursday, November 4, 2010

ObQueryNameString can return names with a NULL Buffer ( and an example with SR.sys)

ObQueryNameString is a very useful API. It's used in a lot of places and is a pretty good choice if you want to find the name for an OB object. However, using it is not without pitfalls. At the moment the documentation page on MSDN has this to say in the Remarks section: "If the given object is unnamed, or if the object name was not successfully acquired, ObQueryNameString sets Name.Buffer to NULL and sets Name.Length and Name.MaximumLength to zero.". What is not clearly spelled in there is the fact that the return status in this case will be STATUS_SUCCESS.

So let's recap. Any app developer can call ObQueryNameString and get STATUS_SUCCESS, but the Name.Buffer will be NULL and they might not expect that. I've seen this issue over and over again. People get a reference to an object and they query the name, get a NULL buffer and then try to read/compare/do whatever with it and they get a visit from the bugcheck fairy. Please note that since the Length and MaximumLength are both 0, people would be safe using the Rtl functions since they tend to check these sort of things..

It is interesting to understand the reasons about how people can get bitten by this as well. The documentation specifies that "If the given object is unnamed, or if the object name was not successfully acquired...", which I guess for most people it translates into "if the name was not successfully acquired then I will some error NTSTATUS… if this object is unnamed then it's not clear what I get, maybe also some error code?..". So I suppose that people that create named objects that they own (or objects that the system creates and are guaranteed to be named) imagine that they can never get the NULL buffer and STATUS_SUCCESS. But any named object can become unnamed when it is deleted. After all, the namespace entry is simply an additional reference to the object and deleting a named object simply deletes that reference, but the object might still be kept around by other references. One easy way to see this is to follow the calls to IoCreateDevice. For example, for an unnamed device one can see this :

Imediately after a IoCreateDevice for an unnamed device:

3: kd> !devobj 93602e48  
Device object (93602e48) is for:
  \FileSystem\FltMgr DriverObject 92cb6660
Current Irp 00000000 RefCount 0 Type 00000003 Flags 00000080
DevExt 93602f00 DevObjExt 93602f30 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

3: kd> !object 93602e48  
Object: 93602e48  Type: (922d6440) Device
    ObjectHeader: 93602e30 (new version)
    HandleCount: 0  PointerCount: 1

And immediately after a named device:
2: kd> !devobj 930e0628  
Device object (930e0628) is for:
 FltMgr \FileSystem\FltMgr DriverObject 92f691e8
Current Irp 00000000 RefCount 0 Type 00000008 Flags 000000c0
Dacl 96fd2eec DevExt 00000000 DevObjExt 930e06e0 
ExtensionFlags (0x00000800)  
                             Unknown flags 0x00000800
Device queue is not busy.

2: kd> !object 930e0628
Object: 930e0628  Type: (922d7508) Device
    ObjectHeader: 930e0610 (new version)
    HandleCount: 0  PointerCount: 2
    Directory Object: 96e61948  Name: FltMgr

Please notice how the pointer count is different. Once the named device is deleted (IoDeleteDevice), the reference from the OB namespace is removed (and the object's name in the OB header is changed) and then, when the reference count eventually reaches 0, the object is freed. However, it anyone calls ObQueryNameString on one of those references, they will get the NULL Name.Buffer...

So it is perfectly possible for a driver that is working with an object that it knows it must be named to actually get in the window between when the object is removed from the OB namespace and the time the final reference is released (the driver will of course have a reference of its own in order to be able to access the object… ). So that this means is that calling ObQueryNameString might return STATUS_SUCCESS and a NULL Name.Buffer even for a named object.

I've recently had the pleasure to debug an issue with SR.sys and my virtual volume drive on XP SP3. I will share it since it was somewhat interesting and it points out to this specific issue. This is what the stack looks like:

1: kd> lm v m sr
start    end        module name
f8489000 f849af00   sr         (pdb symbols)          d:\symbols\sr.pdb\9D5432B7234C4CD2A8F6275B9D9AF41F1\sr.pdb
    Loaded symbol image file: sr.sys
    Image path: sr.sys
    Image name: sr.sys
    Timestamp:        Sun Apr 13 11:36:50 2008 (480252C2)
    CheckSum:         00012604
    ImageSize:        00011F00
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
The problem in SR is this one:
sr!SrGetObjectName+0xd4:
f849105c ff15c49b48f8    call    dword ptr [sr!_imp__ObQueryNameString (f8489bc4)]  <- call ObQueryNameString
f8491062 3bc3            cmp     eax,ebx   <- check for STATUS_SUCCESS
f8491064 894514          mov     dword ptr [ebp+14h],eax  <- save the status...
f8491067 7c24            jl      sr!SrGetObjectName+0x105 (f849108d)
f8491069 0fb707          movzx   eax,word ptr [edi]   <-  this is the Length member of the UNICODE_STRING for the name
f849106c 8b4f04          mov     ecx,dword ptr [edi+4]  <- this is the Buffer member of the UNICODE_STRING..
f849106f d1e8            shr     eax,1   <- calculate the number of characters instead of the number of bytes
f8491071 66897702        mov     word ptr [edi+2],si   <- write some value in MaximumLength… 
f8491075 66891c41        mov     word ptr [ecx+eax*2],bx  <-   write in the buffer a 0 (basically, make sure the string is NULL terminated).. But ECX can be NULL
The stack when I hit this problem looks like this:
1: kd> kbn
 # ChildEBP RetAddr  Args to Child              
00 f80b7944 f849440d 00000000 81dc0a18 e10eac08 sr!SrGetObjectName+0xed
01 f80b7990 f848ecf2 81dc0a18 8239a818 f80b79c0 sr!SrCreateAttachmentDevice+0x99
02 f80b79c4 f848ee0f 8239a818 8239a8d0 81fb4d48 sr!SrFsControlMount+0x2e
03 f80b79e0 804ef18f 8239a8d0 81fb4c90 81fb4c90 sr!SrFsControl+0x4b
04 f80b79f0 80581bc7 00000000 81dc0a18 806e6a4c nt!IopfCallDriver+0x31
05 f80b7a40 804f53d6 c000014f f80b7b00 00000000 nt!IopMountVolume+0x1b9
06 f80b7a70 80582bc0 81e1f268 81dc0a18 f80b7ba4 nt!IopCheckVpbMounted+0x5e
07 f80b7b60 805bf444 81dc0a18 00000000 81fc6600 nt!IopParseDevice+0x3d8
08 f80b7bd8 805bb9d0 00000000 f80b7c18 00000040 nt!ObpLookupObjectName+0x53c
09 f80b7c2c 80576033 00000000 00000000 00000001 nt!ObOpenObjectByName+0xea
0a f80b7ca8 805769aa 009bef80 00100001 009bef24 nt!IopCreateFile+0x407
0b f80b7d04 8057a1a9 009bef80 00100001 009bef24 nt!IoCreateFile+0x8e
0c f80b7d44 8054161c 009bef80 00100001 009bef24 nt!NtOpenFile+0x27
0d f80b7d44 7c90e4f4 009bef80 00100001 009bef24 nt!KiFastCallEntry+0xfc
0e 009beef4 7c90d58c 7c80ec86 009bef80 00100001 ntdll!KiFastSystemCallRet
0f 009beef8 7c80ec86 009bef80 00100001 009bef24 ntdll!NtOpenFile+0xc
10 009bf1f0 7c80ef87 01be31e8 00000000 01be7bf0 kernel32!FindFirstFileExW+0x1a7
11 009bf210 751b1e05 01be31e8 01be7bf0 751a2a04 kernel32!FindFirstFileW+0x16
12 009bf240 751aad1f 009bf714 00000001 000e1358 srsvc!Delnode_Recurse+0x12e
13 009bfb34 751abd1f 009bfd54 7c97b440 7c97b420 srsvc!CEventHandler::OnFirstWrite_Notification+0x3cd
14 009bff60 7c927ba5 00000000 0000006a 000e5f40 srsvc!IoCompletionCallback+0x17a
15 009bff74 7c927b7c 751abba5 00000000 0000006a ntdll!RtlpApcCallout+0x11
16 009bffb4 7c80b713 00000000 00000000 00000000 ntdll!RtlpWorkerThread+0x87
17 009bffec 00000000 7c910230 00000000 00000000 kernel32!BaseThreadStart+0x37
So as you can see, in the mount path SR.sys is trying to create their device to attach to the volume and while doing that it tries to get the name for this device:
1: kd> !devobj 81dc0a18 
Device object (81dc0a18) is for:
  \Driver\IvmVhd DriverObject 81fad590
Current Irp 00000000 RefCount 1 Type 00000007 Flags 00000050
Vpb 81ea0f10 Dacl e1f17924 DevExt 81dc0ad0 DevObjExt 81dc0c30 
ExtensionFlags (0x00000002)  DOE_DELETE_PENDING
Device queue is not busy.
This happens to be my virtual volume device, which as you can tell from the DOE_DELETE_PENDING flag, is about to be torn down. So what this all looks like is this:

1. Something is trying to open a file, see the IopCreateFile (frame 0xb)

2. Io manager, while trying to send the IRP_MJ_CREATE irp (frame 7) wants to make sure the volume is mounted. Please note that at this point the volume device is still in the OB namespace, since otherwise the ObpLookupObjectName call (frame 8) would not have been able to reference it… So at this point IO manager resolved the name to a device object and it now has a reference to the device...

3. IopCheckVpbMounted (frame 6) finds the volume is not mounted (since I dismount it before tearing it down) so it tries to mount it…

4. SR.sys gets the mount request and is trying to build a device to attach to the newly mounted volume (in case the mount succeeds). This is pretty standard stuff for a legacy filter… Anyway, in doing so it calls ObQueryNameString which no longer finds a name for the device and returns a NULL buffer. SR checks for NT_SUCCESS but doesn't check the buffer to make sure it's not null (or even the length which is 0) and it blindly tries to make sure the string is NULL terminated (which is also pointless since ObQueryNameString documentation mentions that "The object name (when present) includes a NULL-terminator and all path separators "\" in the name.")… bugcheck.

What my driver did was simply call IoDeleteDevice somewhere between frame 8 and frame 0.

I'm willing to bet that not checking for the null Name.Buffer is a pretty common mistake. For example, there is some code posted on a blog that looks like this:

Status = ObQueryNameString(FileObject->Vpb->RealDevice,OBI,Returned,&Returned ); 

if(NT_SUCCESS(Status)) { 
if(Root) { 
wcscat(OBI->Name.Buffer,L"\\"); 
OBI->Name.Length+=sizeof(WCHAR); 
} 

Thursday, October 28, 2010

Useful Models - how choosing the right abstraction can help design and some useful abstractions for working with minifilters

The poll on the site indicated this was the topic most people were interested in so here it is.

I find myself quite often in the position of trying to explain why something doesn't work the way someone expects it would. I guess this is due in large part that the work I do (storage and file systems) is something that people interact with quite often but in fact operates quite differently than the abstraction it presents to the users. I've mentioned this in my other posts anyway…

So in order to explain why some architecture won't work, I try to find an analogy or a model that would immediately make the problem obvious. Some of these models are very dependent on the problem I'm dealing with while some others I keep reusing. Some of the models are obviously not practical, but they highlight a certain features of the system. It would be nice if these models could be implemented as actual tools (like Driver Verifier) but the reality is that in some cases the effort to write something like this would not justify the benefits… So I guess most of them will remain in the realm of thought experiments but they can be useful nevertheless...

I'll go through a list of commonly asked questions and the models that I find help explain the problem. I'm sure most of the readers of this post could contribute their own examples so please do so through the comments.

Q: Why not send the file name directly to our minifilter from a service or some other user mode program ?
A: it really depends on the other minifilters on the system. The model here is a minifilter that implements ALL of the namespace perfectly, with file IDs and hardlinks and so on, at its level, and below itself it keeps a flat structure where all streams are identified by GUIDs and there are not directories. If your minifilter happens to be below such a filter then obviously the name of the file at your level (which is a GUID) has absolutely nothing to do with the name the user mode service sees (which can be a regular path). Now, it must be said that any minifilter that does anything like this to the namespace would be in the virtualization group, so if you are above the virtualization group you don't have this problem. But if you are IN or below the virtualization group, then you must take this into account.

Q: Why not communicate with my minifilter through a private communication channel and have it open and read files on behalf of my service ?
A: if you are in or below the virtualization group, see the example above. If you are below the AV group, then you should always think about malware. Let's say you do something very benign, like open your own file and read some configuration data (as opposed to opening and parsing or executing random user files). If there is a vulnerability with your parsing code, this allows someone to write a file based exploit targeting your product and no AVs will be able to see your accesses to the file and catch the vulnerability. Unfortunately, there isn't a good generic malware model so you need to construct your own every time you need to explain why bypassing some security measure is not a good idea…

Q: Why not create a back-up of a VHD file while the volume is mounted ? (which is another way of saying "why not try to read the data on a mounted volume by directly accessing the sectors ?").. This is a question that's not really related to file systems but to the storage stack.. However, I find a lot of people are confused about this and keep trying to read mounted volumes.
A: the model I find helps is that of a volume with a file system on top that on volume mount reads everything into memory and then it only writes the odd bytes (byte 1, 3, 5 and so on) of anything and keeps the even bytes in a cache, until it gets either a flush or a dismount. This makes immediately visible what would happen if you tried to read it. However, once I mention this people immediately ask whether we could flush and then take a snapshot, but then I point out that immediately after the flush the system might already have received some writes and then only the odd bytes have been written so you need a way to guarantee that no more writes happen on the file system, and the only way to do that is to dismount it.

Probably the most powerful model that exposes a lot of issues with filters (not only file system, any filters of any component really) is the "filter attached on top of itself" model. This is important because in general anything you can do in your filter someone else can do in theirs. For example, let's say the discussion is whether creating a new FSCTL that is currently unused and sending it down the FS stack to your filter is a good idea (spoiler: it's not). In the general case this wouldn't work with your filter attached twice, since all the IOCTLs will be captured by the top filter. This might not be an obvious problem (because depending on what the filter should do with the IOCTL , it might still work fine), but then consider that someone else can write a filter just like yours using the same IOCTL derived through the same mechanism and then you can expect more serious problems. So in this particular case you would want to make sure to either use a communication mechanism guaranteed to deliver messages directly to your filter like a control device or (if using a minifilter) communication ports. The same applies for file names (what if there already is a file with that name?) and other named resources.. Thinking about what would happen if your filter would be attached on top of itself is always an interesting thought experiment and highly recommended since it will expose potential problems with your design. Once you know what the problems are you can decide about how likely it is to happen and whether you should address the issue..


I thought I had more models and I should have done a better job at keeping track of them but I can't remember anymore right now. I will update the post when I do.

Thursday, October 21, 2010

Filtering in the Windows Storage Space

This post assumes that reader had some knowledge about the IO subsystem in windows.

The file system stack is simply a set of drivers between the IO manager and the file system (including the file system). These drivers are usually referred to as file system filters. In general the file system is the component that implements the hierarchy of files and directories and perhaps an additional set of features (like byte-range locking or hardlinks and so on). The file system filters usually add some functionality on top of what the file system provides (such as encryption or replication or security (think anti-virus scanners), quota management and so on). Most of these features could be implemented at any of these layers (for example, byte-range locking is usually done in the file system, but a filter can do it as well…). The decision is usually driven by customer requirements and even in the OS itself some things are done in filters, so that customers that don't need the feature don't pay the price.

For a pretty complete list of types of things file system filters can do, one can take a look at the list here. Of course, this is not a complete list, but still it shows how rich the ecosystem really is. I remember hearing that an average user on a Windows machine is running around 4 or 5 file system filters, usually without even realizing it.

The interface between the IO manager and the file system is very rich and complex. There are very many rules and everything is asynchronous which makes things very complicated. On top of this, while there is support in the NT model for filtering, it doesn't really provide some of the facilities that file system writers need (for example, there is not a lot of support for getting the name of a file or for attaching context to a certain file). This is where minifilters comes in. The minifilter infrastructure was written to primarily address some things that almost all file system filter need, without really changing the filtering model too much (which is why I'm avoiding the phrase "minifilter model" since it doesn't really change the IO model much, it just adds some features to it). This is all implemented via a support driver called filter manager. Filter manager is a legacy filter that is a part of the operating system and it provides things such as :
1. Support for contexts
2. An easier model for attaching to a volume
3. Easier model for file name querying
4. Support for unloading filters
5. Predictable filtering order
6. Easier communication between a user mode service and a driver.

Some of these are just nice features (like context support, where a legacy filter can still reliably implement their own scheme if they want) while some are downright impossible in the legacy model (for example, it used to be very problematic to make sure that an anti-virus filter would not be loaded below an encryption filter (which would make scanning files useless)).

The numbers that I've heard were that a legacy filter needs about 5000 lines of (very complicated and highly sensitive) code to just load and do nothing. With the minifilter model I'd say less than 50 are necessary, and most of them are just setting up structures and such.

Of course, a legacy filter can do all a minifilter can because filter manager itself is a legacy filter and it doesn't use private or undocumented interfaces. However, since the minifilter model is supported on all platforms since Windows 2000 there is really no reason for anyone developing a new filter to write a legacy filter. At least, that's my view. There are some people who disagree with this statement (as with any other model in fact) but the fact is that Microsoft is moving towards making the legacy model obsolete.

It is important to note that the storage infrastructure consists of two big parts, the file system stack and the disk stack. The disk stack deals with IO that is issued by the file system. The file system stack encapsulates all the complexity of operating with files and folders and such and issues just sector reads and writes. The disk stack has no concept of byte range locks, files and so on. What is deals with are sectors. The types of filters in this space are categorized about what they filter (disk, partition or volume) as well as the functionality they provide (encryption, compression, replication and so on). For example filters can offer things like volume snapshots, full volume encryption or full disk encryption, volume or partition replication, performance monitoring at all levels and so on.

As you can see, the storage subsystem is very rich and most of the time filters play a huge role in it (at least in the Windows world, where one can't just modify the source to add features to an operating system component). However, with so many ways to do things it is sometimes hard to know what architecture is best suited for a certain type of problem, and unfortunately selecting the wrong one can have a huge impact on the cost and complexity of a project.

Monday, September 20, 2010

Namespaces (part 1) - the OB namespace

I've been getting a lot of hits to my page about name usage in file system filters so I've decided to expand on the subject of names a bit further. This blog post is more about software design (and especially about OS design) and less about file system filters.

The role of language in shaping the way we think is a very interesting subject and one I've been interested in for a while. The book "Language in Thought and Action" is a very good introduction to the subject. One of the ideas in the book is that the mapping of names to objects changes the way we think about the object. While this is true to a certain extent in programming (think about how often you heard the phrase "well, this API would have been better named BlahBlah …"), computer science as a discipline has a completely new class of problems that I'd like to focus on in this post. The problems associated with actually designing namespaces. I'm not sure why designing and identifying namespaces isn't as popular in computer science circles as other concepts like indirection and variable scope because it's at least as important.

I don't think writing a formal definition of a namespace would actually be very interesting so I'll go straight to some examples of namespaces.

Probably the best known one is the file system namespace. The main elements of this namespace are file and directory names and the namespace serves to map file paths to streams of bytes. Also quite well known is the registry and it serves a very similar purpose. For people writing kernel mode drivers in windows also a pretty familiar one is the object manager namespace (or the OB namespace), where object names are used to identify kernel objects.

In some operating systems users are used to see and work with some other namespaces grafted into the main OS namespace (in windows users don’t usually see the OB namespace, but it can be explored using tools like WinObj ). For example, the storage devices namespace, the COM ports namespace or the running processes namespace.

For developers some familiar namespaces are the types namespace and the variables namespace (in the compiler).

But there are others even more interesting. For example, a namespace doesn't have to use ASCII or UNICODE strings to identify objects. If one were to use numbers, like 1,2,3 and so on the namespace would be an array. Similar, the process handles form a namespace, where the handle is used as the name. By now it's probably pretty clear that any key-value type of structure is a namespace. Even memory is a namespace as well, where the name is the address.

Now that we have some examples of namespaces we can look at some choices the designers of these namespaces made and what is the impact of those choices on the way they are used.

First, let's look at the object manager namespace in windows (which, as I said before, I'll refer to as the OB namespace).

I'll start by listing some of the properties of this namespace. The names in the OB namespace are UNICODE strings. As is usually the case with namespaces where the names are strings, the namespace implements a hierarchy of names and it is public. Some interesting features are that it supports links from one point in the namespace to another part and that it supports objects that don't have a name (we could treat anonymous OB objects as a different namespace but that's not particularly interesting).

Support for anonymous objects is by far the choice with the biggest impact because it means that whoever implements the namespace can't use the fact that the object is removed from the namespace as an indication that the object needs to be deleted. So they must use some different technique to track object usage and in the case of OB that technique is reference counting. From a user's perspective this means they have to do the little dance that involves increasing the reference count before sharing the object with anyone and decreasing the reference count when they're done using it. It also means that removing an object from the namespace (a delete) can happen immediately on an object (as opposed to it happening when the object is closed, like in file systems). Another implication of this architecture is that it's hard keeping logs of things because an object might not always have a name, so how does one log it ? The memory address doesn't usually convey any information about the object.

The fact that a namespace supports links is also quite interesting. The designer needs to decide whether they support links to directories in the namespace or just links to "leaves" (like files). For example NTFS supports hardlinks only between files, not directories. The OB namespace however supports links to directories, which means the OB namespace can contain loops. So the designer must come up with a way to deal with potential loops in the namespace. Another interesting implication is the fact that the caller might need to remember which way they arrived at an object in the namespace (the path to that object) in a way that takes links into account. The OB namespace doesn't do that but it is required for some features (like file system symlinks) so the users of the namespace must implement that themselves.

One final characteristic is that the namespace is hierarchical. Hierarchical namespaces have some advantages from the perspective of the implementer since they allow grouping objects that belong together. The main advantages are security and support for isolation. A flat namespace on the other hand is easy to implement, but it's very limited as it is basically just a hash.

To get a better picture of the implications of implementing a hierarchical namespace versus a flat one, let's consider some namespaces that don't support hierarchies, like the named synchronization primitives namespace in windows (events, mutexes and so on). It's easy to get name collisions so each Windows application must make sure it's using a name that no one else is using. And then from a security perspective there is no way to limit listing them. Basically, you can either prevent someone from seeing any of the names or allow them to see all the names. Access control is possible, by only on a case by case basis, and there usually isn't a way to inherit security permissions from another object.

The isolation part is also pretty important. For example, consider the fact that Windows supports sessions. If helps to keep those resources that are semantically linked into a directory, so they can be easily enumerated and operated on (even if they are just links to the actual object). Isolation is really useful in virtualization because the user of that part of the namespace doesn't necessarily see all the available objects, just the ones they're supposed to see.

This is getting pretty long so I'll stop here and talk about the file system namespace in a different post. If there is enough interest I might talk about other namespaces like the processes namespace (please leave some comments if this sounds interesting to you).

Saturday, September 18, 2010

I'm back

Hello everyone, I'm sorry I've been neglecting this blog for the past couple of month, a lot of things have changed and I've been really busy trying to adjust. I'm still not quite there yet but I'll try to do a better job with the blog from now on.

The good news is that I've been thinking about all sorts of things that I think would make good posts and so I should have some new material coming up.

In the mean time, please feel free to let me know if you have any suggestions for future topics and i'll do my best.

Thursday, February 11, 2010

Context Usage in Minifilters

I’m not sure why but in spite of there being pretty good documentation and even a sample available, the topic of how Contexts work and how filters should use them comes up a lot.

There are a couple of rules that govern contexts and pretty much everything follows from the interaction between these rules (this applies to all contexts). Please note that this is more of a design discussion and the implementation might be slightly different:

  1. When the reference count on the context gets to 0, the memory is freed.
  2. Any pointer to the context needs to have a corresponding increment on the reference count. This is done transparently when the filter requests the context via one of the functions (FltAllocateContext(),FltGetXxxContext(),FltReferenceContex() and even FltSetXxxContext() and so on).
  3. A context needs to be linked to the underlying structure (i.e. StreamContext  to the stream, StreamHandleContext to the FILE_OBJECT, VolumeContext to the volume and so on…; please note that we are talking from a design perspective, the implementation of exactly which structure has the pointer to the context might be different, but this is irrelevant for this discussion).

This is pretty much it. I’d like to walk through the most common scenarios and explain how those rules apply:

A filter allocates a context (FltAllocateContext) and it gets a pointer to a context (refcount 1). The context is not linked to anything at this point in time. If the filter calls FltReleaseContext, the refcount will drop to 0 and the context will be freed. If the filter tries to attach the context to the structure (say by using FltSetStreamContext – i’ll use StreamContexts for the rest of the discussion, and the underlying structure in this case is the SCB (Stream Context Block; or, for file systems that don’t support multiple streams per file (aka alternate data streams), the FCB)), then there are three cases:

  1. It succeeds. refcount is now 2, one for the link from the SCB and the other one is the one the filter has. 
  2. It fails and the filter doesn’t get a context back (for whatever reason: memory is low or the filter passed a NULL pointer for OldContext or there is some other failure). In this case there is still only one pointer, the one the filter has, so the refcount needs to be 1.
  3. It fails and the filter gets another context back (there already was a context attached and OldContext was not NULL). Now the filter has two contexts, the original context that it has allocated which has a refcount of 1 (only the filter has a pointer to it) and a new context (though the name is OldContext), with a reference count of at least 2 (because there are at least two pointers to it, one from the underlying structure, the SCB, and one that was just returned in OldContext so the filter can use it – there could be other references from other threads, but to keep things simple we will ignore those). The filter will need to release the original context it has allocated because it can’t use it (and since the refcount was 1 this will drop it to 0 and will free it). The filter will also need to eventually release the reference it got on OldContext, after using it (which will drop it back to 1, which represents the pointer from the SCB to the context).

Before we go any further i want to discuss what a filter can do when getting a context fails for whatever reason (this includes allocation failures and failing to set or get the context). Some filters can simply ignore that object (for example, a filter trying to log file IO might make a note in the log that IO to file X will not be logged and that’s that). Other filters might work in a degraded mode (for example, an anti-virus filter that is trying to be smart about scanning a file when it’s closed might want to remember whether there was any write to the file. If it fails to get a context it might scan the file anyway… performance might be worse but it will still work). And yet another case is where a filter might simply not be able to work when it doesn’t get a context. In that case the filter might want to allocate and initialize the context early enough so that the operation can be failed, usually in the Create path so in case the allocation fails the filter can fail the Create and the file won’t be opened at all.

Yet another thing to mention is that if a filter needs to use a context at DPC (let’s say in postWrite) then the context needs to be allocated from nonpaged pool and since the context functions are not callable at DPC the recommended way is to get the context (which might involve allocating it and attaching it) in the preOperation callback and pass it through the CompletionContext to the postOperation callback which can use it and then call FltReleaseContext to release the reference (yeah, even at DPC if the context is allocated from nonpaged pool).

One might wonder why the strange dance with the OldContext and NewContext. Couldn’t the filter just check if there is a context and only allocate one if there isn’t one ? Well, of course it could but because the environment is asynchronous, multiple threads might be doing the same thing at the same time, and they will all check if there is a context, find none, allocate a context and set it so now you have 10 threads each trying to attach an SCB with a different context… So the context operation needs to be a CAS (CompareAndSwap) so that only one thread succeeds.

Thus filter should not really start using a context that it allocated until it actually manages to attach it to the structure. However, immediately after the context is attached another filter might get to it, so it needs to be in some defined state, otherwise the other filter will get an invalid context (more on this later). The steps need to be something like this (this is pretty much the logic in CtxFindOrCreateStreamContext in the ctx sample in the WDK):

  1. context = FltGetStreamContext()..
  2. If we didn’t get one:
    1. context = NULL
    2. FltAllocateContext (NewContext)
    3. Initialize context to whatever default values make sense. Please note that those values need to take into account the current reference as well.. I’ll explain more below.
    4. FltSetStreamContext(NewContext, OldContext)
    5. If it failed:
      1. FltReleaseContext(NewContext) –> no point in keeping it around. Since we had the only reference refcount was 1 and it dropped to 0 so it will be freed.
      2. If we got OldContext, context = OldContext
      3. else, we didn’t get OldContext but we also couldn’t attach our context for some reason – the filter needs to continue without a context, whatever that means… (and no, KeBugCheck is not a good idea :)… )
    6. if it didn’t fail –> context = NewContext
  3. At this point context points to the context to use. If it is null, something went wrong and we should bail… (we could bail here or in 2.5.3., doesn’t matter). By bail i mean we should either fail the operation or popup a warning to the user or mark somewhere that we missed one so the results are not reliable anymore.. doesn’t matter.
  4. do things with context….
  5. FltReleaseContext(context) -  here we release our context. We can do this later, for example if we get the context in the PreOperation callback we might want to pass it via the CompletionContext to the PostOperation and release it there. Or we could queue some work item and pass it in a context and have the work item release it. Anyway, once the context gets release the reference count on the structure will drop back to 1 (for the link from the SCB to the context).

In step 2.3. i said that the context needs to be initialized to whatever values make sense, but IT MUST take into account the current reference. Well, this is not always needed but it depends on the particular design of the filter (it’s usually needed though so keep reading). Consider a filter that uses a StreamContext to keep track of how many threads it has doing IO on a Stream and is using a handle that the filter opened via FltCreateFileEx2. Let’s say that when the count gets to 0 the filter will call FltClose on the filter’s handle. Now let’s imagine a case where in step 2.3. the filter simply initializes the count to 0. The logic would be something like this:

  1. context = GetStreamContext(); // allocate new context or get the existing one. also get a handle by calling FltCreateFileEx2(…) if needed
  2. context->Count++
  3. Do things on context->Handle
  4. context->Count--;
  5. If (context->Count == 0) then FltClose(context->Handle).
  6. FltReleaseContext(context);

Do you see the problem here ? What happens if there are two threads, T1 and T2, and T1 allocates the new context, initializes so that context->Count is 0 (which means, it is initialized to a default value that doesn’t take into account the current reference) and then it sets the context (refcount is 2, 1 for T1 and one for the underlying SCB), before getting to step 2. it gets preempted by T2, which starts at the top. T2 will get a context (refcount is 3, 1 for T1, 1 for T2 and one from the SCB), it will increment the count (so context->Count is 1), it “does things”, then it decreases the count in step 4. (so context->Count is now 0) and then step 5. will proceed to close the handle. Step 6. will release the context (so refcount drops back to 2). Then when T1 resumes it will be at Step 2. and it will again increase the context->Count to 1 (from the wiki link above, this is a manifestation of the ABA problem), then it will do things on context->Handle which has been closed….. And there you have it… This could have been avoided if GetContext() actually initialized the newly allocated context Count to 1. This complicates things a bit because Step2. now might need to only be called when the context was not allocated in this thread, meaning that Step2. will probably need to move in GetContext() and so on..

Another thing worth mentioning is that once the underlying object is torn down, the link from it to the context will be broken (i.e. the pointer from the underlying object will go away), so the reference count will need to be decremented. In most cases where there are no outstanding references (there are no other threads using the context) the refcount will go to 0 and the context will be freed (and the filter’s context teardown callback will be called, if one was registered). There are a couple of implications this has. If a filter simply allocates a context, associates it with an object and then calls FltReleaseContext() (which is the normal way to set up a context), the filter doesn’t need to do anything else to make sure the context goes away. It will be torn down when the underlying object is torn down.

The other thing that follows from the fact that the context is tied to the lifetime of the underlying object is that if a filter can never leave the context in a bad state assuming that it will go away, because the underlying object might hang around for a while and get reused, reviving the context. For example, for a StreamContext where a filter has pointer to an ERESOURCE and an open count, it would be a mistake to free the ERESOURCE when the open count gets to 0 under the assumption that once the last handle goes away the SCB will go away as well, because that might not be true. The file system might cache the SCB and if a new open to the same files comes along the file system will reuse the cached SCB, which means that the filter will get a context that has an invalid pointer to an ERESOURCE. So in this case the right place to free the ERESOURCE is in the context teardown callback.

Finally, the last thing i want to mention is what FltDeleteContext does. FltDeleteContext unlinks the context from the underlying object. So if a filter decides it no longer needs a context associated with a stream (for example), it will need to do something like this:

  1. context = FltGetStreamContext();
  2. If (context != NULL)
    1. FltDeleteContext(context);
    2. FltReleaseContext(context);

At this point it should be obvious that FltDeleteContext needs to be called before FltReleaseContext (because since FltReleaseContext will release the reference associated with the context it is not safe to use context at all after calling it; FltDeleteContext will only remove the reference from the underlying object, if it is set, not doing anything about the current reference). Please note that after FltDeleteContext unlinks the StreamContext, any thread trying to get it from the object will not find it. This means that the filter should not try to use it in a meaningful way since other threads might not see the changes. Basically, once FltDeleteContext was called, the filter should simply call FltReleaseContext…

I hope this makes sense. If nothing else it should be useful when you have trouble sleeping (i almost fell asleep twice while proofreading it)

Verifier Checks: A filter has completed an operation but does not implement name provider callbacks

FILTER VERIFIER ERROR:   A filter has completed an operation but does not implement name provider callbacks. For example if a filter completes an IRP_MJ_CREATE and is not a name provider it is possible that during name queries the FO will make it's way to the file system and bugcheck. (Filter = Xxx, Cbd = Xxx)

This is a new check in Win7 and I get a lot of questions about it. In order to understand what it means we should talk briefly about how FltGetFileNameInformation and friends work.

In order to normalize a file name Filter Manager will get a path to a file in a number of different ways, including by querying the file system for it, and then it will open directories on that path and query for long names. In order to optimize this as much as possible (as you can imagine this is rather time consuming… incidentally, a minifilter should use opened names where possible as the perf is better) the decision was made to implement a separate mechanism from the normal IO path. Minifilters that change the path in any way should implement some additional callbacks, which we refer to as “name provider callbacks” and filter manager will call those callbacks when it needs to generate and normalize a name. If a minifilter does implement those then it is called a name provider. The best part about this is that if a minifilter is not a name provider then it can stay out of the name resolution path completely. If none of the minifilters on a volume are name providers then filter manager can skip all of them when it tries to resolve a name and go directly to the driver below.

How does this tie into the verifier check above ? If filter manager sees a filter successfully complete any operation that might have an impact on the namespace (like a IRP_MJ_SET_INFORMATION with FileRenameInformation or FileLinkInformation or FileShortNameInformation and so on) then it expects that the minifilter implements the name provider callbacks as well. Now let’s say that the minifilter successfully completed a rename for file “A” to file “B”. If the minifilter in question does not implement name provider callbacks, filter manager will not even ask it for the name. It will go straight to the name provider below (or the file system) and it will get back the old name , “A”. This clearly breaks the abstraction that each filter must implement.

There is one more operation other than those that change names that is particularly important. If a minifilter successfully completes IRP_MJ_CREATE (in this particular case returning STATUS_REPARSE does not count as success) then the FILE_OBJECT should NEVER under any circumstances be seen below that minifilter (the file system will try to interpret the private fields in it as its own, which can lead to bugchecks or data corruption). So if a minifilter successfully completes a create but doesn’t implement name provider callbacks then name queries will bypass it completely and the FILE_OBJECT will be shown below.

So now that we explained all the context, let’s go into what’s allowed and what’s not:

  • a minifilter can always fail any operation, including IRP_MJ_CREATE and IRP_MJ_SET_INFORMATION without implementing name provider callbacks. Also, completing an IRP_MJ_CREATE with STATUS_REPARSE counts as a failure in this context, even though STATUS_REPARSE is a success code.
  • if a minifilter successfully completes an IRP_MJ_CREATE (excluding STATUS_REPARSE), even if it doesn’t change the name space at all (like an SFO (shadow file object) type of filter) it must still implement name provider callbacks, even if they do nothing and are only passthrough (meaning that they get the name from below by calling FltGetFileNameInformation and return it). This minifilter is called the owner of the FILE_OBJECT.
  • if a minifilter implements any type of name space virtualization (which means that if there is any difference at all between the namespace above the minifilter and the namespace below the minifilter) then it must implement name provider callbacks and it must also implement (or at least make a conscious decision not to support some of the features) all the other name support operations, like directory enumeration, setting and querying short names, directory change notification, dile IDs, renames, hardlinks, reparse points and so on. This is true even if the minifilter does not own any file object (i.e. it never completes an IRP_MJ_CREATE but rather it just changes where files are on the file system).
  • a minifilter should NEVER skip any portion of the IO stack because there is no way to know if the FILE_OBJECT it cares about belongs above or below the filter. For example, a minifilter should never take a FILE_OBJECT and do something like allocate an IRP and send it directly to the file system below, since it might bypass the owner of that FILE_OBJECT. A far more common scenario of this same behavior is a minifilter that receives a HANDLE from its user mode component and it calls ObReferenceObjectByHandle and then it call some FltXxx function with that FILE_OBJECT, without realizing that the owner of that FILE_OBJECT might be above itself.

I hope this makes sense. As always, if you have any questions please don’t hesitate to ask.