This afternoon I was asked a question that comes up once or twice a year by email. I answered as usual, and only later realized that the topic was a reasonably good match for this month’s TSQL Tuesday topic. I thought I might share the question, and the answer, with a larger audience this time.
Every once in a while I’ll receive a question about a monitoring batch whose spid has gotten hung, and attempts to kill the session simply leave the spid in a rollback state for hours or days on end. Almost invariably these sessions are presenting with an external wait type – usually either MSQL_XP or one of the PREEMPTIVE_OS_* wait types. The usual suspects are calls to xp_fixeddrives and other extended stored procedures backed by xpstar.dll and the sp_oa* procedures that support OLE Automation.
Here is an example of a query which will cause the problem. (Note that enablement of xp_cmdshell is a security hazard and thus this code should only be attempted on a test machine)
exec xp_cmdshell 'powershell -command "sleep 1"'
When executed, this command will execute for exactly 100 seconds. Executing the following code will yield some information about what the session is doing.
select session_id, wait_type, command from sys.dm_exec_requests where session_id = <session_id>
If the session running xp_cmdshell is killed, the session will move to a killed/rollback state, but retain the same wait type.
It will remain in this state until the 100 seconds elapse.
Now, this is all well and good for a session that will die after 100 seconds, but what about a session which is truly and intractably hung? Or what if it was a 10,000 second wait? Is there anything to be done?
Well, it depends. In this specific case, yes. The xp_cmdshell extended stored procedure is running its command out of process to SQL Server. In fact, you can see the life of the session using Sysinternals ProcMon. Here I’ve reduced the wait to 1 second just to capture it in a neat window.
SQLServer creates the cmd.exe process and passes it the command, which in turn calls powershell. Each of these processes is a full fledged process visible in Process Explorer.
So in this particular case, if the sleep command was for 10,000 seconds, after the session kill failed, it would be possible to end the session by killing powershell.exe using Process Explorer – and that does indeed work.
However, there are other external waits which cannot be killed in this way, and these are generally a bit more gnarly to work with. For example, on my machine, it takes about 7 second for the following command to run:
And if I watch it in procmon, I can see that it is SQL Server’s own process which is walking the directory tree.
Now, this command is having no trouble, and I can kill it quite easily. But as you can see, there’s no process other than SQL Server itself involved.
What I usually see is that either xp_fixeddrives, xp_readerrorlogs, xp_enumerrorlogs, or one of the OLE Automation stored procedures gets snagged up while running in process to SQL Server. There’s nothing to kill other than SQL Server itself in this case, and the process generally has to remain a zombie until the underlying problem is solved and/or the SQL Server service is cycled.
In the case of OLE Automation stored procedures specifically, it is possible to run these commands out-of-process to SQL Server just as in the example with xp_cmdshell above. Sometimes this can be done simply by setting the context flag of sp_OACreate. In other cases, it may be necessary to fiddle with registry settings to change the threading model of the assembly being loaded from “Apartment” to “Both” – something which, it goes without saying, is best tried in a test environment first.
So, that’s my experience with stuck spids and external wait types. If you’re lucky, when you get stuck with one of these, it’s an out of process wait. If you’re not… well, you may be waiting for a while longer.