Part 1 was interesting; it isn't clear why he split that into a Part 2 since it adds little to the story and is a paragraph long.
londons_explore 23 hours ago [-]
I assume the fact it is a third party application means debugging gets harder, and the business case for doing so is weaker/none.
But I would hope that some kind of reverse debugger triggered on one of these crashes would make it pretty simple to say "who wrote this 01".
garaetjjte 17 hours ago [-]
You usually hope that TTD points to the culprit in such situations. But I once encountered a single-byte corruption that didn't make any sense in the TTD trace: the value was good at the write and the next read was garbage. I never discovered whether that was a CPU bug, corruption by GPU shaders, stray kernel writes, or whatever. (I think it's unlikely that a CPU bug would manifest in both native and TTD-instrumented runs. The corrupted byte was inside heap-allocated memory, so it shouldn't be in GPU page tables at all. Kernel writes wouldn't appear in a TTD trace, so I really think that was the most likely issue, but how to debug that...)
microgpt 22 hours ago [-]
You could also look at modules loaded into all of those processes that crashed this way.
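A minimal sketch of that idea, assuming the module list of each crashed process has already been extracted from its dump. The dump names, module names, and the `common_modules` / `module_frequency` helpers are all made up for illustration:

```python
from collections import Counter

def common_modules(dumps):
    """Return the modules that appear in every dump (case-insensitive), sorted."""
    module_sets = [{m.lower() for m in mods} for mods in dumps.values()]
    if not module_sets:
        return []
    return sorted(set.intersection(*module_sets))

def module_frequency(dumps):
    """Count in how many dumps each module appears."""
    counts = Counter()
    for mods in dumps.values():
        counts.update({m.lower() for m in mods})
    return counts

# Hypothetical data: one entry per crash dump.
dumps = {
    "crash_001": ["ntdll.dll", "kernel32.dll", "shell32.dll", "vendor_hook.dll"],
    "crash_002": ["ntdll.dll", "KERNEL32.DLL", "vendor_hook.dll", "d3d11.dll"],
    "crash_003": ["ntdll.dll", "kernel32.dll", "vendor_hook.dll"],
}

if __name__ == "__main__":
    print(common_modules(dumps))
    print(module_frequency(dumps).most_common(5))
```

System DLLs will show up in every process, so the interesting entries are the ones that are common to the crashing processes but rare elsewhere.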
rramadass 21 hours ago [-]
Part-2 is more than a paragraph and is logically distinct from Part-1. In this, Raymond actually gets the crucial clue from another colleague's debugging efforts which leads him to identify that the bottom byte of HMODULE of the DLL gets overwritten by <something> which is the root cause of the bug; viz.
The “DLL unmapped from memory” crash is just an alternate manifestation of the “somebody is writing 01 bytes to places they shouldn’t” bug. The original bug had a larger bucket spray than we initially thought.
Part-2 is the essence of the solution while Part-1 is a series of investigations and inferences.
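To make the "bottom byte gets overwritten" part concrete, here is a small sketch of what a stray one-byte write of 01 does to a pointer-sized value stored in little-endian memory (as on x86/x64). The handle value is invented, and this only illustrates the arithmetic, not what the loader does with the result:

```python
import struct

def stray_byte_write(value, byte=0x01, offset=0):
    """Simulate a single stray byte write into a 64-bit little-endian value."""
    buf = bytearray(struct.pack("<Q", value))  # memory layout of the variable
    buf[offset] = byte                         # the rogue write
    return struct.unpack("<Q", buf)[0]

# Hypothetical module handle; real values differ per process.
original = 0x00007FFB12340000
corrupted = stray_byte_write(original)

if __name__ == "__main__":
    print(hex(original), "->", hex(corrupted))
```

Because the lowest-addressed byte holds the least significant bits, a write to the variable's own address changes only the bottom byte, leaving a value that still looks plausible at a glance.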
taneq 1 days ago [-]
Might have been an “I need to look into this” segueing into “never mind”?
zabzonk 1 days ago [-]
> The good news for the shell32 team is that they are off the hook; they are the victim. The bad news is that we don’t know who the culprit is.
The story of software development through the ages.
brookst 1 days ago [-]
When you’ve eliminated all possible explanations, it’s time to pack it in.
taneq 1 days ago [-]
Oh man, my journey from idealistic “there is always an explanation” youth to “some days it do be like that, and we may never know why” in a nutshell.
jamesfinlayson 9 hours ago [-]
Yeah I'm nearly 15 years into my career and still once or twice a year I have a moment where I think - there is no way that should ever have worked and I really don't know why it ever worked and didn't cause any issues.
zabzonk 22 hours ago [-]
Or, as the original article suggests, blame someone else.
rwmj 24 hours ago [-]
What MSFT support policy do you need to have the legendary Raymond Chen take a look at it?
I say this because we've reported a bunch of Windows bugs (mainly running Windows under virtualization) and getting them to pay attention at all is an up-hill battle.
hackyhacky 24 hours ago [-]
> What MSFT support policy do you need to have the legendary Raymond Chen take a look at it?
If you have to ask, you can't afford it.
xmodem 18 hours ago [-]
If you can reproduce it reliably and doing so generates some form of telemetry, then just set up some automation to keep doing that in a loop. From as many machines as possible.
No comment is offered on if I have ever gotten a bug noticed this way.
DANmode 19 hours ago [-]
Perhaps it’s a matter of subjective interest!
Often that’s how these things go.
1970-01-01 23 hours ago [-]
>I asked for the 100 most recent crashes in that third party program and put them into a pivot table so I could see the distribution.
Always wondered if crash reporting is some kind of shady business. It's good to know it does, at minimum, do what it promises and give valuable crash data to MS.
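The pivot-table step quoted above amounts to counting crashes per bucket and looking at the percentages. A minimal sketch, assuming each crash record carries a `bucket` field (the field name and the sample records are assumptions, not the real crash-report schema):

```python
from collections import Counter

def bucket_distribution(crashes):
    """Return (bucket, count, percent) rows, most frequent bucket first."""
    counts = Counter(c["bucket"] for c in crashes)
    total = sum(counts.values())
    return [(bucket, n, 100.0 * n / total) for bucket, n in counts.most_common()]

# Hypothetical sample standing in for "the 100 most recent crashes".
sample = (
    [{"bucket": "dll_unmapped"}] * 3
    + [{"bucket": "stray_01_write"}] * 2
    + [{"bucket": "other"}]
)

if __name__ == "__main__":
    for bucket, n, pct in bucket_distribution(sample):
        print(f"{bucket:<16} {n:>3} {pct:5.1f}%")
```

A distribution like this is what makes bucket spray visible: several buckets that look unrelated can turn out to share one underlying cause.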
DANmode 19 hours ago [-]
Shady, as in, where does that data end up after MS collects it, and why?
kumarvvr 1 days ago [-]
I see posts like this, this deep dive into the call stacks and am always humbled and reminded of the limits of my knowledge about computers and programs.
rramadass 21 hours ago [-]
These sorts of bugs require a lot of knowledge about a) Windows internals and b) the tools to debug at that level. Most application-level programmers neither need these nor are exposed to them.
However, if you are interested in knowing what all is involved, see Advanced Windows Debugging by Mario Hewardt and Daniel Pravat - https://advancedwindowsdebugging.com/
Goes both ways, author probably knows little about FPGA programming, React or PyTorch.
Panzer04 1 days ago [-]
Not a programmer?
kumarvvr 1 days ago [-]
I am, for 20 years now. I do embedded stuff too. Still.
Panzer04 24 hours ago [-]
I'm a bit surprised you don't run into things like this then :). Do you use GDB and the like at all?
Or do you mean all the Windows-specific stuff etc.? I guess I was more imagining the call stack etc.
No insult was intended XD
FartyMcFarter 24 hours ago [-]
As someone who has debugged his fair share of tricky low-level issues, the parts that I find impressive in his blog posts are things such as "then we look at the bytes in memory and oh yeah, this looks like an exception record". I would usually not think to do that (or be able to recognise it as easily as I presume he did).
Chu4eeno 19 hours ago [-]
I assume it's mostly just something you learn to recognize after decades of poking at the same things. I remember being impressed with Thiago (Qt developer) being able to immediately tell if a pointer was heap allocated, invalid/unaligned, etc. until I spent more time poring over /proc/*/maps and in gdb.
Never figured out how he could tell someone's Qt version just from an strace excerpt, though.
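The /proc/*/maps habit can be sketched in a few lines: parse the mappings and see which region, if any, a pointer falls into. The sample text below is invented, and `parse_maps` / `classify` are hypothetical helper names; note that large allocations often live in anonymous mappings rather than `[heap]`, so this is only a rough first guess:

```python
import re

# address range, perms, offset, dev, inode, optional pathname
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$"
)

def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        m = MAPS_LINE.match(line.strip())
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            regions.append((start, end, m.group(3), m.group(4) or "[anon]"))
    return regions

def classify(addr, regions, alignment=8):
    """Describe where addr lands: region name, permissions, and alignment."""
    for start, end, perms, name in regions:
        if start <= addr < end:
            aligned = "aligned" if addr % alignment == 0 else "unaligned"
            return f"{name} {perms} ({aligned})"
    return "unmapped"

# Invented sample in the /proc/<pid>/maps format.
SAMPLE = """\
55d0c8a00000-55d0c8a21000 rw-p 00000000 00:00 0                          [heap]
7f3a1c000000-7f3a1c1c0000 r-xp 00000000 08:01 1234                       /usr/lib/libQt5Core.so.5
7ffd4e5e0000-7ffd4e601000 rw-p 00000000 00:00 0                          [stack]
"""

if __name__ == "__main__":
    regions = parse_maps(SAMPLE)
    for addr in (0x55D0C8A00010, 0x55D0C8A00011, 0x7FFD4E5F0000, 0x1):
        print(hex(addr), "->", classify(addr, regions))
```

After enough time staring at real maps output, the address ranges themselves become recognizable, which is presumably the skill being described.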
toast0 18 hours ago [-]
> Never figured out how he could tell someone's Qt version just from an strace excerpt, though.
Sonames might be a big clue? Otherwise, initialization order changes maybe? Sometimes there's enough file content in an strace to be able to see a strong indicator?
Those are just guesses, I do a lot more debugging with pcaps rather than straces. Although you do often want to determine which side of the syscall caused whatever you're seeing in the pcap.
kumarvvr 24 hours ago [-]
I have done everything from desktop apps to web apps and a bunch in between. Regular debugging is good enough for me. Never had the need to go down into call stack level.
Even with embedded programming, regular C debugger has always been enough.
defrost 1 days ago [-]
That's some doggedly determined back tracing to uncover an unexpected heisenbug (loose meaning).
So a total of 46% of the crashes were due to this rogue force-unload of a DLL. This is a case of bucket spray, where a single underlying cause generates a large number of different types of crashes.
chrisjj 1 days ago [-]
We've not yet seen sufficient evidence this is any type of heisenbug.
brookst 1 days ago [-]
Looking more closely would resolve it one way or the other.
defrost 1 days ago [-]
My hat.
defrost 1 days ago [-]
It's not, by the article, in a strict taxonomy.
In a wider sloppier sense some use the term for bugs that are hard to pin down and exhibit wide behaviours.
nopurpose 23 hours ago [-]
How big and important must a third-party vendor be for Raymond Chen to dissect its core dumps?
FartyMcFarter 23 hours ago [-]
Given his seniority, it could also be that he picks whatever bugs he wants to work on. Whether that is from personal interest, frequency of crashes or any other criteria.
When you're at that level in a company, it's rare that someone would be micromanaging what you work on at all times.
IChooseY0u 22 hours ago [-]
Windows COM is super weird and way over engineered.
rpeden 22 hours ago [-]
I actually think COM is an amazing bit of engineering considering its intended use case.
It still feels like a much more advanced way of sharing compiled libraries between different languages than the current default of "export a C ABI and communicate across the barrier via primitive sticks and stones."
COM isn't perfect but I still find it impressive especially since COM/OLE are 40 years old at this point.
microgpt 22 hours ago [-]
It basically is that. It's a standardized sticks and stones. Plus objects for some reason. But I don't think the objects are a bad thing - they allow multiple implementations of something to co-exist - consider using two different GPUs from different vendors at the same time. It took a really long time and a bunch of hacks to make OpenGL support that, but DirectX could always do it (at least at the API level) by just creating two different ID3DDevice objects backed by different code from different DLLs both loaded at the same time.
OpenGL basically loads the GPU driver DLL that directly implements the OpenGL functions while Direct3D uses a COM object with a vtable so it can easily have two different ones.
hackrmn 22 hours ago [-]
The fact that Raymond Chen is debugging these kinds of issues tells me Microsoft is short on staff with his particular set of skills, handing him the hairiest issues from the annals of Windows. The new hires are probably all about .NET and JavaScript and what have you -- whatever Microsoft is about these days. I doubt it's C/C++. Chen is probably on standby and is paid handsomely as a de facto VIP consultant. He is a legend, but he's becoming somewhat of a vintage developer.
kjellsbells 13 hours ago [-]
I do wonder how Microsoft will manage the transition of the NT generation. Raymond Chen has been doing this kind of work for thirty years. He probably has, what, another ten years, max? Who are the next generation of Windows gurus that will take up the mantle?
Hanselman is good on the blog part, but not in Chen's class re Windows domain expertise. And Windows is not, despite all that is said, going away anytime soon. I think this could be a real problem.
Hopefully there is a set of 25 year old developers in the Windows team who have deep and growing skills in Winternals, and Microsoft have the good sense to encourage them in their career.
forestry 21 hours ago [-]
Managed dump analysis in windbg was a thing. It’s been many years since I’ve needed it, though. Service telemetry improved quickly thereafter.
algorithmsRcool 17 hours ago [-]
It is still a thing, just not a very common one since the debugger in VS has become more ergonomic and powerful. But windbg is still the king here for the most advanced analysis of both managed and unmanaged code, and it isn't even close, to be honest, once you learn the arcane commands and incantations.
antonvs 21 hours ago [-]
Feed the info and code to Claude, it'll diagnose and fix this. You're welcome, Microsoft.
https://devblogs.microsoft.com/oldnewthing/20260626-00/?p=11...
Review of the book by Raymond Chen himself! - https://devblogs.microsoft.com/oldnewthing/20071218-01/?p=24...