640KB 100% are enough for everyone. I am talking about the Task Manager % CPU Utilization Graph and why it’s 100% Utilization in the overall CPU % Utilization view is sometimes wrong, often misleading, and nearly aways useless for performance troubleshooting, although working correctly according to the specs.
Update2
The fixed Task Manager has arrived in Windows 11 24H2 and later. CPU utilization is now representing actual core usage in CPU graph. CPU Utility was added to Details view and is no longer capped at 100%.
This is great news for everyone who wants to quickly see if some process is using more CPU than it should without the exaggeration of frequency scaling.
We are beginning to roll out a change to the way Task Manager calculates CPU utilization for the Processes, Performance, and Users pages. Task Manager will now use the standard metrics to display CPU workload consistently across all pages and aligning with industry standards and third-party tools. For backward compatibility, a new optional column called CPU Utility is available (hidden by default) on the Details tab showing the previous CPU value used on the Processes page.
Thanks!
Update 1 End
I have on my HP Laptop an i5-1345U CPU which has 2 Hyper threaded Power Cores and 8 Efficiency Cores which in total accounts to 10 Cores and 12 Logical Processors (+2 for Hyper threaded cores) according to Task Manager. The CPU % Utilization value is 100% which means that all 12 Logical Processors are working at 100%. Really?
Let’s see more by switching from the Overall Utilization to Logical Processors view:
5 Cores are at 100%, but 7 cores are nearly Idle. That is unexpected and the cause for a lot of confusion when someone uses Task Manager to check for CPU hogs. 100% Utilization in Task Manager does not mean that all Logical processors are in use, as one (naively) would expect it to be the case.
What really bugs me is that even with the latest Windows 11 version, this issue has not been addressed.
If 100% Utilization does not mean 100% of all Logical Processors are in use, what is displayed here? Let’s take a sharper look, what is visible in Task Manager when I max out 5/12 Logical Processors with the CPUSTRESS tool from Sysinternals.
Calculating the % Logical Processors Utilization (which is the % Processor Time performance counter value) we arrive at 5/12*100% = 41%. This value is shown in the Details view of Task Manager. If you sum up the numbers in the Details view, you will arrive at ca. 43% in total. But Task Manager shows 100% Utilization in the CPU overview. These inconsistent numbers in task manager (by designTM ) get worse with every new CPU generation and makes the overall CPU graph useless. I have asked some colleagues if they have noticed these inconsistencies. Yes they do, but assumed they are just not smart enough to make sense of the displayed data.
In effect Task Manager always shows 100% CPU Utilization starting at 41% Logical Processor Utilization. The main task of Task Manager: To check if something uses too much CPU becomes much harder than it should be.
Ok the number is not meaningful, but it is correct. What do we see here? The value visualized by Task Manager in Overall % CPU Utilization is the performance counter % Processor Utility which is different from the values shown in the Details view which is using % Processor Time. Its description is:
Processor Utility is the amount of work a processor is completing, as a percentage of the amount of work the processor could complete if it were running at its nominal performance and never idle. On some processors, Processor Utility may exceed 100%.
If the CPU goes into any Turbo/Throttle mode, Task Manager will show a scaled value which is proportional to the current clock rate. Based on my measurements, it seems to be roughly:
The intention of this value is that if the CPU clocks higher an application can run faster. It reflects the actual CPU processing performance. While well-meaning there is a big issue with today’s CPUs which aim for power efficiency. The base frequency gets lower, but the turbo frequencies get higher, making the scaling factor (fCurrent/fBase) large. If e.g. you have on a 10 core system one core fully utilized you have 10% utilization (%ProcessorTime). When the multiplier fCurrent/fBase is huge e.g. 10 then a single turbo boosted core would show as 100% Processor Utility in Task Manager which is correct by design but misses the point that people use Task Manager to identify CPU hogs.
My current laptop CPU has a low 1.6 GHz base frequency, but can boost up to 4.12 GHz. This is a factor 2.5 more than the base frequency. The displayed value in the graph view of Task Manager is the Logical Processor Utilization 41% (5/12) multiplied with the frequency factor of 2.5 which results in 102%. Yes this is bigger than 100%. Utilization Values > 100% are unintuitive to the average user, so Microsoft decided to display an at 100% capped value in Task Manager. This is the reason it shows 100% CPU Utilization at >= 41% Logical Processor Utilization, but the value will not increase anymore if more CPUs are in use.
If you visualize the Performance Counters
% Processor Time (Unscaled per core utilization used everywhere else in Task Manager 100% is max)
% Processor Utility (Frequency scaled core utilization used in CPU Overall Utilization view can go significantly above 100%)
in Perfmon you find that the % Processor Utility is not capped at 100% and rises at the beginning and then drops a bit because the CPU is getting hot and thermal throttling kicks in which reduces CPU frequency. Below is data for a different CPU (I5 13600 KF) shown:
When all cores are in use, I want to see it in Task Manager easily in the overall CPU Utilization graph. The per-core view is cumbersome — if not all cores are active, interpreting the graph becomes difficult, and the 100%-capped Processor Utility number adds even more confusion.
To make things even more complicated, there are since several years Power and Efficiency cores used by Intel mainstream CPUs which differ in performance roughly by a factor 3. P and E cores are also usually running on different top clock speeds, which makes calculation of an average frequency of busy cores challenging. Added complexity enters the arena with smart per core scheduling by the OS based on policies which can prefer P/E cores or a mixture of both.
The original goal to visualize application speed with a frequency scaled CPU Utilization view has become a meaningless metric on today’s CPUs.
The best solution in my point of view is to switch back to the old-fashioned % Processor Time graph and be done with it. Adding more smartness will just cause more confusion and inconsistencies across the displayed numbers in Task Manager.
A single, meaningful number representing application speed might be possible, but calculating it would be highly complex. Several factors must be considered, including:
Core Type (P-core vs. E-core)
Maximum Core Frequency
Performance varies significantly between P-cores and E-cores depending on the instruction set used, with a rough scaling factor of about 3. Additionally, maximum core frequency is influenced by factors like CPU cooling efficiency and ambient temperature.
This complexity makes it difficult to create a CPU performance graph that accurately represents application performance on a 0-100% scale. Furthermore, it’s a moving target—what qualifies as “100%” can vary based on operating conditions. For instance, a CPU running at 0°C vs. 40°C may experience different thermal throttling states, affecting frequency and overall speed.
Now you see why Task Manager can be confusing. Let’s hope MS makes this not even more rocket science and gives performance practitioners easily understandable, uncut and consistent diagnostics.
One of the harder lesson during development is how to deal with leaks. Most of the time you are leaking memory but sometimes also OS resources like Handles. These are more difficult to detect because the application does not use excessive amounts of memory when leaking OS resources. Leaks are not just wasteful; they can also be a source of hard-to-spot bugs. Below is a list of handle types that can be leaked. There are many uncommon handle types, primarily used by Windows itself:
ALPC Port
Composition
CoreMessaging
Desktop
Device
Directory
DxgkCompositionObject
DxgkDisplayManagerObject
DxgkSharedResource
DxgkSharedSyncObject
EnergyTracker
EtwConsumer
EtwRegistration
Event
File
FilterCommunicationPort
FilterConnectionPort
IRTimer
IoCompletion
IoCompletionReserve
Job
Key
Mutant
Partition
PcwObject
PowerRequest
Process
RawInputManager
Section
Semaphore
Session
SymbolicLink
Thread
Timer
TmEn
TmRm
TmTm
TmTx
Token
TpWorkerFactory
UserApcReserve
WaitCompletionPacket
WindowStation
WmiGuid
Most of the time, you are leaking event, file, process, and thread handles. Common issues include file sharing violations when you forget to close a file before opening it again. If you forget to close thread or process handles, you will see process/thread IDs reaching the range of millions. Named events, if not closed properly, can lead to hard-to-spot race conditions. When you create a named event with the same name again, you are merely reopening the existing one, which might already be in an unexpected signaled state. This can result in unexpected application behavior.
So how can you track down handle leaks? There are a few commercial options that hook into the relevant APIs, but Windows has a built-in mechanism to track handle leaks using Event Tracing for Windows (ETW).
Since Windows 10, Windows Performance Recorder (WPR) is included, allowing you to start tracking your handle leaks right away. However, there’s a problem: the WPR version in Windows 10 is broken and often returns the infamous error message when you stop profiling:
wpr -stop problem.etl
Cannot change thread mode after it is set.
Profile Id: RunningProfile
Error code: 0x80010106
That approach works, but the main problem with handle tracing is that over 90% of the handle data is related to events, which quickly fills up your ETW in-memory ring buffers. ETW is usually configured to record into memory ring buffers, which cannot exceed about 10% of your physical RAM. This limitation severely impacts the usefulness of ETW handle tracing if you need to track handles over an extended period (e.g., hours).
However, there is a solution. My recording UI, ETWController (available on GitHub: ETWController), includes a handle type filter that allows you to limit handle tracing to a specific handle type. To my knowledge, this functionality is not yet exposed via xperf or WPR. This filtering capability can help you effectively manage the amount of traced data and track handle leaks more efficiently over extended periods.
That allows you to record for not so common handle types for a very long time. If you need to record even longer, you can modify the supplied Handle profile to remove the CloseHandle stackwalks which will reduce the amount of traced data by ca. 30-40%. Stack traces for an ETW event are usually much larger than the actual traced data.
So, how do you know which handle type you are after? The great Process Explorer clone, System Informer, can help. Here’s how:
Open System Informer.
Double-click on the process you’re interested in.
Select the “Handles” tab.
Click on the “Options” menu and choose “Statistics”
This will provide you with a handle type summary, which is a good first step in identifying the leaked handle type. This summary helps you determine which specific handles to focus on, making it easier to use tools like ETWController for more precise tracking and troubleshooting.
After having recorded the data you can track down your leaks in WPA where you can add Handles to your analysis pane from the Memory view.
A useful grouping is Creating Process, Handle Type, Object Name, Lifetime, Close Stack, Create Stack left of the yellow line. That way, you can quickly find all not closed (aka leaked) handles and directly go to the leaking stacks.
That works nicely, but sometimes you need to find correlations between different handles and look at the timing. My tool ETWAnalyzer converts the binary ETL file into compressed Json files which are much more accessible but still hard disk friendly. To extract from an ETL file handle data you can use:
ETWAnalyzer -extract ObjectRef -fd yourETW.etl -symserver ms
C>ETWAnalyzer -extract ObjectRef -fd c:\temp\Boot_HandleTrace_MAGNON.etl -symserver ms
1 - files found to extract.
Success Extraction of c:\temp\Extract\Boot_HandleTrace_MAGNON.json7z
Extracted 1/1 - Failed 0 files.
Extracted: 1 files in 00 00:02:37, Failed Files 0
Then you can query the data with ETWAnalyzer in console mode
To see all handle events (-Dump ObjectRef) of explorer (-pn or -ProcessName) for the registry key \REGISTRY\USER\S-1-5-21-3592500153-2310523904-3374296524-1001 (-ObjectName) end up with this query:
which tells you that 5 kernel objects and 5 handles have not been closed during the recording. To show just the leaking handles you can add -Leak to the previous query which will just show the events which were not yet closed during the recording:
To view the stack traces of the pending allocations, add -ShowStack and you have your leaking stacks
You need to check the stacks if that is a real leak or simply a case of normal application behavior. For 5 handles it is most likely not a leak and explorer just did not yet get around to release them yet. But if you loose thousands of handles in such stacks, then you might have found a leak.
Handles reference kernel objects which are reference counted. You can open an existing kernel object (e.g. a named event) and just increase the reference count by one. The underlying kernel object is only released when the last handle reference is closed.
ETWAnalyzer shows object activity sorted by time grouped by
Handle Create
Handle Duplicate
Handle Close
calls. This makes it easy to spot handle leaks caused by incorrect threading where you e.g. have a data race and assign a just created handle value to a memory location while another thread did the same. Due to missing locking, you will “forget” aka leak the other handle which is then never closed. The code below is an example of such a data race if you call GetOrCreateEvent from multiple threads:
class EventManager
{
private:
HANDLE m_Event;
int m_Id;
public:
HANDLE GetOrCreateEvent()
{
// Here is a lock missing!
if (m_Event == nullptr)
{
m_Event = CreateEvent(m_Id);
}
return m_Event;
}
~EventManager()
{
if (m_Event != nullptr)
{
BOOL lret = ::CloseHandle(m_Event);
m_Event = nullptr;
}
}
....
What are hybrid CPUs? Hybrid CPUs are CPUs which have fast, but power hungry cores and slower but less energy consuming cores integrated into a CPU. All consumer grade desktop CPUs from Intel since Alder Lake (CPU number 12xxx or greater which was released in 2022) are hybrid CPUs. If you have recently bought an Intel CPU the chances are nearly 100% that you have bought a hybrid CPU.
In theory, everything should run faster with newer CPUs. But in practice, there are issues, if you want to run software which was not prepared for hybrid CPUs. E.g. if you need to replace old machines with newer ones which run your software on hybrid CPUs which still needs Windows 10.
There are many great reviews online which talk about the performance of hybrid CPUs on Windows. Some have also noticed that there are issues with processes with below normal priority:
Intel gave an example of a content creator, exporting a video, and while that was processing going to edit some images. This puts the video export on the efficiency cores, while the image editor gets the performance cores. In my experience, the limiting factor in that scenario is the video export, not the image editor – what should take a unit of time on the P-cores now suddenly takes 2-3x on the E-cores while I’m doing something else. This extends to anyone who multi-tasks during a heavy workload, such as programmers waiting for the latest compile. Under this philosophy, the user would have to keep the important window in focus at all times. Beyond this, any software that spawns heavy compute threads in the background, without the potential for focus, would also be placed on the E-cores.
Personally, I think this is a crazy way to do things, especially on a desktop. Intel tells me there are three ways to stop this behavior:
Running dual monitors stops it
Changing Windows Power Plan from Balanced to High Performance stops it
There’s an option in the BIOS that, when enabled, means the Scroll Lock can be used to disable/park the E-cores, meaning nothing will be scheduled on them when the Scroll Lock is active.
Many CPU hungry number crunchers like compilers, video editing, ray tracing and other software are trying to be nice to the user and lower the priority of their threads or computation child processes to let the user experience a still fluent system, although the CPU is doing hard work on all cores. That approach has worked for a long time pretty well, but it has some unintended side effects which are not well understood when hybrid CPUs enter the game.
Windows 10
Let’s do a simple experiment. We will use Windows 10 and CPU Stress from Sysinternals where we create 7 threads which run inside a Below Normal process. I use 7 because some things might already be running, so I give it some headroom on the 8th E-Core.
When we observe the CPU load in Task Manager we find that all threads of a Below Normal process run on the E-Cores only (the last 8 ones for I7-13700). If we raise the process priority back to normal, the load is scheduled on the P cores and all is fast if the load was e.g. an expensive compile or video editing step.
But we can achieve the same effect by simply selecting the CPU Stress window. That will cause Windows to raise the CPU priority of the foreground process, and suddenly mitigation option 1 mentioned by Dr. Ian Cutress makes sense. If the application has multiple processes which form a UI you might just put the focus on the right one to speed up things again.
Disabling the E-Cores (Mitigation 3) is possible in some but, not in all BIOS versions. What most offer is that you can forcefully park the E-cores, so no work is scheduled to them. In my BIOS this is called Legacy Game Mode which is activated with the Scroll Lock hot key.
That fix does work but you will need to remember always to toggle the hotkey to turn off the E-Cores. Besides the usability issue the main problem is that real multithreaded applications which utilize e.g. all cores will experience large slowdowns because they suffer now from thread oversubscription. The application sees 20 cores (e.g. I5-13600K) but due to core parking 20 threads are squeezed on 12 cores which means that 8 threads will compete to run on the same cores leading to thread contention.
Power Plans
Switching to the High Performance Power Plan (Mitigation option 2) works. What is not explained why this works. Or what are the differences between the Power plans on Windows? On the surface it looks easy. You have 3 default plans
Power Saver
Balanced
High Performance
and you can choose one of them.
By default there are only two settings visible in the UI
Minimum/Maximum Processor State
and that’s it. But in reality on a current Windows 11 (23H2) machine you have 75 different settings. Are all of them relevant? No. By far most of the settings are from Intel which configure SpeedStep/SpeedShift settings for quicker voltage/frequency changes than the OS is able to do when the CPU load changes.
That are many settings. What are the differences between the three main power plans? I have updated ETWAnalyzer to answer that question.
To dump all files with the file name Win10 printing only the properties which have differences
ETWAnalyzer -dump power -fd win10 -diff
Then you get a nice list of the changes. Now we also know why changing the Power profile from Balanced to High Performance will solve the E-Core oversubscription issue:
The High Performance Profile has as Hetero Thread Scheduling Policy PreferPerformantProcessors enabled instead of Automatic which apparently means prefer E-cores. That’s the setting which fixes the Low priority process CPU problem on Windows 10.
Since we can compare power profiles it is now also easy to compare them between Windows 10 and 11 by exporting the data to CSV with ETWAnalyzer. After having a lot of columns I just needed to transpose the table and arrived at this:
Orange marked entries are new on Windows 11 which did not exist on Windows 10. Red fields are values which have changed in Windows 11. There are quite a few changes visible, but only a few of them make a measurable difference with simple load tests.
E.g. the main settings which changes everything between Balanced and Power Saver is the MaxThrottleFrequency % Class 1 or Maximum processor state for Processor Power Efficiency Class 1 which defines how high the P-Cores can boost. The Power Saver profile sets that to 75%. Similar is the MaxEfficiencyFrequency MHz Class 1 or Maximum Processor frequency for Processor Power Efficiency Class 1 setting, which let’s you control the P-core frequency directly. If set, it will override the MaxThrottleFrequency % setting.
Windows 10/11 Scheduling Differences
The main knobs in the power profile are
Heterogenous policy in effect
Values 0-4
Default is 4 on Windows 10
0 Windows 11, except for Power Saver profile where it is 4.
Heterogenous thread scheduling policy
Values 0-5
Default is Automatic on Windows 11
PreferPerformantProcessors or Automatic on Windows 10 depending on used Power Profile
0=AllProcessors
1=PerformantProcessors
2=PreferPerformant
3=EfficientProcessors
4=PreferEfficient
5=Automatic
I have created a simple load test which runs on 12 cores for 10s which runs at 3 process priorities (BelowNormal, Normal, AboveNormal). Then I configure in a script all possible combinations for both polices and let them run. Here is the code of my simple LoadStress app:
namespace LoadStress { internal class Program { static void Main(string[] args) { var total = Stopwatch.StartNew(); Action acc = () => { var sw = Stopwatch.StartNew(); while (sw.Elapsed.TotalSeconds < 10) { } };
var delegates = Enumerable.Repeat(acc, threads).ToArray();
Parallel.Invoke(delegates); total.Stop(); Console.WriteLine($"Did start {threads} threads in process with priority {prioEnum} in {total.Elapsed.TotalSeconds:F1} s"); } } }
The key insight I did use was that I want to measure CPU scheduling and not CPU performance. I simply start n threads and do a while < 10s busy loop which is independent of the CPU frequency and type. That way I will always consume the same amount of CPU time regardless on which CPU type I am running I know that one thread will have consumed, if no thread contention did show up, 10 s.
One run will modify the current power profile (I did use Balanced) and ran the LoadStress application.
The test application consumes 120 s of CPU. Both tested CPUs (I5 13600 and I7 13700) have 8 E-cores and 8 and 6 P-cores with Hyper Threading enabled resulting in a total core count of 20 and 24. What is the most performant thread scheduling for 12 threads? In that case 8 P-cores should be used and 4 E-cores, or just P-cores are preferred then the load goes completely on 12 P-cores. Since E-cores are by a factor 2-3x slower than the P-cores, but if you put concurrent load on Hyper threaded P-cores then you are also not getting much faster than on a single P-core. In the end it depends on the specific load type if you gain anything from mixed P/E-core scheduling or not.
Based on that I would expect a total CPU consumption of 120 s if all is optimal 80 s load on the P-Cores and up to 40 s on the E-cores if the hyper threaded cores are skipped in favor of the E-cores.
If the combined CPU consumption of P/E-cores is below 120 s we have thread starvation issues because OS schedules multiple threads onto the same core leading to thread contention. If you are reading this you are usually aiming at 80-120 s of CPU spent on the P-cores and the remaining 40-0 s spent on the E-cores. Let’s check how Windows 10 and 11 behave when we change the settings and use either the Balanced or the High Performance profile as base profile while we are switching Hetero Policy and Hetero Thread Policy and Process Priority. This is already quite a matrix of 5*6*3*2 = 180 combinations which is difficult to visualize with any chart type. In the end I opted for a simple table with embedded graphs over the actual values which gives most insights.
If you aim to force the workload on Windows 10 of Below Normal processes on the P-cores you should use Hetero 3 which prefers Below Normal processes on the P-cores with the highest value of 110s/120s.
Using the High Performance profile as it was suggested by Dr Ian Cutress solves the problem only partially which use Hetero 4 and Hetero Thread scheduling PerformantProcessors. The many other configuration knobs did not make a measurable difference for our synthetic non spiked load.
Normally we would expect a total CPU of 120 s running on 12 or more cores whic is nicely visualized in WPA by logical CPU number.
But sometimes on Windows 10 strange things happen when we confine Hetero Thread scheduling to Efficiency or Performance cores. In this case not 12 cores were used but only 2 during the 10s runtime with 12 threads:
Why this is happening I am not sure. If you use the default Windows 10 policy Hetero 4 and confine the load via Hetero Thread scheduling on specific core types you can be penalized by Windows 10 scheduler. See Windows 10 – Hetero 4 – PerformantProcessors for AboveNormal and Normal processes where we get in the worst case only one core assigned during application runtime. Why is this an issue?
Because these are the standard settings for the High Performance Power Profile, which is the recommended solution as mitigation for the E-core preference in the Balanced Power profile! My tests prove that the recommended solution might make matters worse on Windows 10.
The used Hetero Profiles 0-4 are not documented by MS, but Intel did publish for 0 and 4 some documentation:
Setting Value: 0 (i.e., Standard Parking or Favored Core Parking) In this configuration, the optimum set of compute cores are unparked starting with the most performant cores first.
SettingValue: 4 (i.e., Hetero Parking) In this configuration, based off utilization, a combination of most performant or most efficient cores are unparked first. In certain scenarios like low power envelope SKUs or better battery life goals, it can be more efficient to run low utilization work on cores with higher efficiency capability at efficient frequency. This policy is used in these scenarios in combination with optimal performance state engine settings.
Based on my measurements we can summarize the scheduling behavior of the 5 Hetero Profiles on Windows 10 with
Hetero 0 Distribute load accross all cores with a preference on P-cores (Default on Windows 11).
Hetero 1 Same as 0.
Hetero 2 Confine work entirely to E-cores
Hetero 3 Distribute load on P-cores stronger than Hetero 0, but allow E-core usage.
Hetero 4 Prefer E-cores for low priority tasks and constrain load on some core types when PerformantProcessors or EfficientProcessors are used as Hetero Thread scheduling option (Default on Windows 10).
So what is best? Hetero Thread Scheduling Automatic will load our Below Priority process to the E-cores which is the default in Balanced Power Profile. The numbers look stable, just the wrong cores are used:
Now go back to the High Performance Profile defaults
We get P-cores but erratic scheduling where due to unkown reasons not all needed 12 P-cores are utilized. This issue happens sporadically and not all the time. If you have short running tests all will look fine but if you let the 10s LoadTester run 20 times you find erratic scheduling. Based on measurments I would say this is a bug in the Kernel Scheduler which is affecting CPU heavy workloads which use many cores. If the load is constant then it will settle after some time but the ramp up phase until the kernel gives all needed cores is many seconds as the tests have shown. This issue should hurt short lived processes which employ many threads on Windows 10 on hybrid CPUs with the High Performance Power profile.
Solution
Currently the best way out seems to use
Hetero Policy 3
Hetero Thread Policy AllProcessors
This will schedule first all P-cores and then will overflow the rest to the E-cores which should give you a better mix than the default random shuffling which sometimes happens with the default profiles. Background processes are not unimportant to the user just because they have lowe priority. Perhaps they are running the video AI enhancement or compile step in the background just because they do not want to freeze your machine and lower the process priority.
Windows 11
So far I have talked mostly about Windows 10. What about Windows 11? The results are much less dramatic thanks to the Intel Thread Director. From the numbers it looks like there is not much difference between the Hetero settings at all. The only impactful setting which is strictly adhered to is the Hetero Thread Policy EfficientProcessors which will confine the load entirely on the E-cores. A deeper inspection will need to wait for another day. Since this beast is quite smart, it detects specific cpu instructions like pause/mwait and other things which consume cycles but do not need to be fast which can safely be put to E-cores, a lot of different workloads would need to be tested which is outside the scope for a simple blog post.
Conclusions
I do not want to believe. I want to measure things and verify if the claims are true. Profiling with ETW is extremely helpful to find later what did cause unexpected anomalies during measurements. In total I did generate 600 ETW files with a size of 90 GB. After extraction with ETWAnalyzer 800 MB of Json files were left, which are much quicker to query than to open and look at 600 files with WPA. To find e.g. all .etl files which have scheduling issues with specific settings I just need to query all LoadStress processes and filter for processes which have CPU > 115 s. All files with smaller CPU will show up as 0 CPU which are the interesting ones. Good naming of the files is essential to make such an approach work. But once you have the data it is easy:
EtwAnalyzer -dump CPU -pn LoadStress -ShowTotal Total -fd *_Normal* -MinMaxCPUMs 115s
That simple query immediately shows that the Windows 11 thread scheduling is rather unspectacular in terms of thread contention. The only case where thread contention happens is when the load is confined to efficient processors. That would be the case for our load because we use 12 threads, but we have only 8 E-cores.
Power profile settings are a deeply underexplored topic where are many myths are circulating in the (gaming) forums. I hope that the provided tooling will serve as a foundation for future quantitative measurements performed by other people. Many people seem to like to change some switches and believe they have solved a performance problem. Without measuring you will never be able to tell. Your gut feeling is not exact enough to serve as foundation to solve performance issues. Only by measuring and visualizing the previously unknown, you will find new patterns and correlations which will help to understand your system better. Your mental performance model is always too simple compared to what is really happening inside your computer.
That’s all for today. Now start measuring at your own!
Starting with Windows 10 KB 5025221 I had issues in Outlook to attach files to mails. Drag and Drop did become unreliable and would sometimes silently do nothing. When I tried the Attach File dialog it would most of the time work, but sometimes fail with:
That was annoying but not really problematic. More serious was that somtimes I could not open attached PDF files. This resulted in a file not found error box. The issue was sporadic enough and an Outlook restart did always fix the issue. But sometimes I did end up with empty files
which was not so good. Also some mails the attachements (PDFs) were set to 0 bytes which was a real data corruption.
Reason enough to drill deeper what was going on. The first thing I would take is a procmon trace. If someone interferes (Antivirus is always a good reason) we should see some access denied or other file deletion events there. After a first look I did not find a single failed IO operation in the procmon traces which was strange.
At least I did learn how Outlook works. When you double click an attachement in Outlook it copies the data from the mail database .pst/.ost file to a temp file located in
Then it copies the file again into …\Src (002).pdf which is the file which is then opened by the actual PDF reader.
When the reader is not a reader but a writer then Outlook will check if the second copy is different to the first copy and will ask if you want to save the changes in Outlook. If you do this manually you still can go back by asking Outlook for a earlier version which is still present. So far so good. But in the corrupted mails no previous go back button was present.
Drag and drop a file from Explorer into a new mail works similar. First the file is copied to %temp%\xxxx.tmp from there it is copied again to C:\Users\%username%\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\????\YourFile.pdf and then it is finally stored in your mail database .pst/.ost file.
Outlook is copying quite a few files around which is the reason why it uses the default Windows CopyFile method. When looking deeper with ETW tracing I could compare a good vs a bad attach run which did differ in the duration of the CopyFile operation which mysteriously did fail and did run much shorter than it did in the good case
The good thing is since a few month Microsoft delivers public symbols also for its office products which read much better than Ordinal+xxxx stack traces. Since it is not possible to see more with ETW profiling one would need to debug Outlook. The good thing is that this is possible now with the recently released new Windbg which allows you to attach to a running process and capture executed CPU instruction and accessed memory. The way how this works is fascinating. At debugging start a full memory dump is taken, then CPU tracing is enabled which captures the complete CPU instruction stream. The end result is not a static memory dump but a memory dump where you can debug forward and backwards! For more infos see https://aloiskraus.wordpress.com/2017/09/26/it-is-time-for-time-travel/. With that I could record a failed attach incident and send to MS support for further analysis where it did fail. The answer was that a regression did happen in the mentioned KB update which did kick in when the following conditions did occur:
A pointer to NtCreateFile with a pointer address > 2 GB is passed
When these things did happen the CreateFile call did bail out early even before procmon or other filter drivers where able to catch the return code of the failed CreateFile call. Quite a nasty thing which also explains why it did take so long to find out.
With that knowledge we can examine the issue a bit further. We can write a small test application which calls CopyFile in a 32 bit process which has allocated more than 2 GB of memory. The easiest way to achieve this is to use a few lines of C# like this
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
namespace FileCopyBug
{
internal class Program
{
static void Main(string[] args)
{
if (args.Length < 2)
{
Console.WriteLine("FileCopyBug src dest [-noalloc]");
return;
}
Queue<string> queue = new Queue<string>(args);
bool bAlloc = true;
string src = null;
string dest = null;
while (queue.Count > 0)
{
string arg = queue.Dequeue();
if (arg == "-noalloc")
{
Console.WriteLine("Skip allocations.");
bAlloc = false;
}
if (src == null)
{
src = arg;
}
if (dest == null)
{
dest = arg;
}
}
if (bAlloc) // allocate > 2 GB of memory to force pointers > 2 GB
{
int n = 0;
const int ChunkBytes = 500; // allocate small to not leave holes
const int nMax = 4600000;
while (++n < nMax)
{
Marshal.AllocHGlobal(ChunkBytes);
}
}
Console.WriteLine($"Allocated memory: {Process.GetCurrentProcess().PrivateMemorySize64:N0} bytes");
// Call KernelBase.dll CopyFile which exhibits the bug when pointers > 2 GB are used for Extended Attributes to NtCreateFile which will bail out with an access violation c0000005 as NTSTATUS return code
File.Copy(args[0], args[1], true);
}
}
}
The compiled binary and code can be download from my OneDrive here. If you are running Windows 10 with the latest Updates you can try to copy one file to a new location just like with any copy tool:
c:\tmp>FileCopyBug.exe src.pdf dsg.pdf
Allocated memory: 2,485,137,408 bytes
Unhandled Exception: System.IO.IOException: Invalid access to memory location.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite, Boolean checkHost)
at System.IO.File.Copy(String sourceFileName, String destFileName, Boolean overwrite)
at FileCopyBug.Program.Main(String[] args)
That is really an Access Violation code which was returned by NtCreateFile which was correctly returned by CopyFile. When you add the parameter -noalloc then no additional memory is allocated and the copy operation will work. I have a machine with the Jan Win 10 version around where the FileCopy issue is not present. This also works on Windows 11.
Since we have come so far we can debug now our test application directly and see where this leads.
The steps are
Create a breakpoint for CopyFile (bp kernelbase!CopyFileW)
Run until we enter CopyFile (g)
Create a breakpoint for NtCreateFile (bp ntdll!NtCreateFile)
Run until we hit NtCreateFile for the source file (g)
Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
Leave method to check return code (gu), return code is in eax. 0 is success.
Run until we hit NtCreateFile for the destination file (g)
Print file name dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
Leave method to check return code (gu), return code is in eax with c0000005 which is the error we are after.
0:001> bp kernelbase!CopyFileW
0:001> g
Breakpoint 0 hit
Time Travel Position: FBCD:1ED4
eax=28c67690 ebx=02af91b4 ecx=82e96558 edx=02b50000 esi=5a13bdb0 edi=02afa810
eip=76a137f0 esp=02af5028 ebp=02af52b8 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200206
KERNELBASE!CopyFileW:
76a137f0 8bff mov edi,edi
0:006> bp ntdll!NtCreateFile
0:006> g
Breakpoint 1 hit
Time Travel Position: FBCE:72E
eax=02af4900 ebx=00000000 ecx=00000005 edx=02af474c esi=80000000 edi=80000000
eip=77c13000 esp=02af45ac ebp=02af4f8c iopl=0 nv up ei ng nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200282
ntdll!NtCreateFile:
Unable to load image C:\Program Files (x86)\Microsoft Office\root\Office16\AppVIsvSubsystems32.dll, Win32 error 0n2
77c13000 e96b11cfdf jmp AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
+0x000 Length : 0x18
+0x004 RootDirectory : (null)
+0x008 ObjectName : 0x02af4864 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Word\src.pdf"
+0x00c Attributes : 0x40
+0x010 SecurityDescriptor : (null)
+0x014 SecurityQualityOfService : (null)
0:006> gu
Time Travel Position: FBD2:640
eax=00000000 ebx=00000000 ecx=579026f4 edx=02b50000 esi=80000000 edi=80000000
eip=76a1521e esp=02af45dc ebp=02af4f8c iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200246
KERNELBASE!BasepCopyFileExW+0x187e:
76a1521e 898584f9ffff mov dword ptr [ebp-67Ch],eax ss:002b:02af4910=00000000
0:006> g
Breakpoint 1 hit
Time Travel Position: FBED:77
eax=02af2b8c ebx=00000000 ecx=00000005 edx=10000044 esi=c0150081 edi=00002020
eip=77c13000 esp=02af25d8 ebp=02af4598 iopl=0 nv up ei ng nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200286
ntdll!NtCreateFile:
77c13000 e96b11cfdf jmp AppVIsvSubsystems32!vfs_hooks::hooked_NtCreateFile (57904170)
0:006> dt ole32!OBJECT_ATTRIBUTES poi(@esp+3*4)
+0x000 Length : 0x18
+0x004 RootDirectory : (null)
+0x008 ObjectName : 0x02af27a8 _UNICODE_STRING "\??\C:\Users\...\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\XNKF02BJ\src (002).pdf"
+0x00c Attributes : 0x40
+0x010 SecurityDescriptor : (null)
+0x014 SecurityQualityOfService : 0x02af3f18 Void
0:006> gu
Time Travel Position: FBFA:1B8
eax=c0000005 ebx=00000000 ecx=579026f4 edx=02b50000 esi=c0150081 edi=00002020
eip=76a178c2 esp=02af2608 ebp=02af4598 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200246
KERNELBASE!BaseCopyStream+0xd01:
76a178c2 8985e0e5ffff mov dword ptr [ebp-1A20h],eax ss:002b:02af2b78=00000000
The nice thing about the x86 WinAPI calling convention is that all parameters are on the stack. To see all passed parameters to an exported method of the signature which has 11 arguments
__kernel_entry NTSTATUS
NTAPI
NtCreateFile (
OUT PHANDLE FileHandle,
IN ACCESS_MASK DesiredAccess,
IN POBJECT_ATTRIBUTES ObjectAttributes,
OUT PIO_STATUS_BLOCK IoStatusBlock,
IN PLARGE_INTEGER AllocationSize OPTIONAL,
IN ULONG FileAttributes,
IN ULONG ShareAccess,
IN ULONG CreateDisposition,
IN ULONG CreateOptions,
IN PVOID EaBuffer OPTIONAL,
IN ULONG EaLength
);
I can break on method enter and then dump the next 12 DWORDs (the first value is the return address) from the stack pointer esp I can dump directly the passed method arguments:
We are looking at a Windows 10 machine which makes me wonder why I find a FILE_CONTAINS_EXTENDED_CREATE_INFORMATION flag which is according to the docs a Windows 11 only feature. It looks like that Antivirus can scan less when it is clear that this is a local file. It looks like this performance feature was backported to Windows 10 with the April update. To check that we can dump the EaBuffer as a EXTENDED_CREATE_INFORMATION structure.
That sure looks like a valid structure with the EX_CREATE_FLAG_FILE_DEST_OPEN_FOR_COPY flag set. Now we have also found the EaBuffer pointer which is > 2 GB (0x80000000) which is causing NtCreateFile to return (wrongly) an Access Violoation return code.
Conclusions
The issue presented here is not specific to Outlook 32 bit but it can affect all 32 bit applications which make use of CopyFile/MoveFile. I have not studied the effect to Word/PowerPoint/Excel but similar issues might occur. At least you know now how you can find out if you are affected when you use a debugger. An alternative approach would be to enable ETW SysCall tracing which allows you to see every return code of all called kernel methods. Then you can scan for NtCreateFile and a return code of 0xC0000005 which should never happen.
Lets hope that soon a fix is released to make these strange sporadic failures go away.
Sometimes I get ETW traces which cannot be loaded by WPA. Initially I found these on Windows Server 2022 on a VMWare machine. I suspected some issue how VMWare virtualizes things on newer OSes but I did get this error also on my new non virtual Windows 11 box. After asking around it turned out that this is side effect of .NET 7 processes. When a .NET 7 process is running and you use WPR profiles you will get unreadable traces back.
When you try to process them with TraceProcessing Library you get something like
C:\Temp>etwanalyzer -extract all -fd test.etl
1 - files found to extract.
Error: One or more errors occurred. (The request is not supported. (0x80070032)) Check C:\Utils\ETWAnalyzer\ETWAnalyzer_Trace.log for full details, or use -debug switch to get full output.
Error: EventSink {39cda801-59d8-472d-bc00-601a4150c4c4} returned HRESULT 0x80070032 on:
Event#: 294800 (T#0:#294800)
TimeStamp: 6644961, Process: 12172, Thread: 8008, Cpu: 10
ProviderId: {a669021c-c450-4609-a035-5af59af4df18}
EventDescriptor: 0x0096 0x0001 0x2a 0x01 0x00 0x05 0x0000000000020000
UserDataLength: 405
The default wpr profiles are affected, but for some reason my custom WPR Profiles (https://github.com/Alois-xx/FileWriter/blob/master/MultiProfile.wprp) are not. The only way to solve this is to use either custom WPR profiles or to kill all .NET 7.0 processes. But which ones?
With ETWAnalyzer it is easy
Etwanalyzer -fd c:\temp\extract\ManySlowCalls_MAGNON.03-07-2023.23-21-41.json -dump version -dll *coreclr* -versionfilter *7.0*
Ahh PhoneExperienceHost.exe is the bad guy. As workaround all you need to do is to kill PhoneExperienceHost and you should get analyzable ETW traces again.
A future WPA/TraceProcessing version should contain the fix, but until then you need to use one of the workarounds.
From a performance point of view Antivirus scanners are the worst things that have happened to computers during the last ca. 5 years. Sometimes it is hard to differentiate the CPU activity of an AV scanner from a bitcoin miner. Both are eating your CPU to give you either virtual security or virtual money.
Whenever something is mysterious and slow I get a call. In this case a beefy Windows Server was working happily, but a few times a day everything would freeze for up to 5 minutes. The solution to get back to speed was to reboot the server when it got stuck. When the server got stuck, it was at 100% CPU for several minutes. That sounds like an obvious CPU problem. Identify the problematic process and uninstall the app which is causing problems.
To identify the issue ETW Traces were taken. Everything was working fine until the ETW profiling events stopped in the recording. Several minutes later the CPU problem did start. Was someone tinkering with ETW Traces? Shortly before the trace did stop I did see an installer for the Palo Alto Cortex AV solution executing. That was interesting. Normally I record data with xperf which uses the NT Kernel Logger ETW session. Xperf supports another kernel session named Circular Kernel Context Logger session which is always there since at least Windows 7. When running ETW tracing for a longer time I record with xperf into a file which is stopped every few minutes. Profiling is then restarted (no merging of the files is done). That way I can later choose from the many files the interesting ones when the issue is reported again. Although I default to 10ms sample rate one needs still up to 40 GB of disk space to record a full day.
The Issue occurred again
Below is a picture of total CPU consumption over a period of two minutes. Green is Idle (means CPU is free) and red is the CPU consumed by the Windows Sleep API call which needs ca. 60% for all cores for more than one minute. Why is sleeping consuming so much CPU?
There is a specific workload which occurs only on many core servers: Spinlocks are locks which try to burn some CPU on all cores before all spinning (= trading CPU for decreased latency) threads enter the expensive OS lock. The usual implementation of a Spinlock uses several stages
Check if lock can be acquired
If not
For(i..n)
Burn some cpu cycles by calling pause instruction in a second loop
Check if lock can be acquired and exit loop
Call Sleep(0) to give another thread to do some useful work
Check if lock can be acquired and exit loop
Enter OS lock
Lock is acquired
Some high performance threading libraries like the Intel OpenMP library use excessive amounts of CPU spinning for up to 200 ms to almost never take a lock to get, at insane power consumption costs, a few microseconds of faster algorithm execution. This means if you let one empty task run every 200 ms you will get this CPU graph when using the Intel OpenMP library:
Sure, your CPU looks super busy, but did you perform much useful work? Not at all. The Intel OpenMP library makes it look like you are doing heavy number crunching and you need a bigger processor to take that load.
If you set the environment variable KMP_BLOCKTIME=0 then you get a flat CPU profile as it should have been always the case. The decision of Intel to waste 200 ms of CPU cycles by default is highly questionable, and no one, after profiling the CPU consumption, would leave it at the default value.
What has this to do with Virus scanners? They have the habit to intercept system calls in creative ways, by using OS callbacks, application detouring and ETW. E.g Windows 11 23H2 has instrumented many rarely used system calls with ETW trace points to make it easier to detect unusual application activity.
Interlude – SysCall Tracing
What can you do if you want to see every operating system call performed by an application. On Linux you can run the application with the strace tool to get a nice console output of all called operation system calls. On Windows no comparable tool exists. But there is a kernel ETW provider to enable SYSCALL tracing which not many people are aware of.
In a wpr profile you enable a SYSCALL Tracing by adding the SystemCall keyword and if you want you can even get the stacktrace for each method call (warning high data volume!). The logged events alone are not that useful, because they are missing the process and thread which did issues the call. To make these events useful you need to record also Context Switch events which record how long a thread was running on a given CPU.
Below is a sample snippet of a typical .wprp recording profile for wpr to enable SysCall tracing:
A full SysCall recording profile is located in the FileWriter project. To use it, download the MultiProfile.wprp with curl which is part of Windows. Then you can use it like this:
The merge step is needed to get method names, otherwise you just get addresses in the resulting trace. You can now open the Trace with WPA (Windows Performance analyzer). Under System Activity – Syscalls a new graph pops up when you load such a trace which allows you to analyze the call duration, frequency, return value and other things:
Unlike strace of linux you cannot get the method input argument but you can get at least the return code which is better than nothing.
We can dump the data also with the help of the TraceProcessing Library with a few lines
usingMicrosoft.Windows.EventTracing;
usingMicrosoft.Windows.EventTracing.Symbols;
usingMicrosoft.Windows.EventTracing.Syscalls;
usingSystem.Diagnostics;
namespacestrace
{
internalclassProgram
{
staticvoidMain(string[] args)
{
var processor = TraceProcessor.Create(args[0], new TraceProcessorSettings
{
AllowLostEvents = true,
});
string symServer = "*http://msdl.microsoft.com/download/symbols";
if (args.Length > 1 && args[1] == "-nosymserver")
{
symServer = ""; // Loading MS symbols makes loading the trace much slower. Once loaded with symbols you can omit them
}
var symbols = processor.UseSymbols();
var syscalls = processor.UseSyscalls();
var sw = Stopwatch.StartNew();
Console.WriteLine("Loading trace");
processor.Process();
Console.WriteLine($"Trace loaded in {sw.Elapsed.TotalSeconds:F0} s");
sw = Stopwatch.StartNew();
Console.WriteLine("Loading symbols");
SymbolPath symPath = new($@"SRV*C:\Symbols{symServer}");
SymCachePath symCache = new(@"C:\SymCache");
symbols.Result.LoadSymbolsAsync(symCache, symPath).Wait();
Console.WriteLine($"Symbols loaded in {sw.Elapsed.TotalSeconds:F0} s");
foreach(ISyscall call in syscalls.Result.Syscalls)
{
if( call.ReturnValue == null) // this was a syscall enter call
{
continue;
}
string? funcName = call?.StackFrame.Symbol?.FunctionName;
string? image = call?.Process?.ImageName;
if (funcName != null && image != null)
{
Console.WriteLine($"{image}({call?.Process.Id}): {funcName} <- 0x{call?.ReturnValue:X}");
}
}
}
}
}
Our much less inferior Windows strace tool can generate this output for e.g. all non succeeded (return code != 0x0) calls when our own strace process did crash because a not existing file was tried to access:
The output is interesting, but without the input arguments we cannot make much progress if the input was e.g. the wrong file or if the failed call was even trying to access our file or something entirely different.
Antivirus Problem
Lets get back to our problem of high CPU in a Sleep call. When we drill deeper into the Sleep call CPU consumption we find in the stacktrace (the numbers are in ms)
a call to PerfInfoLogSysCallEntry and PerfInfoLogSysCallExit. The CPU consumption of SleepEx is not our problem if we run the numbers. Due to ETW Tracing the costs of SleepEx are 33 times higher than originally!
But that is still not the full story. Inside the ETW logger we find a call to tedrdrv.sys for which we do not have symbols. The vast majority of costs are coming from this driver which calls into ExfAcquire/ReleaseRundownProtection which is part of an API which was added to support Antivirus drivers to be able to safely update itself at runtime.
With ETWAnalyzer we can dump the module information for all dlls which have no symbols (= most likely non MS dlls/drivers) which are showing up in expensive stack traces.
EtwAnalyzer %f% -Dump CPU -TopN 5 -ShowTotal File -ProcessName !*idle* -SortBy CPU -TopNMethods 2 -MinMaxCPUms 100000 -Methods *.dll;*.sys -ShowModuleInfo
ETWAnalyzer has an internal list of nearly all Antivirus Drivers. If a driver in the well known list is present in the output it is showing up as yellow text. We find with -ShowModuleInfo (-smi) that tedrdrv.sys is indeed the Palo Alto driver which has intercepted ETW traces, which causes high CPU consumption. The plot is thickening but we still do not have a smoking gun. When profiling is stopped my script always captures the output of xperf -Loggers to see which ETW loggers were running to check if profiling was running properly shortly before we stop our session. The output for NT Kernel Logger is the smoking gun:
Logger Name : NT Kernel Logger
Logger Id : ffff
Logger Thread Id : 0000000000003DC4
Buffer Size : 8192
Maximum Buffers : 118
Minimum Buffers : 96
Number of Buffers : 118
Free Buffers : 70
Buffers Written : 224156
Events Lost : 0
Log Buffers Lost : 0
Real Time Buffers Lost: 0
Flush Timer : 0
Age Limit : 0
Log File Mode : Secure PersistOnHybridShutdown SystemLogger
Maximum File Size : 0
Log Filename : PaloNull
Trace Flags : SYSCALL
PoolTagFilter : *
Remember that the initial ETW Traces were fine until we did not get profiling events? Now we have found the reason. Palo Alto is enabling SysCall tracing by using the NT Kernel Logger ETW session which was already in use by us. Not nice.
Now we have all data to understand the high CPU issue
When multithreaded code is running which uses Spinlocks Sleep(0) is called on all cores in a tight loop. This is the way how Spinlocks work as explained earlier. Since Palo Alto wants to know which kernel methods all applications are calling they enable ETW tracing and hook inside the kernel the ETW trace call to get easy access to that information. Unfortunately the AV driver needs to access some protected kernel object via a “lock” with ExfAcquire/ReleaseRundownProtection. This “lock” is implemented as interlocked memory accesses which are ok if not many threads are accessing that memory lane. This assumption is no longer true in a heavily multithreaded application. The CPU costs of Sleep(0) are now >33 time higher which causes the server to go to 100% CPU consumption for unusual long times (minutes) which makes users wanting to reboot that machine. The server is not responding, but it miraculously recovers again if you patiently wait just long enough.
Conclusions
Antivirus solutions, multithreaded software and many core machines are a deadly mix for performance. The idea to intercept SysCalls might work decently on small machines, but if AV vendors try the same trick on server machines with multithreaded software things fall apart. I hope some AV vendor reads this post and change their mind about the SysCall tracing approach. Since consumer machines get more and more cores this problem will become bigger over the next years. Please stop intercepting SysCalls and let the machine do its work.
In the past AV did get only in the way when files were accessed or processes were started. That can be dealt with configuring the AV solutions to exclude certain directories/processes to limit its impact. But if compute bound workloads are affected in general, what should one do, except to uninstall that AV software? I have bad experiences with configuring AV solutions, because very often the overhead is still there, although the configuration UI says that all is turned off.
You can figure out the impact of AV with ETW but it takes quite some time and a lot of data to get to that point. For the average user it is impossible to find such issues, except if he can do some A/B testing with/without the AV solution. That works, but if the next AV update brings in new tracing capabilities you will have a hard time to figure that one out without ETW. It should not be so hard to quantify the impact of AV solutions given the number of times they are causing issues.
The loss of productivity due the high performance overhead of AV solutions is hard to quantify, but it is definitely not cheap. In my opinion all AV scanning solutions are trying hard to identify every threat, but fail to properly address the adverse performance effects caused by these additional scanning capabilities.
On the other hand all virus “vendors” will test with all currently used AV scanning solutions to get in undetected. In effect AV scanning solutions will be mostly able to detect past threats, but not the latest release of e.g. malware X. I wonder if it is cost effective to detect past threats with such a high overhead while new malware by definition is still able to get in.
If the scanning settings of AV solutions are tuned towards performance and not only for detection rate our fleet of > 1 billion of Windows computers would consume Gigawatts less of electrical energy (conservative estimate: >1 W of additional energy * 10^9 ). This is a global issue which is not on anyone’s agenda because it is so well hidden deep inside your computer which works but consumes more CPU than necessary. Many people talk about the negative environmental effects of Bitcoin mining, but no one talks about the environmental effects of AV scanning overhead.
I hope I have answered the mystery of a stuck server and raised new questions if it is really ok to turn on every AV scanning feature to stay safe. The tradeoff between computer safety and power consumption should be discussed more by security experts to arrive at decisions which are more balanced than it is currently the case. A wild idea would be to have a virtual Watt Meter for every wasted Watt consumed by AV scanners to make this cost explicit. You can only act on things you can measure. This is currently not the case.
The last test is from 2019 which is still accurate but the world is changing and we have arrived at .NET 7.0 which is reason enough to spin up my test suite again and measure from .NET 4.8, 3.1, 5.0, 6.0 up to 7.0. The “old” articles are still relevant despite their age.
I did write the SerializerTests test suite which makes it easy to test all these serializers. If you want to measure for yourself, clone the repository, compile it, execute RunAll.cmd and you get a directory with CSV files which you can graph as you like.
Tooling is getting better and I have also added profiling support so you can check oddities at your own if you call RunAll.cmd -profile you get ETW traces besides the CSV files. This assumes you have installed the Windows Performance Toolkit from the Windows SDK.
Serialize Results
The winner in terms of serialize speed is again FlatBuffer which still has the lowest overhead, but you need to go through a more complex compiler chain to generate the de/serialization code.
*Update 1*
After discussions about the fairness how FlatBuffer should be tested I have changed how FlatBuffer performance is evaluated. Now we use an existing object (BookShelf) which is used as input to create a FlatBuffer object. Additionally it was mentioned that some byte buffer should also be part of the tested object. This moves FlatBuffer from rank 1 to 15, because the new test assumes you have an object you want to convert to the efficient FlatBuffer binary format. This assumption is reasonable, because few people will want to work with the FlatBuffer object directly. If you use the FlatBuffer object directly then the previous test results still apply.
The layout of the serialized Book object include
int Id
string Title
byte[] Payload Before it was null now it contains 10 bytes of data
Our BookShelf object looks as serialized xml like this
Now we serialize nearly 40% of binary data which contains just printable ASCII characters. This “simple” change moves the landscape quite a bit: GroBuf is first and MemoryPack second. This contains also an update from MemoryPack from 1.4.4 -> 1.8.10. If GroBuf has less overhead (but double the serialized size because it stores strings as Utf-16 while most others convert the data to Utf-8), or if MemoryPack has some regression I cannot tell. To reproduce the tests you can run the tests with
RunAll.cmd 10
to use a 10 byte buffer payload for each Book object. If you omit the 10 then you will serialize with 0 bytes of payload. The combined csv file SerializationPerf_Combined_10.csv contains as suffix how much payload per book object was used.
*Update 1 End*
The new entry MemoryPack is second which is the newest serializer by Yoshifumi Kawai who seems to write serializers for a living. He has several entries in this list namely
MessagePackSharp
Utf8JsonSerializer
His latest MemoryPack library shows that you still can get faster once you know what things cost and what you can avoid without becoming too specialized. This one has pretty much all features one could wish for including a Source Generator for Typescript to read the binary data inside the browser or wherever your Typescript code is running.
Additionally he provided a helper to compress the data with Brotli, a very efficient compression scheme which competes with Bois_LZ4 which is currently the only serializer which supports also compression of the data. The Brotli algorithm is nice but there is zstd from Facebook which should be a bit faster which would make a nice addition perhaps in the future?
That is really impressive!
From a higher level we find that the old .NET Framework is getting behind in terms of speed and we see a nice step pattern which proves that every .NET Framework release has become faster than its predecessor.
Binary is good, but Json has also a large fan base. Microsoft did add with .NET 7.0 Source Generators to generate the serialization code at compile time for your objects. See https://github.com/Alois-xx/SerializerTests/blob/master/Serializers/SystemTextJsonSourceGen.cs how you can enable code generation for serialization. This feature generates code only for serialize but uses pregenerated metadata for deserialize. The feature to add pregenerated deserialization code is tagged as future (post .NET 8?). If you think this is important you can upvote the issue. The main reason to add this was not only performance but also AOT scenarios where you have no JIT compiler.
With pregenerated serialization code in SystemTextJsonSourceGen our BookShelf object is 20+% faster serialized compared to SystemTextJson which uses the “normal” runtime generated code. A significant part of the performance improvement seems to be related to directly write UTF8 without the need to convert an UTF-16 char memory stream to an UTF-8 byte stream.
Deserialize Results
*Update 1*
We access after deserialization of the FlatBuffer object all properties which cause the FlatBuffer struct to create a new string or byte[] array on each access again. AllocPerf does not use a binary blob yet. The numbers are therefore identical.
Even with these changes FlatBuffer wins the deserialize race with some margin. The second place goes to GroBuf instead of MemoryPack. But these are really close and had before been fighting for the second place. As already mentioned before GroBuf doubles the serialized data size which allows it to omit the Utf-16 to Utf-8 conversion which saves some speed at the expense of data size.
*Update 1 End*
Normally data is more often read than written. Deserialization needs to allocate all the new objects which makes it largely GC bound. This is the prime reason why deserialize can never be faster than serialize because it needs to allocate new objects while in the serialize case these are already existing.
The fastest serializer is AllocPerf which is not a serializer, but an allocation test which walks over an array of binary data without parsing and constructs strings out of the data. This is a test how fast one could get without any parsing overhead.
MemoryPack sets the mark again which is nearly overhead free. With so little overhead there must be a downside? Yes you loose versioning support to some extent.
The follower after MemoryPack is BinaryPack which is not even trying to be version tolerant and can be used for simple caches but other than that I would stay away from it. But beware it is in maintenance mode!
Another notable change in Protobuf_net was the deprecation since 3.0 of the AsReference attribute which many people are missing. This made it easy to deserialize object graphs with cyclic dependencies, which is with Protobuf_net 3.0 no longer the case.
The now in maintenance mode Utf8JsonSerializer is still 35% faster than SystemTextJson. There is still room for improvement for .NET.
First Startup Results
Besides the pure runtime performance startup costs are also important. That is more tricky to measure since you not only need to pay the costs of reflection, code gen, JIT time for the precompiled (Cross/Ngen) and not precompiled scenario inside a fresh executable. If you do not use a new executable each time you would attribute the costs for first time init effects to the first called serializer in a process which would introduce systematic bias. SerializerTests starts for each serializer a new executable which is started three times (last time is added to CSV) to warmup the loaded dlls to not induce expensive hard page faults which are served from the disk, but from the file system cache. Then it serializes 4 different object types with one object and takes the maximum value which should be the first call. These details allow one to check if just the first code gen is so expensive and subsequent code gen is much cheaper because many one time init effects already have happened.
The old .NET Framework 4.8 looses by a large margin here, but only if Ngen is not used. If the images are Ngenned then it is not too bad compared to crossgenned .NET core binaries.
Is this the full story? Of course not. We are only measuring inside a small loop long after many things have happened, but since precompilation will cause all dependent dlls to be eagerly loaded by the OS loader we should better look at the runtime of the process as a whole. This can show interesting … gaps in our testing methodology.
To look deeper you can run the test suite with ETW profiling enabled with
RunAll.cmd -profile
to generate ETW data while executing the tests. You need to download Windows Performance Toolkit to generate the profiling data and ETWAnalyzer to easily query the generated data. You get from the RunAll.cmd command a bunch of CSV files, a combined file CSV file and some ETW files.
When you put ETWAnalyzer into your path you can extract the ETW file data into Json files which are put into the Extract folder. To convert the ETL files into Json files you add to the ETWAnalyzer command line -Extract All, the folder where the ETL files are located and a symbol server e.g. MS which is predefined. You can also use NtSymbolPath to use your own Symbol search path defined by the _NT_SYMBOL_PATH environment variable.
ETWAnalyzer -Extract All -FileDir D:\Source\git\SerializerTests\SerializerTestResults\_0_14_06 -SymServer MS
The extracted Json files have a total size of 112 MB vs 5 GB of the input ETL files which is a good compression ratio! Now we can generate a nice chart with an Excel Pivot Chart to look how long each of the 3 invocations of the startup tests did run:
And the winner is .NET 4.8! That is counter intuitive after we have measured that .NET 7 is faster everywhere. What is happening here? This chart was generated with input from ETWAnalyzer to dump process start/stop times command line and their duration into a CSV file:
ETWAnalyzer.exe -dump process -cmdline *firstcall#* -timefmt s -csv FirstCall_Process.csv
ETWAnalyzer can also dump the CPU consumption to console so we can drill deeper
Although the command line syntax looks complex it is conceptually easy. You first select what to dump e.g. CPU, Process, Disk, … and then add filters and sorting. Because ETW records a lot of data you get a lot of sorting and filter options for each command of ETWAnalyzer which may look overwhelming but it is just filtering, sorting and formatting after you have selected what to dump.
The following call dumps process CPU consumption for all processes with a command line that contains
firstcall and cross and datacontract to select the .NET Core crossgenned processes OR
firstcall and ngen and datacontract to select the .NET 4.8 Ngened SerializerTests calls
-ProcessFmt s prints process start stop times in ETW recording time (time since session start) and the process duration. You can format also your local time, or utc by using local or utc as time formats.
EtwAnalyzer -dump cpu -cmdline "*firstcall* cross*#datacontract*";*firstcall*ngen*#datacontract* -ProcessFmt s
After drilling into the data it became apparent that since .NET Core 3.1 we are now having two additional serializers (MemoryPack and BinaryPack) which both have significant startup costs. These costs are always payed even if you never use them! I have created a branch of SerializerTests (noBinaryMemoryPack) to remove them so we can compare .NET 4.8 and .NET Core+ with the same number of used dependencies.
Each .NET version has two bars, the lower one is crossgen and the higher one involves JIT compilation of the binaries. Since each process is started 3 times we see in the y-axis the sum of 3 process starts. NET 7.0 seems to have become a bit slower compared to the previous versions. The winner is still .NET 4.8 (455 ms) but at least the difference is not as big as before (558ms vs 972ms).
It is very easy to fool yourself with isolated tests which do not measure the complete (system) picture. Measuring something and drawing a picture is easy. Measuring and understanding what the data means in full details is a much bigger endeavor.
Cool Console Tools
I work a lot with command line tools which produce large width output. The cmd shell has the nice feature to configure a horizontal scroll bar by setting the console width to e.g. 500. This works but newer shells like Windows Terminal are becoming the norm. It works like Linux shells which never had support for horizontal scroll bars. The issue to add a scroll bar is denied https://github.com/microsoft/terminal/issues/1860 which is a pity. But there is an open source solution for it. A little known terminal extension named vtm is offering all features I need. When configuring my command prompt with vtm -r cmd
I get a Windows Terminal window which supports horizontal scrolling, PlainText, RTF, HTML text copy, Ctrl+V (Ctrl+C not yet) and Unix style copying with right mouse click. This makes working with ETWAnalyzer much easier and you can also properly read the help which is also wide output …
EtwAnalyzer -dump CPU -ProcessName SerializerTests.exe -ShowModuleInfo -TopN 1 -NoCmdLine
The command dumps of all extracted files in the current directory the SerializerTests executable with the highest CPU consumption and the module information which includes file version, product name, and directory from where it was started. That can be useful to check if e.g. your local IT has installed the updated software to your machine, or you need to call again to get the update rolled out. By making such previously hard to get information easily queryable I have found that for some reason SerializerTests was running under false flag as protobuf product and since a long time without any file version update. Both issues are fixed, because ETWAnalyzer made it very obvious that something was wrong.
Conclusions
This is the end of this post, but hopefully your start to make a better informed decision which serializer is best suited for your needs. You should always measure in your environment with your specific data types to get numbers which apply to your and not my still synthetic scenario. When you execute RunAll.cmd dd you add to each Book object a byte array with dd bytes which may change the performance landscape dramatically. This can simulate a test where you use a serializer as wrapper over another already serialized payload which did already generate a byte array. That should be cheap, in theory. But when larger byte arrays were used with the fastest Json serializer (UTF8Json) it became one of the slower ones. I tried hard to get the numbers right, but you still need to measure for yourself to get your own right numbers for your use case and scenario.
Performance numbers are not everything. If essential features are there, and you trust the library author that he will provide support over the foreseeable future, only then you should consider performance as next item on your decision matrix to choose a library for mission critical long lived projects. If you are running a pet project, use whatever you seem fit.
Networking is hard and many things can go wrong. I have bought an Oculus Quest VR headset for my daughter. Soon I have found it frustrating to connect it via an USB cable to my PC. To get rid of the cable I have bought a cheap TP-Link Archer C5 AC1200 router which I connect via my LAN cable to the PC. After becoming wireless I can send realtime VR images over my dedicated WIFI to the VR headset with stable 150 MBit/s.
The “real” WIFI (yellow) which is connected to the internet is located in another room which is not capable to transfer 150 MBit/s without any interrupts which would kill the VR experience. That is the reason why nearly everyone recommends to use either an USB cable (inconvenient) or a LAN cable (blue) connected to a dedicated WIFI router for the VR headset.
That setup works great, but I have noticed that whenever I have the second router running that browsing the internet would feel very slow. Every page refresh was taking a lot longer compared to when I did turn the second WIFI router off.
Since all browsers have great development tools I just needed to press F12 to see what was blocking my browser requests. In the example below I have tried to access a far far away web site which shows in the network tab for the first request a delay 10,55s for host name resolution (Dns).
On my box these timings were always around 10s so I knew that I did suffer from slow Dns queries to resolve the host name e.g. http://www.dilbert.com to the actual IP address. Knowing that page load times are blocked by 10+ seconds by Dns is great, but now what? I was suspecting that, since I have two network tabs in Task Manager, that the DNS query was first sent via the LAN cable to a dead end because my Oculus is not connected to the internet and then via WIFI to the actual internet connected router.
Initially I tried the route print command to see where why packets were routed, but that looked too complicated to configure the order for every local route. After a bit more searching I have found the powershell command Get-NetIPInterface which shows you over which network interface Windows sends network packets to the outside world. The relevant column is InterfaceMetric. If multiple network cards are are considered the network interface with the lowest InterfaceMetric number is tried first.
As expected I have found my local Ethernet connection with InterfaceMetric 5 with the preferred connection to send network packets to the outside world. But what should be the other network? The easy way out is to cause some traffic in the browser and check in Task Manager in the Performance Tab which network gets all the traffic:
I need therefore to change the order vEthernet (New Virtual Switch) with metric 15 and interface id 49 and vEthernet (New Virtual Switch) with metric 25 and interface id 10.
These two commands did switch the order of the two relevant networks (ifIndex 10 and 49). How fast are things now? Internet surfing feels normal and the change of switched network ordering did save the day. There are still Dns delays but nowhwere in the 10s region. Case solved.
The average reader can stop reading now.
What follows is a deep dive into how Dns works on Windows and how you can diagnose pretty much any Dns issue with the builtin Dns tracing of Windows with the help of ETWAnalyzer.
So far we have just looked at Dns as a black box, but you can trace all Dns requests with the help of ETW (Event Tracing for Windows) and the Microsoft-Windows-DNS-Client ETW provider. The easiest way to record the data is to download a recording profile like Multiprofile.wprp, download the file e.g. via curl from the command line
Good naming is hard, but if you later want to revisit your recorded files you should try to come up with a descriptive name.
After recording the data you can extract a Json file with ETWAnalyzer by extracting everything (-extract all) or just Dns data with -extract DNS
C:>EtwAnalyzer -extract all -fd c:\temp\SlowBrowsing_Skyrmion.etl -symserver ms
1 - files found to extract.
Success Extraction of c:\temp\Extract\SlowBrowsing_Skyrmion.json
Extracted 1/1 - Failed 0 files.
Extracted: 1 files in 00 00:02:40, Failed Files 0
ETWAnalyzer supports many different queries for e.g. Disk, File, CPU and other things. Since 2.4.0.0 support for dumping Dns Queries was added. To show the aggregated top 5 slowest Dns queries you just need -fd (File or Directory) and the file name of the extracted data and -dump Dns to dump all Dns queries. To select the top 5 add -topn 5 and you get this output per process printed:
C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -fd SlowBrowsing_Skyrmion.json
02.10.2022 17:47:32 SlowBrowsing_Skyrmion
NonOverlapping Min Max Count TimeOut DNS Query
Total s s s #
OVRServer_x64.exe(6624) 124 128
12,038 s 0,000 s 12,038 s 4 1graph.oculus.com
backgroundTaskHost.exe(6692) -ServerName:BackgroundTaskHost.WebAccountProvider
12,043 s 0,000 s 12,043 s 2 1login.microsoftonline.com
firefox.exe(13612) 12,047 s 0,021 s 12,047 s 2 1d21lj84g4rjzla.cloudfront.net
12,052 s 0,000 s 12,052 s 2 1loadus.exelator.com
12,052 s 0,000 s 12,052 s 2 1cdn.adrtx.net
Totals: 60,232 s Dns query time for 12 Dns queries
We find that we have indeed some slow Dns queries which all did time out after ca. 12s. The magenta named processes are issuing all of the following DNS queries. To see more we can add -details to see the timing of every slow DNS Query
C:\temp\Extract>EtwAnalyzer -dump DNS -topn 5 -details -fd SlowBrowsing_Skyrmion.json
02.10.2022 17:47:32 SlowBrowsing_Skyrmion
NonOverlapping Min Max Count TimeOut DNS Query
Total s s s #
12,038 s 0,000 s 12,038 s 4 1graph.oculus.com
2022-10-02 17:47:38.416 Duration: 0,000 sOVRServer_x64.exe(6624) DnsAnswer:
2022-10-02 17:47:38.417 Duration: 0,000 sOVRServer_x64.exe(6624) DnsAnswer:
2022-10-02 17:47:38.417 Duration: 12,038 sOVRServer_x64.exe(6624) TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
2022-10-02 17:47:38.417 Duration: 12,038 sOVRServer_x64.exe(6624) TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: 2a03:2880:f21c:80c1:face:b00c:0:32c2;::ffff:31.13.92.49
12,043 s 0,000 s 12,043 s 2 1login.microsoftonline.com
2022-10-02 17:47:42.020 Duration: 0,000 sbackgroundTaskHost.exe(6692) DnsAnswer:
2022-10-02 17:47:42.020 Duration: 12,043 sbackgroundTaskHost.exe(6692) TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:40.126.32.74;::ffff:20.190.160.14;::ffff:40.126.32.134;::ffff:40.126.32.72;::ffff:20.190.160.22;::ffff:40.126.32.136;::ffff:40.126.32.138;::ffff:40.126.32.68
12,047 s 0,021 s 12,047 s 2 1d21lj84g4rjzla.cloudfront.net
2022-10-02 17:47:54.203 Duration: 0,021 sfirefox.exe(13612) DnsAnswer: 13.227.153.112;13.227.153.94;13.227.153.81;13.227.153.66
2022-10-02 17:47:54.224 Duration: 12,047 sfirefox.exe(13612) TimedOut: 192.169.0.1:53 Servers: 0.0.0.0;192.169.0.1;fe80::1 DnsAnswer: 2600:9000:225b:fc00:1e:b6b1:7b80:93a1;2600:9000:225b:3600:1e:b6b1:7b80:93a1;2600:9000:225b:2e00:1e:b6b1:7b80:93a1;2600:9000:225b:f000:1e:b6b1:7b80:93a1;2600:9000:225b:7200:1e:b6b1:7b80:93a1;2600:9000:225b:a400:1e:b6b1:7b80:93a1;2600:9000:225b:9200:1e:b6b1:7b80:93a1;2600:9000:225b:dc00:1e:b6b1:7b80:93a1
12,052 s 0,000 s 12,052 s 2 1loadus.exelator.com
2022-10-02 17:47:42.091 Duration: 0,000 sfirefox.exe(13612) DnsAnswer:
2022-10-02 17:47:42.091 Duration: 12,052 sfirefox.exe(13612) TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:54.78.254.47
12,052 s 0,000 s 12,052 s 2 1cdn.adrtx.net
2022-10-02 17:47:42.091 Duration: 0,000 sfirefox.exe(13612) DnsAnswer:
2022-10-02 17:47:42.091 Duration: 12,052 sfirefox.exe(13612) TimedOut: 192.169.0.1:53 Servers: 192.169.0.1;fe80::1;0.0.0.0 DnsAnswer: ::ffff:34.248.125.130;::ffff:3.248.26.101
Totals: 60,232 s Dns query time for 12 Dns queries
You might ask yourself. What are NonOverlapping Totals? The ETW provider Microsoft-Windows-DNS-Client instruments dnsapi.dll!DnsQueryEx which supports synchronous and async queries. By default an IPV4 and IPV6 query is started whenever when you try to resolve a host name. If both queries need e.g. 10s, and are started at the same time the observed delay by the user is 10s (Non Overlapping time) and not 20s (that would be the sum of both query times). To estimate the observed user delay for each query ETWAnalyzer calculates the non overlapping query time which should much better reflect the actual user delays. The summary time Totals is also taking overlapping DNS query times into account which should be a better metric than a simple sum.
The details view shows that every slow timed out query has a red TimedOut Dns server name (192.169.0.1:53). This is the Dns server IP for which a Dns query did time out. By looking at the interfaces with ipconfig we can indeed verify that the new TPLink Router was tried first, did time out and after that the next network (WIFI) was tried which did finally succeed.
If you are in an IPV6 network the Dns server IP might be a generic default Link Local IP which can be the same for several network interfaces. In that case you can add -ShowAdapter to show which network interfaces were tried. That should give you all information to diagnose common Dns issues. Of course you can also use WireShark to directly get your hands on the Dns queries, but with WireShark you will not know which process was waiting for how long. The ETW instrumented DnsQueryEx Api will give you a clear process correlation and also which queries (IPV4, IPV6) were tried. To get all details you can export the data to a CSV file (add -csv xxx.csv) and post process the data further if you need to. To select just one process (e.g. firefox) you can filter by process with -pn firefox which is is the shorthand notation for -processname firefox. For more information on the query syntax of ETWAnalyzer see https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/Filters.md.
If you have just Context switch traces but no DNS Client traces not all is lost. If you suspect DNS delays you can still query the thread wait times of all dns related methods with ETWAnalyzer which also calculates for wait and ready the non overlapping times summed for all threads.
EtwAnalyzer -dump cpu -pn firefox -fd WebBrowsingRouterOn_SKYRMION.10-06-2022.19-46-16 -methods *dns* -fld s s
If you have async queries in place, but you know that just one async operation was initiated you can add -fld s s to show the time difference the method was seen first and last in the trace which should give you a good estimate how long that async method was running, if upon completion it was called again (and consumed enough CPU so we could get a sample). For more details please refer to https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpCPUCommand.md.
You can also use one specific method or event as zero point and let ETWAnalyzer calculate first/last timings relative to a specific event which can be useful to calculate e.g. timing of methods relative to a key event like OnLoadClicked or something similar. That advanced feature is explained at https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/ZeroTime.md in much greater detail. It is also possible to correlate traces across machines without synchronized clocks by using different zero times for different files within one query and export the data to a CSV file for further processing.
That was a lot of information, but I hope I have convinced you that you can access previously hard to get information now much easier with ETWAnalyzer. It is not only useful for performance investigations, but it might be also interesting to get an overview which servers were tried to query from your machine for forensic audits.
There are many extreme software development strategies out there. PDD can mean Panic Driven, or Performance Driven Development. The first one should be avoided, the second is a MSR paper dealing with performance modelling techniques. Everyone writes tests for their software because they have proven to be useful, but why are so few people profiling their own code? Is it too difficult?
I have created a C# console project named FileWriter https://github.com/Alois-xx/FileWriter which writes 5000 100KB files (= 500MB) to a folder with different strategies. How do you decide which strategy is the most efficient and/or fastest one? As strategies we have
As expected the serial version is slowest and the parallel versions are faster. The fastest one is the one which uses 3 threads. Based on real world measurements we have found that 3 threads is the fastest solution. OK problem solved. Move on and implement the next feature. That was easy.
But, as you can imagine there is always a but: Are the numbers we have measured reasonable? Lets do some math. We write 500 MB with the parallel version in 11s. That gives a write speed of 45 MB/s. My mental performance model is that we should be able to max out a SATA SSD which can write 430 MB/s.
This is off nearly by a factor 10 what the SSD is able to deliver. Ups. Where did we loose performance? We can try out the Visual Studio CPU Profiler:
That shows that creating and writing to the file are the most expensive parts. But why? Drilling deeper will help to some extent, but we do not see where it did wait for disk writes because the CPU Profiler is only showing CPU consumption. We need another tool. Luckily there is the Concurrency Visualizer extension for free on the Visual Studio Marketplace, which can also show Thread blocking times.
After installation you can start your project from the Analyze menu with your current debugging settings.
We try to stay simple and start FileWriter with the single threaded strategy which is easiest to analyze
FileWriter -generate c:\temp serial
In the Utilization tab we get a nice summary of CPU consumption of the machine. In our case other processes were consuming a lot of CPU while we were running. White is free CPU, dark gray is System and light gray are all other processes. Green is our profiled process FileWriter.exe.
The first observation is that once we start writing files other processes become very busy which looks correlated. We come back to that later.
The Threads view shows where the Main Thread is blocked due to IO/Waiting/…
You can click with the mouse on the timeline of any thread to get in the lower view the corresponding stack trace. In this case we are delayed e.g. by 2,309 s by a call to WriteFile. For some reason the .NET 6.0 symbols could not be resolved but it might also be just a problem with my symbol lookup (there should be a full stack trace visible). Personally I find it difficult to work with this view, when other processes are consuming the majority of CPU. But since Concurrency Visualizer uses ETW there is nothing that can stop us to open the Concurrency Visualizer generated ETW files from the folder C:\Users\\Documents\Visual Studio 2022\ConcurrencyVisualizer\ with WPA.
We see that FileWriter (green) starts off good but then MsMpEng.exe (red = Defender Antivirus) is kicking in and probably delaying our File IO significantly. In the middle we see System (lila = Windows OS) doing strange things (Antivirus again?).
Do you remember that our SSD can write 430 MB/s? The Disk Usage view in WPA shows disk writes. Our process FileWriter writes 500 MB of data, but there are only 4 MB written in 53ms (Disk Service Time). Why are we not writing to the SSD? The answer is: The Operating System is collecting a bunch of writes and later flushes data to disk when there is time or the amount of written data exceeds some internal threshold. Based on actual data we need to revise our mental performance model:
We write to the OS write cache instead of the disk. We should be able to write GB/s! The actual disk writes are performed by the OS asynchronously.
Still we are at 45 MB/s. If the disk is not the problem, then we need to look where CPU is spent. Lets turn over to the gaps where the green line of FileWriter writing on one thread to disk vanishes. There seems to be some roadblock caused by the System process (Lila) which is the only active process during that time of meditation:
When we unfold the Checkpoint Volume stacktag which is part of my Overview Profile of ETWController we find that whenever NTFS is optimizing the Free Bitmap of the NTFS volume it is traversing a huge linked list in memory which blocks all File IO operations for the complete disk. Do you see the bluish parts in the graph which WPA shows where NTFS Checkpoint volume samples are found in the trace? Automatic highlighting of the current table selection in the graph is one of the best features of WPA. The root cause is that we delete the old files of the temp folder in FileWriter from a previous run which produces a lot of stale NTFS entries which, when many new files are created, at some point cause the OS to wreak havoc on file create/delete performance. I have seen dysfunctional machines which did have a high file churn (many files per day were created/deleted) where all file create/delete operations did last seconds instead of sub milliseconds. See a related post about that issue here: https://superuser.com/questions/1630206/fragmented-ntfs-hdd-slow-file-deletion-600ms-per-file. I did hit this issue several times when I cleaned up ETW folders which contain for each ETL file a large list of NGENPDB folders which the .ni.pdb files. At some point Explorer is stuck for many minutes to delete the last files. The next day and a defrag later (the NTFS tables still need defragmentation even for SSDs) things are fast again. With that data we need to revisit our mental performance model again:
Creating/Deleting large amount of files might become very slow depending on internal NTFS states which are in memory structures and manifest as high CPU consumption in the Kernel inside the NtfsCheckpointVolume method.
That view is already good, but we can do better if we record ETW data on our own and analyze it at scale.
How about a simple batch script that
Starts ETW Profiling
Executes your application
Retrieves the test execution time as return value from your application
Stops ETW Profiling
Write an ETW file name with the measured test duration and test case name
Extract ETW Data for further analysis
That is what the RunPerformanceTests.cmd script of the FileWriter project does. The tested application is FileWriter.exe which writes files with different strategies to a folder. When it exits it uses as exit code of the process the duration of the test case in ms. For more information refer to ProfileTest.cmd command line help.
FileWriter and Measurement Details
Read on if you are eager for details how the scripts generate ETW measurement data or go to the next headline.
The most “complex” of ProfileTest.cmd is where several ETW profiles of the supplied MultiProfile.wprp are started. The actual start command in the ProfileTest.cmd script is
wpr -start MultiProfile!File -start MultiProfile.wprp!CSWitch -start MultiProfile.wprp!PMCLLC
But with shell scripting with the correct escape characters it becomes this:
"!WPRLocation!" -start "!ScriptLocation!MultiProfile.wprp"^^!File -start "!ScriptLocation!MultiProfile.wprp"^^!CSwitch -start "!ScriptLocation!MultiProfile.wprp"^^!PMCLLC
This line starts WPR with File IO tracing, Context Switch tracing to see thread wait/ready times and some CPU counters for Last Level Cache misses. Disk and CPU sampling is already part of these profiles.
The PMCLLC profile can cause issues if you are using VMs like HyperV. To profile low level CPU features you need to uninstall all Windows OS HyperV Features to get Last Level Cache CPU tracing running.
You can check if you have access to your CPU counters via
The easiest fix is to remove the “!ScriptLocation!MultiProfile.wprp”^^!PMCLLC from the ProfileTest.cmd script, or you modify the MultiProfile.wprp to set in all HardwareCounter nodes
the value Strict=”false” to prevent WPR from complaining when it could not enable hardware counters. The MultiProfile.wprp profile uses nearly all features of WPR which are possible according to the not really documented xsd Schema to streamline ETW recording to your needs. The result of this “reverse” documenting is an annotated recording profile that can serve as base for your own recording profiles. Which profiles are inside it? You can query it from the command line with WPR:
D:\Source\FileWriter\bin\Release\net6.0>wpr -profiles MultiProfile.wprp
Microsoft Windows Performance Recorder Version 10.0.19041 (CoreSystem)
Copyright (c) 2019 Microsoft Corporation. All rights reserved.
Default
(CPU Samples/Disk/.NET Exceptions/Focus)
CSwitch +(CPU Samples/Disk/.NET Exceptions/Focus/Context Switch)
MiniFilter +(CPU Samples/Disk/.NET Exceptions/Focus/MiniFilter)
File +(CPU Samples/Disk/.NET Exceptions/Focus/File IO)
Network +(CPU Samples/Disk/.NET Exceptions/Focus/Network)
Sockets +(CPU Samples/Disk/.NET Exceptions/Focus/Sockets)
VirtualAlloc (Long Term)
UserGDILeaks (Long Term)
PMCSample
PMC Sampling for PMC Rollover + Default
PMCBranch
PMC Cycles per Instruction and Branch data - Counting
PMCLLC
PMC Cycles per Instruction and LLC data - Counting
LBR
LBR - Last Branch Record Sampling
Or you can load the profile into WPRUI which is part of the Windows SDK where they will show up under the Custom Measurements node:
All profiles can be mixed (except Long Term) to record common things for unmanaged and .NET Code with the least amount of data. A good choice is the default profile which enables CPU sampling, some .NET events, Disk and Window Focus events. These settings are based on years of troubleshooting, and should be helpful recording settings also in your case. The supplied MS profiles are recording virtually everything which is often a problem that the trace sizes are huge (many GB) or, if memory recording is used, the history is one minute or even less on a busy system which has problems.
Generated Data
RunPerformanceTests.cmd saves the ETW data to the C:\Temp folder
After data generation it calls ExtractETW.cmd which is one call to ETWAnalyzer which loads several files in parallel and creates Json files from the ETW files
ETWAnalyzer -extract all -fd c:\temp\*Create*.etl -symserver MS -nooverwrite
This will generate JSON files in C:\Temp\Extract which can be queried with ETWAnalyzer without the need to load every ETL file into WPA.
As you can see the 500+ MB ETL files are reduced to ca. 10 MB which is a size reduction of a factor 50 while keeping the most interesting aspects. There are multiple json files per input file to enable fast loading. If you query with ETWAnalyzer e.g. just CPU data the other *Derived* files are not even loaded which keeps your queries fast.
Working With ETWAnalyzer and WPA on Measured Data
To get an overview if all executed tests suffer from a known issue you can query the data from the command line. To get the top CPU consumers in the Kernel not by method name but by stacktag you can use this query
EtwAnalyzer -dump CPU -ProcessName System -StackTags * -TopNMethods 2
I am sure you ask yourself: What is a stacktag? A stacktag is a descriptive name for key methods in a stacktrace of an ETW event. WPA and TraceProcessing can load a user configured xml file which assigns to a stacktrace a descriptive (stacktag) name, or Other if no matching stacktag rule could be found. The following XML fragment defines e.g. the above NTFS Checkpoint Volume stacktag:
See default.stacktags of ETWAnalyzer how it can be used to flag common issues like OS problems, AV Scanner Activity, Encryption Overhead, … with descriptive stacktags. If more than one stacktag matches the first matching stacktag, deepest in the stacktrace, wins. You can override this behavior if you set a priority value for a given tag.
Based on the 7 runs we find that the CheckpointVolume issue is part of every test which blocks File writes up to 6+ seconds. Other test runs do no suffer from this which makes comparison of your measured numbers questionable. Again we need to update our mental performance model:
Tests which are slower due to CheckpointVolume overhead should be excluded to make measurements comparable where this internal NTFS state delay is not occurring.
The exact number would be the sum of CPU+Wait, because this operation is single threaded we get exact numbers here. ETWAnalyzer can sum method/Stacktags CPU/Wait times with the -ShowTotal flag. If we add Method, the input lines are kept, if we use Process only process and file totals are shown.
EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -ShowTotal method
Now we know that file IO is delayed up to 7461 ms without the need to export the data to Excel. ETWAnalyzer allows you to export all data you see at the console to a CSV file. When you add -csv xxx.csv and you can aggregate/filter the data further as you need it. This is supported by all -Dump commands!
We can also add methods to the stacktags by adding -methods *checkpoint* to see where the stacktag has got its data from
EtwAnalyzer -dump CPU -ProcessName System -StackTags *checkpoint* -methods *checkpoint*
The method NtfsCheckpointVolume has 873 ms, but the stacktag has only 857 ms. This is because other stacktags can “steal” CPU sampling/wait data which is attributed to it. This is curse and feature rolled into one. It is great because if you add up all stacktags then you get the total CPU/Wait time for a given process across all threads. It is bad, because expensive things might be looking less expensive based on one stacktag because other stacktags have “stolen” some CPU/Wait time.
That is all nice, but what is our FileWriter process doing? Lets concentrate on the single threaded use case by filtering the files with -FileDir or -fd which is the shorthand notation for it to *serial*. You need to use * in the beginning, because the filter clause uses the full path name. This allows you to query with -recursive a complete directory tree where -fd *Serial*;!*WithAV* -recursive defines a file query starting from the current directory for all Serial files but it excludes folders or files which contain the name WithAV. To test which files are matching you can use -dump Testrun -fd xxxx -printfiles.
EtwAnalyzer -dump CPU -ProcessName FileWriter -stacktags * -Filedir *serial*
The test takes 12s but it spends 2,2s yet alone with Defender Activity. Other stacktags are not giving away a clear problem we could drill deeper. One interesting thing is that we spend 750ms in thread sleeps. A quick check in WPA shows
that these are coming from the .NET Runtime to recompile busy methods later on the fly. Stacktags are great to flag past issues and categorize activity, but they also tend to decompose expensive operations into several smaller ones. E.g. if you write to a file you might see WriteFile and Defender tags splitting the overall costs of a call to WriteFile. But what stands out that on my 8 Core machine a single threaded FileWriter fully utilizes my machine with Defender activity. We need to update our mental performance model again:
Writing to many small files on a single thread will be slowed down by the Virus Scanner a lot. It also generates a high load on the system which slows down everything else.
Since we did write the application we can go down to specific methods to see where the time is spent. To get an overview we use the top level view of WPA of our CreateFile method
In WPA we find that
CreateFiles
CPU (CPU Usage) 10653 ms
Wait Sum 672 ms
Ready Time Sum 1354 ms
This is the time a thread did wait for a CPU to become free when all CPUs were used
which adds up to 12,679 s which pretty closely (16ms off) matches the measured 12,695 s. It is always good to verify what you did measure. As you have seen before we have already quite often needed to change our mental performance model. The top CPU consumers in the CreateFiles method are
What is surprising is that ca. 2s of our 12s are spent in creating interpolated strings. The innocent line
string line = $"echo This is line {lineCount++}";
costs ca. 16% of the overall performance. Not huge, but still significant. Since we are trying to find out which file writing strategy is fastest this is OK because the overhead will be the same for all test cases.
We can get the same view as in WPA for our key methods with the following ETWAnalyzer query
EtwAnalyzer -dump CPU -ProcessName FileWriter -fd *serial* -methods *StreamWriter.WriteLine*;*OSFileStreamStrategy*ctor*;*StreamWriter.WriteLine*;*DefaultInterpolatedStringHandler*AppendFormatted*;*TlsOverPerCoreLockedStacksArrayPool*Rent;*DefaultInterpolatedStringHandler.ToStringAndClear;*StreamWriter.Dispose;*filewriter* -fld
You might have noticed that the method names in the list have :: between the class and method. ETWAnalyzer uses always . which is less typing and is in line with old .NET Framework code which used :: for JITed code and . for precompiled code.
Additionally I have smuggled -fld to the query. It is the shorthand for -FirstLastDuration or -fld which shows the difference between the Last-First time the method did show up in CPU Sampling or Context switch data. Because we know that our test did only measure DataGenerator.CreateFile calls we see in the Last-First column 12,695 s which is down to the ms our measured duration! This option can be of great value if you want to measure parallel asynchronous activity wall clock time. At least for .NET code the generated state machine classes of the C# compiler contain central methods which are invoked during init and finalization of asynchronous activities which makes it easy to “measure” the total duration even if you have no additional trace points at hand. You can add to -fld s s or -fld local local or -fld utc utc to also show the first and last time in
Trace Time (seconds)
Local Time (customer time in the time zone the machine was running)
UTC Time
See command line help for further options. ETWAnalyzer has an advanced notation of time which can be formatted in the way you need it.
The WPA times for Wait/Ready are the same in ETWAnalyzer, but not CPU. The reason is that to exactly judge the method CPU consumption you need to look in WPA into CPU Usage (Sampled) which shows per method CPU data, which is more accurate. ETWAnalyzer merges this automatically for you.
CPU usage (Precise) Trap
The view CPU usage (Precise) in WPA shows Context switch data which is whenever your application called a blocking OS call like Read/WriteFile/WaitForSingleObject, … your thread is switched off the CPU which is called context switch. You will see in this graph therefore only (simplification!) methods which were calling a blocking OS method. If you have e.g. a busy for loop in method int Calculate() and the next called method is Console.WriteLine, like this,
int result = Calculate();
Console.WriteLine(result)
to print the result then you will see in CPU usage (Precise) all CPU attributed to Console.WriteLine because that method was causing the next blocking OS call. All other non blocking methods called before Console.WriteLine are invisible in this view. To get per method data you need to use the CPU data of CPU sampling which gets data in a different way to much better measure CPU consumption at method level. Is Context Switch data wrong? No. The per thread CPU consumption is exact because this is a trace of the OS Scheduler when he moves threads on/off a CPU. But you need to be careful to not interpret the method level CPU data as true CPU consumption of that method.
CPU Usage (Sampled) Trap
Lets compare both CPU views (Sampled) vs (Precise)
StreamWriter.Dispose method consumes 887 ms in Sampling while in Context Switch view we get 1253 ms. In this case the difference is 41%! With regards to CPU consumption you should use the Sampled view for methods to get the most meaningful value.
Every WPA table can be configured to add a Count column. Many people fell into the trap to interpret the CPU Sampling Count column as number of method calls, while it actually just counts the number of sampling events. CPU sampling works by stopping all running threads 1000 (default) times per second, take a full stacktrace and write that data to ETW buffers. If method A() did show up 1000 times then it gets 1000*1ms CPU attributed. CPU sampling has statistical bias and can ignore e.g. method C() if it is executing just before, or after, the CPU sampling event did take a stack trace.
ETWAnalyzer Wait/Ready Times
The method list shown by ETWAnalyzer -Dump CPU -methods xxx is a summation across all threads in a process. This serves two purposes
Readability
If each thread would be printed the result would become unreadable
File Size
The resulting extracted Json file would become huge
This works well for single threaded workloads, but what about 100 threads waiting for something? WPA sums all threads together if you remove the thread grouping. ETWAnalyzer has chosen a different route. It only sums all non overlapping Wait/Ready times. This means if more than one thread is waiting, the wait is counted only once.
The Wait/Ready times for methods for multi threaded methods are the sum of all non overlapping Wait/Ready times for all threads. If e.g. a method was blocked on many threads you will see as maximum wait time your ETW recording time and not some arbitrary high number which no one can reason about. It is still complicated to judge parallel applications because if you have heavy thread over subscription you will see both, Wait and Ready, times to reach your ETW recording time. This is a clear indication that you have too many threads and you should use less of them.
Make Measurements Better
We have found at least two fatal flaws in our test
Anti Virus Scanner is consuming all CPU even in single threaded test
Add an exclude rule to the folder via PowerShell
Add-MpPreference -ExclusionPath “C:\temp\Test”
To remove use
Remove-MpPreference -ExclusionPath “C:\temp\Test”
Some test runs suffer from large delays due to CheckpointVolume calls which introduces seconds of latency which is not caused by our application
Change Script to delete temp files before test is started to give OS a chance to get into a well defined file system state
When we do this and we repeat the tests a few times we get for the single threaded case:
Nearly a factor 3 faster for changing not a single line of code is pretty good. The very high numbers are coming from overloading the system when something else is running in the background. The 12s numbers are for enabled Defender while < 10s are with a Defender exclusion rule but various NtfsCheckpoint delays. When we align everything correctly then we end up with 7s for the single threaded case which is 71 MB/s of write rate. The best value is now 3,1s for 4 threads with 161 MB/s.
What else can we find? When looking at variations we find also another influencing factor during file creation:
We have a single threaded test run with 9556 ms and a faster one with 7066 ms. Besides the CheckpointVolume issue we loose another 1,3 s due to reading MFT and paged out data. The faster test case (Trace #2) reads just 0,3 ms from the SSD, while we read in the slower Run (Trace #1) 1,3s from disk. The amount of data is small (8 MB), but for some reason the SSD was handing out the data slowly. Yes SSDs are fast, but sometimes, when they are busy with internal reorg things, they can become slow. SSD performance is a topic for another blog post.
We need to update our mental performance model
File System Cache state differences can lead to wait times due to Hard faults. Check method MiIssueHardFault if the tests have comparable numbers.
Etwanalyzer -dump CPU -pn Filewriter -methods MiIssueHardFault
That are another 2 s of difference which we can attribute to File System cache state and paged out data.
Measuring CPU Efficiency
We know that 4 threads are fastest after removing random system noise from our measured data. But is is it also the most efficient one? We can dump the CPUs Performance Monitoring Counters (PMC) to console with
ETWAnalyzer -dump PMC -pn FileWriter
The most fundamental metric is CPI which is Cycles/Instructions. If you execute the the same use case again you should execute the same numbers of instructions. When you go for multithreading you need to take locks and coordinate threads which costs you extra instructions. The most energy efficient solution is therefore the single threaded solution, although it is slower.
To get an overview you can export data into a CSV file from the current directory and the initial measurements we did start with. The following query will dump the Performance Monitoring Counters (PMC) to a CSV file. You can use multiple -fd queries to select the data you need:
The CSV file contains the columns Test Time in ms and CPI to make it easy to correlate the measured numbers with something else e.g. CPI in this case. This pattern is followed for all -Dump commands of ETWAnalyzer. From this data we can plot Test Timing vs CPI this is Cycles per Instruction into a Pivot Table:
The initial test where we had
Defender
CheckpointVolume
issues was not only much slower, but CPU efficiency wise much worse: 1,2 vs 0,69 (smaller is better). This proves that even if Defender would be highly efficient on all other cores and not block our code at all we still would do much worse (73%), because even when we do exactly the same, CPU caches and other internal CPU resources, are slowing us down. With this data you can prove why you are slower even when everything else is the same.
Conclusions
If you have read that far I am impressed. That is a dense article with a lot of information. I hope I have convinced you that profiling your system with ETW and not only your code for strategic test cases is an important asset in automated regression testing.
If you look only at your code you will miss the universe of things that are moving around in the operating system and other processes. You will end up with fluctuating tests which no one can explain. We are doing the same thing 10 times and we still get outliers! The key is not only to understand your own code, but also the system in which it is executing. Both need attention and coordination to achieve the best performance.
To enable that I have written ETWAnalyzer to help you to speed up your own ETW analysis. If you go for ETW you are talking about a lot of data which needs a fair amount of automation to become manageable. Loading hundreds of tests into WPA is not feasible, but with ETWAnalyzer mass data extraction and analysis never was easier. The list of supported queries is growing. If you miss something file an issue so we can take a look what the community is needing.
To come back to the original question. Is profiling too difficult? If you are using only UI based tools then the overhead to do this frequently it is definitely too much. But with the scripting approach shown by FileWriter to record data and extract it with ETWAnalyzer, to make the data queryable, is giving you a lot of opportunities. ETWAnalyzer comes with an object model for the extracted Json files. You can query for known issues directly and flag past issues within seconds without the need to load the profiling data into a graphical viewer to get an overview.