Proposal for an Optimization Toggle for Real-time Mode to Reduce Memory Overhead Using RingBuffer #1963

Open
floydchenv opened this issue Dec 27, 2023 · 2 comments


@floydchenv

Hello PerfView Team,

I am working on a UE4 game project and currently utilizing the Microsoft.Windows.EventTracing library to parse stack information corresponding to samples and ContextSwitches recorded in ETL for the game process. My approach involves classifying stack information for each frame based on timestamps, ultimately yielding stack data for all threads in each frame. The results are as follows:
image

Currently, we use xperf to record the necessary performance data. Here is a snippet of the parsing code we use:

  // Register every data source we need before the trace is processed.
  IPendingResult<ISymbolDataSource> pendingSymbolData = trace.UseSymbols();
  IPendingResult<ICpuSampleDataSource> pendingCpuSamplingData = trace.UseCpuSamplingData();
  IPendingResult<ICpuSchedulingDataSource> pendingCpuSchedulingData = trace.UseCpuSchedulingData();
  var pendingProcessorCounters = trace.UseProcessorCounters();
  var pendingProcesses = trace.UseProcesses();
  var pendingMetadata = trace.UseMetadata();
  var pendingSystemInfo = trace.UseSystemMetadata();
  var pendingTraceStatistics = trace.UseTraceStatistics();
  trace.Process();
  
  ISymbolDataSource symbolData = pendingSymbolData.Result;
  ICpuSampleDataSource cpuSamplingData = pendingCpuSamplingData.Result;
  ISystemMetadata systemMetadata = pendingSystemInfo.Result;
  
  // CPU samples: one stack per sampled-profile event.
  foreach (ICpuSample sample in cpuSamplingData.Samples)
  {
      if (sample.Stack != null)
      {
          // ... get sample info
          foreach (var frameInfo in sample.Stack.Frames)
          {
              // ... build the call chain list
          }
      }
  }
  
  // Context switches: one stack per switch-in of a thread.
  if (pendingCpuSchedulingData.HasResult)
  {
      foreach (ICpuThreadActivity slice in pendingCpuSchedulingData.Result.ThreadActivity)
      {
          if (slice.SwitchIn.Stack != null &&
              slice.WaitingDuration != null &&
              slice.Thread.Name != null)
          {
              // ... get slice sample info
              foreach (var frameInfo in slice.SwitchIn.Stack.Frames)
              {
                  // ... build the call chain list
              }
          }
      }
  }

We are now looking to leverage ETW's Real-time mode for on-the-fly data recording and parsing. However, we've encountered a significant issue: if we enable the collection of ContextSwitch and Dispatcher stack information in a Microsoft.Diagnostics.Tracing.TraceEvent session, memory usage grows rapidly (by more than 1 MB/s), with no signs of stabilization or decrease.

  session.EnableKernelProvider(
      // First argument: kernel events to enable.
      KernelTraceEventParser.Keywords.Profile
      | KernelTraceEventParser.Keywords.ContextSwitch
      | KernelTraceEventParser.Keywords.Dispatcher
      | KernelTraceEventParser.Keywords.Process
      | KernelTraceEventParser.Keywords.ImageLoad
      | KernelTraceEventParser.Keywords.Thread,
      // Second argument: events for which stacks are captured.
      KernelTraceEventParser.Keywords.Profile
      | KernelTraceEventParser.Keywords.ContextSwitch
      | KernelTraceEventParser.Keywords.Dispatcher
      | KernelTraceEventParser.Keywords.Process
      | KernelTraceEventParser.Keywords.ImageLoad
      | KernelTraceEventParser.Keywords.Thread
  );

Upon investigating memory allocations with dotMemory, we noticed that most of the memory usage is concentrated in GrowableArray.

Would it be possible to implement a RingBuffer mechanism to store this data in Real-time Sessions? This feature could greatly optimize memory usage for real-time performance analysis, particularly in complex applications like ours.
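To make the request concrete, here is a minimal sketch of the kind of structure meant by "RingBuffer": a fixed-capacity buffer that overwrites its oldest entry once it is full, so memory stays bounded no matter how long the session runs. The RingBuffer<T> type and its members are hypothetical names for illustration only; nothing like this is claimed to exist in TraceEvent today.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of TraceEvent: a fixed-capacity buffer
// that overwrites the oldest entry once it is full.
public sealed class RingBuffer<T>
{
    private readonly T[] _items;
    private int _start;  // index of the oldest element
    private int _count;  // number of valid elements

    public RingBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count => _count;

    public void Add(T item)
    {
        // When the buffer is full this index equals _start, i.e. the oldest slot.
        int end = (_start + _count) % _items.Length;
        _items[end] = item;
        if (_count == _items.Length)
            _start = (_start + 1) % _items.Length; // the oldest entry was overwritten
        else
            _count++;
    }

    public IEnumerable<T> OldestToNewest()
    {
        for (int i = 0; i < _count; i++)
            yield return _items[(_start + i) % _items.Length];
    }
}
```

With a capacity of 3, adding the values 1 through 5 leaves 3, 4, 5 in the buffer. The trade-off is that anything older than the window is lost, so this only works for data that is not needed for later symbol or stack resolution.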

@floydchenv
Author

Additionally, I have a question about the relationship between Microsoft.Windows.EventTracing and Microsoft.Diagnostics.Tracing.TraceEvent. How are these two related?

Apart from this, I am also attempting to handle the stack data using native C++ APIs. However, I am quite confused about how to reconstruct stack information from the InstructionPointer recorded in SampledProfileEvent, as well as how to interpret ContextSwitchEvent.

Is the reconstruction dependent on StackWalkEvent? I've been struggling to understand the correlation while reviewing the code. Could you please provide a detailed explanation of this relationship? Thank you!

@brianrob
Member

brianrob commented Jan 9, 2024

Hi @floydchenv. There are a few questions here, so let me try to address each one.

First, I am not surprised that you see lots of memory usage in GrowableArray when you enable high-verbosity events in a real-time session. This is most often caused by the fact that we must keep a certain amount of data around that is used later to support analysis. For example, we must keep dynamic symbol information around if we're going to be resolving symbols for jitted code in stacks. This is over and above the ring-buffer type approach that is being used to limit how much data is kept on-hand. How many ETW events are kept around during a live session is related to how quickly you are processing the incoming events. If you are processing them slower than the incoming rate, then you'll see committed size grow. As a next step here, I would be interested to understand which GrowableArray(s) are taking up most of the memory. Also, what percentage of the total process committed size do these data structures represent?

On the two different libraries, there isn't really a relationship between them other than that they both can parse ETW events. They grew up separately and are maintained by two different teams. I'm not super familiar with Microsoft.Windows.EventTracing, so I can't comment on how it compares to TraceEvent.

With regard to stack handling, the Sample event will contain the IP, but the event also has a stack associated with it. The stack was captured during collection by the kernel's stack walker and saved into the trace. You should not need to walk the stack explicitly. Depending on how the trace is collected (if stack compression is enabled), you may need to do the work to capture the stack and then match it to the event. Take a look at the handler that TraceEvent registers here:

kernelParser.StackWalkStack += delegate (StackWalkStackTraceData data)

Also worth calling out that stacks come in pieces. There is a kernel and a user fragment and they must be put together.
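As a rough illustration of putting the pieces together, here is a minimal sketch that groups stack fragments by the timestamp of the event they belong to and the thread ID, then concatenates them. StackStitcher and its members are hypothetical names, not TraceEvent APIs. The sketch also assumes that frames within a fragment are ordered leaf first and that the kernel fragment arrives before the user fragment; both assumptions should be checked against the TraceEvent source before relying on them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not a TraceEvent API: joins the kernel and user
// fragments that belong to the same event on the same thread.
public sealed class StackStitcher
{
    private readonly Dictionary<(long TimeStampQpc, int ThreadId), List<ulong>> _stacks =
        new Dictionary<(long TimeStampQpc, int ThreadId), List<ulong>>();

    // Call once per stack-walk fragment. eventTimeStampQpc is the timestamp
    // of the event the stack belongs to, not of the stack-walk event itself.
    public void AddFragment(long eventTimeStampQpc, int threadId, IReadOnlyList<ulong> frames)
    {
        var key = (eventTimeStampQpc, threadId);
        if (!_stacks.TryGetValue(key, out List<ulong> stack))
        {
            stack = new List<ulong>();
            _stacks[key] = stack;
        }
        // Assumption: the kernel fragment arrives first, so appending keeps
        // the combined stack ordered leaf first.
        stack.AddRange(frames);
    }

    // Returns the combined stack for an event and forgets it, so the
    // dictionary does not grow without bound.
    public IReadOnlyList<ulong> Take(long eventTimeStampQpc, int threadId)
    {
        var key = (eventTimeStampQpc, threadId);
        if (_stacks.TryGetValue(key, out List<ulong> stack))
        {
            _stacks.Remove(key);
            return stack;
        }
        return Array.Empty<ulong>();
    }
}
```

Deciding when a stack is complete (that is, when no further fragment will arrive for that key) is deliberately left out of this sketch.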

To parse the ContextSwitch event, here's a pointer to the code that TraceEvent uses to parse the payload:

public sealed class CSwitchTraceData : TraceEvent
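For orientation, here is a minimal sketch of reading the main fields out of a raw CSwitch payload. The offsets follow my understanding of the documented CSwitch event layout and should be treated as an assumption to verify against CSwitchTraceData. The CSwitchPayload type is a hypothetical name used only for this example.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical type for illustration. Offsets are an assumption based on
// the documented CSwitch layout; verify them against CSwitchTraceData.
public sealed class CSwitchPayload
{
    public uint NewThreadId { get; private set; }
    public uint OldThreadId { get; private set; }
    public sbyte NewThreadPriority { get; private set; }
    public sbyte OldThreadPriority { get; private set; }
    public sbyte OldThreadWaitReason { get; private set; }
    public sbyte OldThreadState { get; private set; }
    public uint NewThreadWaitTime { get; private set; }

    public static CSwitchPayload Parse(ReadOnlySpan<byte> data)
    {
        if (data.Length < 20)
            throw new ArgumentException("CSwitch payload is too short.", nameof(data));

        return new CSwitchPayload
        {
            NewThreadId = BinaryPrimitives.ReadUInt32LittleEndian(data),           // offset 0
            OldThreadId = BinaryPrimitives.ReadUInt32LittleEndian(data.Slice(4)),  // offset 4
            NewThreadPriority = (sbyte)data[8],
            OldThreadPriority = (sbyte)data[9],
            OldThreadWaitReason = (sbyte)data[12],
            OldThreadState = (sbyte)data[14],
            NewThreadWaitTime = BinaryPrimitives.ReadUInt32LittleEndian(data.Slice(16)),
        };
    }
}
```

The same offsets would apply when reading the payload from native C++ code, again assuming the layout above is correct for the event version in the trace.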

Hope that helps.
