Troubleshooting Sitecore on Azure PAAS – Part 2

Check out the Part – 1 of Troubleshooting Sitecore on Azure PAAS if you haven’t already.

First request slow? No. Subsequent requests after first request slow.

Generally when we access any website URL, first time it takes time. As it downloads all the assets like CSS, JavaScript, Images etc. But for all subsequent requests it loads page faster as browser has already cached the static assets.

We are having a site which is on Sitecore 9.0.2 on Azure PAAS. Dev/QA are single app while UAT have different CM and CD. All our web app are on Standard tier S3 plan.

Here, i was facing exact opposite scenery. First request was quite fast (Approx 3-4 seconds). But if i refresh the page or access any page of site just after first request, it get’s spinning. And it takes a lot of time (Approx 80-90 seconds). Strange. Here’s what i tried:

Check if custom pipelines or components are culprit

  • I tried first disabling all the custom httprequestbegin and MVC.PageItem pipelines. That was the first thing in the list as we were having quite a few processors in these pipelines for Language resolving, Currency resolving, Page Not Found, Error Page, Wildcard module. But even after disabling all the things, same issue. Not a slight improvement. Note: We left few processor enabled as without that site would not function properly. But we verified there isn’t anything in that processor that can cause this issue.
  • I was sure that we are not that much talented that we write code that behave in this way. Still i wanted to remove this possibility. So, i created an item which was only having a Sample MVC layout and not a single component. When i was accessing this item URL, still the issue was same. So, there isn’t any issue in any of the component.

I diverted the route to

Environment & Application Insights

  • Live Metrics Stream

    • You can navigate to your Application Insight resource for your web app. Click on Live Metrics Stream from left navigation. With Live Metrics Stream, you can:
      • Validate a fix while it is released, by watching performance and failure counts.
      • Watch the effect of test loads, and diagnose issues live.
      • Focus on particular test sessions or filter out known issues, by selecting and filtering the metrics you want to watch.
      • Get exception traces as they happen.
      • Experiment with filters to find the most relevant KPIs.
      • Monitor any Windows performance counter live.
      • Easily identify a server that is having issues, and filter all the KPI/live feed to just that server.

      Live matrics stream

    • So, i opened Live Metrics Stream and requested URL. As soon as i hit the URL i saw that CPU was going too high. And thus all subsequent requests were taking a lot time. If we wait for CPU to come normal and then hit URL then it was working fine.
      2019-04-01 22_59_17-Live Metrics Stream - Microsoft Azure
    • But it was not the actual web app which was going high on CPU. So, we needed to find out which process on this instance is going high on CPU.
  • Process Explorer

    • Go to Azure and open App Service blade and click on Advanced Tools. Advanced Tools provides a collection of developer oriented tools and extensible points for your App Service Apps.
      how-to-identify-the-cause-100-2
    • Click on Process Explorer. Here you will see all the processes running in the site’s context:
      how-to-identify-the-cause-100-3.jpg
    • But in our case, process explorer was showing few more processes which were related to SnapshotUploader. As shown:
      2019-04-27 14_03_32-Process Explorer

      • Main w3wp is an instance of an application.
      • ApplicationInsightProfiler – Profiler to collect profiling traces that help you see where time is spent in code
      • DaasRunner – web job running the Diagnostics as a Service Site Extension
      • SnapshotUploader – Collect call stacks for your application when an exception is thrown
    • And as per stats SnapshotUploader was using the most of the CPU time and private memory. So, now we know what is causing the issue. We will see how we configured it and why it’s causing the issue.
  • Snapshot Debugger

    • I went to app service and selected Application Insights to see all the configurations or features we are using related to Application Insights
      ai1ai2
    • As per the Application insights site extension configuration, Snapshot debugger was enabled.
    • What is Snapshot debugger ?
      • When an exception occurs, you can automatically collect a debug snapshot from your live web application. The snapshot shows the state of source code and variables at the moment the exception was thrown.
      • After an exception has occurred in your application and a snapshot has been created, you should have snapshots to view. It can take 5 to 10 minutes from an exception occurring to a snapshot ready and viewable from the portal. To view snapshots, go to Application Insights resource, in the Failure pane, select the Operations button when viewing the Operations tab, or select the Exceptions button when viewing the Exceptions tab:
        failures-page
      • Select an operation or exception in the right pane to open the End-to-End Transaction Details pane, then select the exception event. If a snapshot is available for the given exception, an Open Debug Snapshot button appears on the right pane with details for the exception.
        e2e-transaction-page
      • In the Debug Snapshot view, you see a call stack and a variables pane. When you select frames of the call stack in the call stack pane, you can view local variables and parameters for that function call in the variables pane.
        open-snapshot-portal
    • What was causing the Snapshot Debugger to generate so many snapshot?
      • As we were facing this issue on Dev/QA environments. Where generally, we won’t have complete data. In couple of cases we were logging such information as Error. And these errors which we were generating on our own were having count of 3-4 per request. And thus Snapshot Debugger be always busy generating snapshots. And thus it was utilizing most of the CPU and Memory resources and not allowing actual web app to perform well.
      • We changed log level to Warn as it wasn’t actual Error but a configuration issue.
      • We also disabled the Snapshot Debugger feature as web app was not having any problem or we weren’t seeing many run-time exceptions in logs. You need to restart the web app after this change to come into effect.

Conclusion

There are lot of things available to trace any issues/exceptions real time in Application Insights. It’s purely on need of a project how much you want to use from all these tools.

Base line should be go slow as all these tools can be heavy on resources as well as cost. Enable only must required functionality from Application Insights. If you face any issue, you can always turn on and validate.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s