VPN tunnels are often set up between on-premises environments and Azure. Sometimes these tunnels drop regularly and have to be restarted. The cause is often a configuration mismatch between the Azure VPN gateway and the on-premises gateway. The following troubleshooting steps have usually helped me.

0. Activate Diagnostic Logging

In preparation, the diagnostic settings of the VPN gateway should be enabled. The available log categories contain different information; I definitely recommend the TunnelDiagnosticLog and IKEDiagnosticLog categories. You can find a brief explanation of their content in the Microsoft Docs.

Activation is done in Azure Monitor, where the corresponding VPN gateway is selected under Diagnostic settings (already done in the screenshot, which is why the diagnostic status is shown as enabled):

Then the required categories and a Log Analytics workspace are selected:

1. Check SA-Lifetime and SA-Datasize

The Azure gateway's IPsec configuration consists of two phases. For tunnel drops, the SA-Lifetime is particularly relevant in phase 1, and both the SA-Lifetime and the SA-Datasize are relevant in phase 2. All of this information can be found in the Connect event of the TunnelDiagnosticLog and should be compared with the configuration of the on-premises gateway:

AzureDiagnostics
| where Category == "TunnelDiagnosticLog"
| where TimeGenerated > ago(10d)
| extend IdleDurationSeconds = extractjson("$[0].IdleDurationSeconds", ikeSAs_Qms_s)
| extend LifetimeKilobytes = extractjson("$[0].LifetimeKilobytes", ikeSAs_Qms_s)
| extend LifetimeSeconds = extractjson("$[0].LifetimeSeconds", ikeSAs_Qms_s)
| project TimeGenerated, ikeSAs_LifeTimeSeconds_d, IdleDurationSeconds, LifetimeKilobytes, LifetimeSeconds
| summarize LastUsedAtConnect = max(TimeGenerated) by ikeSAs_LifeTimeSeconds_d, IdleDurationSeconds, LifetimeKilobytes, LifetimeSeconds
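
Comparing the values by hand is error-prone, so here is a minimal Python sketch of that comparison. The parameter names and all numbers are placeholders I made up for illustration; replace them with the query result and the settings of your on-premises gateway.

```python
def find_sa_mismatches(azure, on_prem):
    """Return {parameter: (azure_value, on_prem_value)} for every parameter that differs."""
    mismatches = {}
    for name in sorted(set(azure) | set(on_prem)):
        azure_value = azure.get(name)      # None if the parameter is missing
        on_prem_value = on_prem.get(name)  # None if the parameter is missing
        if azure_value != on_prem_value:
            mismatches[name] = (azure_value, on_prem_value)
    return mismatches


# Placeholder values (assumptions, not taken from a real gateway).
azure_sa = {
    "phase1_lifetime_seconds": 28800,
    "phase2_lifetime_seconds": 27000,
    "phase2_lifetime_kilobytes": 102400000,
}
on_prem_sa = {
    "phase1_lifetime_seconds": 28800,
    "phase2_lifetime_seconds": 3600,
    "phase2_lifetime_kilobytes": 102400000,
}

for name, (azure_value, on_prem_value) in find_sa_mismatches(azure_sa, on_prem_sa).items():
    print(f"{name}: Azure={azure_value}, on-premises={on_prem_value}")
```

Any parameter that shows up in the output is a candidate for the regular tunnel drops and should be aligned on one of the two sides.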

2. Encryption Domains

There are 3 VNets (10.41.3.0/25, 10.41.3.128/29, 10.41.3.144/28) in the Azure environment, which are peered so that the VMs in these VNets can use the VPN tunnel. The IKEDiagnosticLog and the TunnelDiagnosticLog show that the wrong encryption domains were transmitted at first (red boxes) and the correct VNets later (green boxes). The configuration was corrected in the period in between.

The wrong configuration can be found in the IKEDiagnosticLog:

AzureDiagnostics
| where Category == "IKEDiagnosticLog"
| where TimeGenerated > ago(10d)
| extend vnet=extract(@" StartAddress 10\.41\.3\.[0-9]{1,3} EndAddress 10\.41\.3\.[0-9]{1,3}", 0, Message)
| project vnet, TimeGenerated
| where isnotempty(vnet)
| summarize max(TimeGenerated) by vnet
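
The log reports each encryption domain as a start and end address, while the VNets are configured in CIDR notation. The following Python sketch converts such a range into networks and checks it against the three VNets from above. It assumes the "StartAddress ... EndAddress ..." format that the query matches; the sample strings and function names are made up for illustration.

```python
import ipaddress
import re

# The three VNets from this article that should appear as encryption domains.
EXPECTED_VNETS = {
    ipaddress.ip_network(cidr)
    for cidr in ("10.41.3.0/25", "10.41.3.128/29", "10.41.3.144/28")
}

# Assumed message format, based on the pattern used in the query above.
RANGE_PATTERN = re.compile(
    r"StartAddress (\d{1,3}(?:\.\d{1,3}){3}) EndAddress (\d{1,3}(?:\.\d{1,3}){3})"
)


def range_to_networks(text):
    """Convert the first StartAddress/EndAddress pair in text into a list of CIDR networks."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        return []
    first = ipaddress.ip_address(match.group(1))
    last = ipaddress.ip_address(match.group(2))
    return list(ipaddress.summarize_address_range(first, last))


def is_expected_domain(text):
    """True if the range in text maps exactly onto the expected VNets."""
    networks = range_to_networks(text)
    return bool(networks) and all(network in EXPECTED_VNETS for network in networks)


# Hypothetical sample values, not copied from a real log.
print(is_expected_domain(" StartAddress 10.41.3.0 EndAddress 10.41.3.127"))  # True
print(is_expected_domain(" StartAddress 10.41.3.0 EndAddress 10.41.3.255"))  # False
```

A range that does not match one of the VNets, such as the whole 10.41.3.0/24 in the second call, points to an encryption domain that differs from what the other side expects.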

The same can be seen in the TunnelDiagnosticLog:

AzureDiagnostics
| where Category == "TunnelDiagnosticLog" 
| where TimeGenerated > ago(10d)
| extend vnet=extractjson("$[0].Tsi", ikeSAs_Qms_s)
| where isnotempty(vnet)
| project vnet, TimeGenerated 
| summarize lastSend = max(TimeGenerated) by vnet
| order by lastSend desc

3. MTU & MSS Clamping

MTU and MSS clamping cannot be configured on the Azure side either, so Microsoft recommends the following sizes:

For Azure, we recommend that you set TCP MSS clamping to 1,350 bytes and tunnel interface MTU to 1,400.

(Source: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-tcpip-performance-tuning#vpn-and-mtu)
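
To see how the two numbers relate, here is a small Python sketch of the arithmetic. It assumes IPv4 and TCP headers without options (20 bytes each); the function name is my own.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def max_mss_for_mtu(mtu):
    """Largest TCP payload per packet that fits into the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


TUNNEL_MTU = 1400       # recommended tunnel interface MTU
RECOMMENDED_MSS = 1350  # recommended TCP MSS clamping value

largest_mss = max_mss_for_mtu(TUNNEL_MTU)
print(largest_mss)                    # 1360
print(largest_mss - RECOMMENDED_MSS)  # 10 bytes of headroom
```

With a tunnel MTU of 1,400, a segment of up to 1,360 bytes would still fit, so the recommended 1,350 stays slightly below that limit.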