This condition is caused by an overloaded OpsMgr system because the OpsMgr system (for example, the root management server) is too busy or is offline. Event Source: HealthService Please associate an account with the profile. Determine whether the health service is running on the management server or gateway. Event ID: 21016 Operations Manager database or data warehouse performance issues, Management server or gateway server performance issues. This usage can be significant when the gateway is taken temporarily offline and must then handle accumulated agent data that the agents generated and tried to send when the gateway was still offline. Try to determine what kind of workloads the management server or gateway is monitoring. For 32-bit servers that have more than 4 GB of RAM, the /3gb switch in Boot.ini could actually limit the amount of memory that SQL Server can address. Event Source: OpsMgr Connector OpsMgr was unable to set up a communications channel to %1 and there are no failover hosts. If the agents that report to a particular management server or gateway are unavailable, troubleshooting should start at the management server or gateway level. Examine the Operations Manager event log on the agent for any of the events that are listed in Scenario 2. Event Source: OpsMgr Connector Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this gateway is handling. If the health service has stopped responding, generate an ADPlus dump in a service hang mode to help determine the cause of the problem. The health service watcher had received heartbeats previously and the state was reported as healthy. Workflow will not be loaded. Therefore, for servers that have more than eight processors, you generally should set Max Degree of Parallelism to a value of 8.
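The Max Degree of Parallelism guidance in this article (leave the default of 0 on servers with eight or fewer processors, and generally use 8 on servers with more than eight) can be sketched as a small helper. This is only an illustration: the function name `recommended_maxdop` is hypothetical, and the returned value is a starting point to review, not a setting that is applied to SQL Server.

```python
def recommended_maxdop(processor_count: int) -> int:
    """Return a suggested Max Degree of Parallelism value.

    Hypothetical helper that encodes the rule of thumb from the text:
    - eight or fewer processors: 0 (the default, use all available processors)
    - more than eight processors: 8
    """
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    return 8 if processor_count > 8 else 0


if __name__ == "__main__":
    # Print the suggestion for a few example processor counts.
    for count in (4, 8, 16):
        print(count, "processors ->", recommended_maxdop(count))
```

The helper only computes the suggested value; applying it to a server is a separate, deliberate step.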
OpsMgr Connector\Data Bytes Received: The number of data bytes received by the management server (that is, the size of incoming data after decompression). OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the management server (that is, the size of outgoing data before compression). Amount of free space on drives that contain data warehouse or Ops and Tempdb files, RAID level (0, 1, 5, 0+1 or 1+0) for drives that are used by SQL Server, If SAN storage is used: number of spindles on each LUN that's used by SQL Server, If the converted Exchange 2007 management pack is being used or has ever been used: number of rows in the LocalizedText table in the Ops database and in the EventPublisher table in the data warehouse database. Green check mark: The agent or management server is running normally. Event Source: HealthService report. There is a problem on the agent or management server. Event ID: 623 For servers that have more than eight processors, the time that it takes SQL Server to coordinate the use of all processors may be counterproductive. Do the agents report to the same management server? Event Description: OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the gateway (that is, the number of outgoing bytes after compression). Check the event log on the server for the presence of 20000 events, which indicate that agents that are not approved are attempting to connect. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, these conditions may generate a problem that affects file transfer. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%n This condition may have occurred because the Account is not configured to be distributed to this computer.
Event Description: For agentless systems, for network devices, and for Unix and Linux servers, troubleshooting should start at the agent, management server, or gateway that is monitoring these objects. If this value remains higher than 10 for a long time, and it does not drop, this indicates that the queue is backed up. The default value is 0, which means that all available processors will be used. Cannot access plain text RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Event Description: Event ID: 20051 Workflow will not be loaded. When the gateway is relaying a large amount of data, both the CPU and I/O operations may show high usage. Events 1104, 1105, 1106, 1107, and 1108: These events may cause Events 1102 and 1103 to occur. Session-context ThreadId: . The error code is %3(%4). Event Source: HealthService Failed to store data in the Data Warehouse.%rException '%5': %6 %n%nOne or more workflows were affected by this. Such workloads might include network devices, cross-platform agents, synthetic transactions, Windows agents, and agentless computers. Event Description: OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second. Does this issue occur during a specific time of the day? To troubleshoot management server or gateway performance and SQL Server performance, see the Resolution for scenario 4 section. If the disk array is not at maximum I/O capacity, the next most likely bottleneck is the CPU. %n%nManagement Group: %1 %nRun As Profile: %7 %nSecureReferenceOverride name: %6 %nSecureReferenceOverride ID: %4 %nObject name: %3 %nObject ID: %2 %nAccount SSID: %5. To resolve the issue, first determine the cause of the issue. If there is event ID 21006, follow the same guidelines that are mentioned in Resolution for scenario 2. 
Gray agent name, gray check mark: The health service watcher on the Root Management Server (RMS) that is watching the health service on the monitored computer is no longer receiving heartbeats from the agent. Between 20 - 50 ms: slow, needs attention, Greater than 50 ms: serious I/O bottleneck, Process(HealthService)\Private Bytes (depending on how many agents this gateway is managing, this number may vary and could be several hundred megabytes), Process(MonitoringHost*)\% Processor Time. The last three counters in this list should consistently have values of approximately .020 (20 ms) or lower and should never exceed .050 (50 ms). This condition is caused by an overloaded OpsMgr system because the management server or database is too busy or is offline. Event Category: Transaction Manager Although you are able to clear the agent cache, this doesn't resolve the issue. Event Description: Management group "%1", Event ID: 1108 Exclude the agent cache from antivirus scanning. This number should be same as the number of agents or management servers that are directly connected to the gateway. All data that is received by the gateway and from the agents is stored in a persistent queue on disk, to be read and forwarded to the management server by the gateway Health service. Event Source: HealthService It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Cleanup: . 1. Verify that the system time is correct and re-issue the certificate if necessary%n Certificate Valid Start Time : %1%n Certificate Valid End Time : %2, Event Source: ESE If you locate the following specific events, follow these guidelines: Events 1102 and 1103: These events indicate that some of the workflows failed to load. During a configuration update burst (that is caused by MP import and discovery), the typical bottlenecks are, first, the CPU, and second, the OpsMgr installation disk I/O. 
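The disk latency thresholds quoted above (values reported in seconds, about .020 or lower as the target, 20 to 50 ms as slow, more than 50 ms as a serious I/O bottleneck) can be expressed as a short classifier. This is a minimal sketch: the function name `classify_disk_latency` and the "within target" label are assumptions made for illustration, and a reading of exactly 20 ms is treated here as within target.

```python
def classify_disk_latency(seconds: float) -> str:
    """Classify an average disk latency counter value given in seconds.

    Thresholds follow the text: up to 0.020 s is the target,
    above 0.020 s up to 0.050 s is slow, above 0.050 s is serious.
    The label for the lowest band is an assumption, not from the source.
    """
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds > 0.050:
        return "serious I/O bottleneck"
    if seconds > 0.020:
        return "slow, needs attention"
    return "within target"


if __name__ == "__main__":
    # Example readings, in seconds, as a performance counter would report them.
    for reading in (0.015, 0.030, 0.060):
        print(f"{reading:.3f} s -> {classify_disk_latency(reading)}")
```

Because the counters report seconds rather than milliseconds, the comparison is done directly in seconds to avoid unit mistakes.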
To do this, run the following command in SQL Query Analyzer: Drive letters that contain data warehouse or Ops and Tempdb files, Whether the antivirus software is configured to exclude SQL data and log files (Scanning SQL Server database files with antivirus software can degrade performance.) Event ID: 4506 Original product version: Microsoft System Center 2012 Operations Manager 1. Entity state change flow is stalled with pending acknowledgement. If this number is often greater than 60, a database insertion performance issue is occurring. If the gateways that report to a particular management server are unavailable, troubleshooting should start at the management server level. Typically, this would occur because of misconfigured Run As accounts. If Kerberos is being used, verify that the agent can communicate with Active Directory. An Account specified in the Run As Profile "%7" cannot be resolved. If % Idle Time is low, and the values for these two counters don't meet the expected throughput of the drive, engage the SAN vendor to troubleshoot. Disk sec/Write: The average time, in seconds, to write data to the disk. Event Source: OpsMgr Connector The SQL Server performance troubleshooting guide provides deeper insight into troubleshooting SQL Server performance. Additionally, run a simultaneous network trace between the agent and the management server while you reproduce the communication failures. How do you typically recover from this situation (for example, restart the agent health service, clear the cache, rely upon automatic recovery)?
Event Description: For more information, see Recommendations for antivirus exclusions that relate to Operations Manager. Consider the following conditions: Only a few agents are affected by the issue. Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them reached the failure limit that prevents automatic reload. OpsMgr database Write Action Modules(*)\Avg. During operational data insertion, the database disks are primarily used for writes. CPU spikes may occur during heavy partitioning activity (when tables become large and then get partitioned), the generation of complex reports, and large amounts of alerts in the database, with which the data warehouse must constantly sync up. However, the grooming of the alert and state change tables can be CPU-intensive for large tables. Amount of memory that is allocated to SQL Server, Whether SQL Server is 32-bit and is AWE enabled. Usually, the disks are performing few reads, except to handle manually generated Reporting views because these run queries on the data warehouse. Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the management server for database or data warehouse data collection write actions. Event Description: The OpsMgr Connector connected to %1, but the connection was closed immediately after authentication occurred. Event Source: HealthService Event Description: In these cases, the disks are mostly busy performing writes. Troubleshooting typically starts at the level immediately above the unavailable component. The General tab includes the SQL Server version, the Windows version, the platform, the amount of RAM, and the number of processors. If the OS is 32-bit and RAM is 4 GB or greater, check whether the /pae or /3gb switches exist in Boot.ini. The parent server of the agents is temporarily offline.
A setting of 0 is fine for servers that have eight or fewer processors. Data was dropped due to too much outstanding data in rule "%2" running for instance "%3" with id:"%4" in management group "%1". The OpsMgr Connector could not connect to %1:%2. If this number is 5,000, a data item burst is occurring. Event Description: Although you are able to clear the agent cache to help resolve the issue temporarily, the problem recurs after a few days. For more information, see How to use ADPlus.vbs to troubleshoot "hangs" and "crashes". In the OpsMgr console, these changes require an OpsMgr configuration and an MP redistribution to the agents. When % Idle Time is low (10 percent or less), this means that the disk is fully utilized. Event Source: HealthService Network outages caused a temporary communication failure between the parent server and the agents. A monitoring host is unresponsive or has crashed. Health Service Management Groups(*)\Send Queue % Used: The size of persistent queue. These options could be configured incorrectly if the server was originally installed by having 4 GB or less of RAM, and if the RAM was later upgraded. The Memory tab includes the memory that is allocated to SQL Server. Does this issue persist if you fail over these agents to another management server or gateway? These events typically indicate that performance issues exist on the management server or Microsoft SQL Server that is hosting the OperationsManager or OperationsManagerDW database: Event ID: 2115 RAID level (0, 1, 5, 0+1 or 1+0) for the drive that is used by the Health Service State, Whether battery-backed write cache is enabled on the array controller. Apply the appropriate hotfix to the affected operating systems. Event Source: OpsMgr Connector OpsMgr Connector\Open Connections: The number of connections that are open on gateway. Agents are flooding the management server with operational data, such as alerts, states, discoveries, and so on.
