Have you ever just sat there watching all of the processes running against the database, just waiting for something to happen? Of course not! We are database administrators, architects, developers and we don’t have time for that! We also might get bored wondering why there seems to be millions of the same query running against the database – a discussion for another time. However, there would be reasons that we are watching processes or statements running in against the database and it is normally because something was alerted or the phone rang asking, “why is the database so slow”.
The big question is what should we be alerted on to try to beat the phone call to a solution, or what are some of the critical issues we still have to deal with in more automated environments? Even with monitoring tools, what are the alerts we need and the thresholds to set them at? There are several things that I can think of to monitor, but there have been a select few that have caused the database to be down or not available no matter what version or option.
The monitoring can be done with tools, such as Oracle Enterprise Manager or other monitoring tools, or scripts, but the list is short and there are specific things to look for to prevent the outage or issue. Here is my short list to have monitored on any database:
The listener being available is obvious for new connections coming into the database. Maybe the listener is now in the grid home or the database home but if it isn’t up and running there is no access to the database. The listener is easy to check if it is available at the OS level with checking for tns process. Also using lsnrctl command line will give status of the listener and the databases that it is listening for. The log will be useful for troubleshooting connectivity problems and refusals. Here using OEM to alter on if the listener is available or using scripts to monitor the process to make sure they are running. Logging into the database could be part of another test and error messages would be similar to ORA-12560: TNS:protocol adapter error or other ORA – : TNS errors. However, not all TNS errors are listener issues which lead to the next item.
Hitting max processes can either cause a login to hang which makes it sometimes difficult to monitor, especially if you are logging into the database to monitor it. It can also give a TNS error, which can mislead you for a little while until looking through the alert log and finding the ORA-00020: maximum number of processes error in the log. Here a threshold is useful to monitor reaching a percent of max processes, so when it hits 80% there is time to research what is going on to prevent lock out of the database. When hitting the threshold, killing processes is an option to prevent reaching the max processes, because when needing to change the parameter will require a restart of the instance. Adjusting the processes parameter is not always the solution for this issue, but it might provide a Band-Aid to look at options and figure out why and where the processes are coming from.
3. Archive log and recovery area
Archive log space will cause the system to hang and give the dreaded ORA-00257: archiver error. Connect internal only, until freed error. If using file system space, monitoring the space available on the mount point is important. Setting thresholds to monitoring when 80 or 85% of space used is reached should allow time to back up the archive logs and remove them to free up space. When using the recovery area, the alert log will have warnings based on the thresholds that Oracle has set to monitor the recovery area. These are set at 85%, with a critical alert at 97%. Using the alert log could monitor these thresholds, or using v$flash_recovery_area_usage to monitor.
Space can be easily added to the recovery area by the db_recovery_dest_size parameter dynamically. This is an easy issue to prevent with properly sizing the archive log space and taking care of the backups of the archive logs, which makes this a frustrating error to run into and makes monitoring this area even more important.
4. Space, tablespace and filesystem
More space issues that can cause issues, but these space issues may not cause a system wide outage like the issues listed already; this depends on how tablespaces are distributed. The system tablespaces, such as SYSTEM, TEMP, UNDO, SYSAUX, can cause more issues but are fairly easy to monitor and set thresholds in the monitoring by percent full. The trick here with all of the tablespace is accounting for auto extend, and if you have very large tablespaces, the percent thresholds might not be appropriate, and just provide alert noise if 10% free still has 100 to 200 GB free.
Being able to monitor both percent and size is useful, and I have normally just setup some shell scripts to do this, which then can be used in the monitoring tools. Also with the auto extending of the tablespace the secondary problem is the file system space, and that is easy enough to monitor with OS tools or monitoring tools. Another good thing to monitor to be proactive about space is monitor based on the auto extending and whether that could grow past the size of the file system.
5. Database availability
Whether the database is up or down is probably very obvious thing to monitor for and may seem simple too. You can check for smon, pmon and other processes to make sure that they are running, and this will work if you don’t run into other issues, like archive space, processes or tablespace issues. Even if the processes are available, the login to the database will either hang or show another error, so actually logging into the database and a simple select will show if access to the database is available.
This is by far not a complete list of things to monitor for, but they seem like issues that pop up every time they are not being monitored. Most of these do cause issues that will cause the database instance to hang or not allow any new connections. They are simple things to monitor for in available tools or simple scripts and they have their little tricks to provide the right level of proactive monitoring.