A few months ago I started a new position as a contract Oracle DBA. This particular client had been without an Oracle DBA for about six months and, consequently, had developed definite performance issues over that time. Additionally, it had been several years since I last actively worked as an Oracle DBA and I really needed to exercise my skills while learning new ones.
The biggest problems I have run into have all been related non-existent Oracle procedures at the client site, very little or no documentation, and almost non-existent staff knowledge. Nobody could tell me, with certainty, what operational Oracle instances were installed, where they were located, what level of software they were, or what, if any, daily, weekly, or monthly procedures were in effect.
My skills were immediately put to the test when the client informed me, during my first week, they were moving their most active and primary instance to a new server in two weeks. Unfortunately they neglected to mention they were also upgrading the primary application for that instance. Because I had just arrived, my role was to support the SQL Server DBA who had engineered the move. They had already set up the Windows 2000 Advanced Server, installed Oracle 8.1.7.4 and now planned to restore a backup of the Oracle instance to the new server. Pretty straight forward, so far.
Apparently the user community had been complaining for quite some time about the database performance and the answer was to “throw hardware and memory at it”. The new server was the latest and greatest hardware with a newer and faster CPU and an additional 4G of memory for a total of 8G.
The restore of the database instance to the new server went smoothly and initial application testing worked without errors. However, on the next working day (we upgraded over the weekend), the “stuff” hit the fan about mid-morning. Transactions started bogging down and the result was database “gridlock”. Now begins the interesting part: as mentioned earlier, the internal Oracle environment knowledge is almost non-existent and I’m left trying to diagnose a sever problem without any tools or any scripts I could pull from a “bad of tricks” (remember, I had not been active in Oracle for several years). No one could tell me if perfstats had been installed or even if Oracle Enterprise Manager was installed an available. Fortunately I did remember some diagnostic skills and I started at square one.
The pressure was on and it was mounting quickly! I held my ground and then proceeded methodically, working my way through. I first checked the alert log and could find no errors popping out at me. A view of the UDUMP folder did not how anything either. Transactions were not abending, they were just taking a very long time to complete. I quickly cobbled together some SQL scripts to get data from the performance views. I still had many of my Oracle manuals and the internet now contained extensive source examples available. The first thing I discovered was the db file sequential reads wait event was through the roof. Why would this event suddenly appear? The new server had newer and better drives and they had been configured the same as the old drives. The OS performance monitor did not indicate any problems with them either. The next question that comes to mind then is: what else could cause the physical I/O to increase dramatically? Recall that not only had the instance been moved but he primary application had also just been upgraded to a newer release. Could that have been the problem?
A call was placed to the application vendor while I started investigating some of the SQL. Although I immediately identified several SQL which could benefit from further analysis and changes, there were not any which I felt were the cause of our current problem.
What else could dramatically increase the amount of total physical I/O? I started looking into what was happening in the SGA and found the db_block_buffers would benefit from increasing that number. Since we now had 4G more memory available, we decided, in the short term, to help alleviate the problem, we would increase the count and hopefully it would provide enough relief to continue our problem analysis. The first attempt to increase db_block_buffers was an immediate failure. The database could not start because it exceeded the available memory of 2G! By this time the users were screaming for their system.
I decided we needed to back down the count enough to get under the 2G mark while we researched why Oracle was suddenly limited to 2G on an Advanced Server with 8G of memory. I lowered the count and was able to start the database. Although the transactions were now processing more quickly the users still complained that “things were not processing as quickly as they used to”. It was only a matter of a another hour when we started getting complaints of transactions failing and users having to try multiple times to connect. After looking at the listener log, I found we had hit another memory constraint: because the SGA was now hovering near the 2G limitation, there was little memory left for user connections. After only about 80 connections the memory was exhausted and new connections were refused. The original db instance used to support about 280 connections on average. We really needed to find out why we were now limited to 2G on this server.
Bear in mind now, my previous Oracle experience was entirely on HP 9000′s and not on Windows. My knowledge of Windows Server was very limited. After some internet searching, I approached the Server administrator group and asked them to verify they had set the “3GB” switch in the boot.ini for that server. They had previously said the server was set up “exactly” like the previous one. Upon viewing the boot.ini, we found the 3GB switch had NOT been set! Without this parameter Oracle (and any other application) is limited to 2G of memory (this is an OS limitation not an Oracle limitation).
We then added the 3GB switch, rebooted the server, re-adjusted the Oracle memory parms in the init.ora, started the database, and then all was back to relative normalcy. The users were now completing their transactions in adequate time and we were able to support the same amount of connections as before.
After several days of normal execution we were finally confident the problems were all the result of not having implemented the “3G” parameter on the server. However, I did continue with some basic maintenance tasks to try and increase performance. Remember, this database had no maintenance applied in at least seven months. I was fairly confident that most indexes would benefit from a rebuild and that I would get further improvement by running an analysis for the entire database.
I also continued to research the 3G parameter to find out if and how I could take advantage of the entire 8G of memory available. Just for your information: the 3GB switch allows 3GB aware” applications to address 3G of memory. 1G is always reserved for the OS. This allows for the total use of 4G of memory. Because the OS is 32 bit, 4G is the physical limitation for addressable memory by the OS. However, there is another boot.ini switch, know as PAE, which allows the OS to “remap” portions of memory above the 4G limitation with portions of addressable memory. Utilizing this feature, I was able to dramatically increase the the amount of db_block_buffers to fully utilize the entire 8G of available memory.
The db sequential reads wait event rarely ever shows up in Oracle waits now and the instance is stable enough for me to start addressing the serious lack of documented procedures and to acquaint myself with all the operational Oracle instances.

powered by performancing firefox

Leave a Reply

You must be logged in to post a comment.