Gathering Information for RDBMS Performance Tuning issue

Gathering Information for RDBMS Performance Tuning issue

ATHERING INFORMATION FOR RDBMS PERFORMANCE TUNING ISSUE

What is Performance?


Performance could mean different things to different users, for example a user accessing a web page will be interested in seeing the first page of the results as soon as possible (response time) and the processing for the other pages can carry on in the background. But a user submitting a job will be interested in seeing that the complete job completes as soon as possible, He is not interested in seeing a result page but the database should provide the best throughput (not response time). Some other applications may have some other definitions of what may consider good performance.

Problem Definition

Hence it becomes extremely important we are aware of the user expectation beforehand. The common questions those are always useful to ask go something like this:
  • What is the exact problem?
  • What is the expectation of the user?
  • Has the results asked for have ever been achieved before?
This list goes on (dealt in detail below based on problem type), but the idea is it to get a clear understanding of the problem. It is very important to quantify the problem so that we know when we have achieved the expected results.

Why did the performance problem occur?

This question is most relevant in databases, which were in production and performing well for sometime. This may not be applicable to the database that is in development or pre-production stages in which we are defining the performance benchmarks. This question is one of the first to be posted in the performance TAR by the users and one of the most important to be answered.

The understanding is if the system was performing well at one time then for it to suddenly start performing badly will require some inputs. Sometime this input is most obvious and specified by the user themselves like the following:
  • Upgrade of the database or application of any patch or patchset
  • Change of hardware
  • Change of platform
  • Recollection of statistics
  • Increase in user or data load
  • Schema changes
But sometimes this becomes the whole mystery of solving the TAR of why this happened. In such cases it becomes very important that we keep probing users of what they done have in the recent past, many times it so happens that the user may not have told us something that he found not relevant in this problem, but holds the answer to everything else.

Collecting Information for the Performance Problem

The problem type of tars generally has the following problem statement:
“Something is slow” or “Something is Hung” or “Query is not giving expected results”.
It is very important to understand what is slow or hung. It could be an application, some user sessions, some sql queries or the complete database itself.

NOTE: Oracle suggests customers that are submitting information via a Service Request (SR), always submit a RDA (Note 314422.1) with all performance related issues.

Common Set of Questions for all Kind of tuning issues:

  1. Instance or instances in which it can be observed. Does it happen in Production? Development? Test? All of them? If it does not happen in all your environments, document what is different among them. Try to determine if the difference among them can be isolated. This may narrow the problem to a very specific area.
  2. Determine if the poor performance is equally observed when the system load is either low or high. What if there is only one active user? Is it equally slow? Document if the poor performance keeps any relation to the system load, or not at all.
  3. Can you reproduce the issue systematically? Document how can the issue be systematically reproduced. If you can't, document why not. In any case, try to reproduce it systematically and document your findings.
  4. Do you have a workaround available? What is it?
  5. Determine if any software has been installed or upgraded in your server or instance since the problem started. 
  6. Document any patches applied recently that could have affected your performance. Include Operating System patches as well as patches from RDBMS or Apps.
Now we will take the above as different cases and depending on which is the case we will collect different information:

1) Some query is slow:

a. Since when is it slow

b. Was it ever running faster than this

c. If yes, what was changed
i. Any patches applied
ii. Stats regathering
iii. Different dataload
iv. Any init.ora changes
v. Any hardware changes (Disk, CPU etc)
vi. Any database version change
vii. Any schema changes (index add, rebuild, constraints)
d. If we have any database where the query is still performing well then:
1. Get the 10046 trace of the query from both the backup and the current system.

2. In case it is a change of plan then the 10053 traces from both the systems will also help.

ii. If we do not have any database where the query is still performing well then it becomes difficult, but we can check the following:

1. Do we have any old trace of the query when running fine?

2. Collect the 10046 trace and find the worst bottleneck for the query

3. If this is a problem with a change in plan due to a particular index not being used then 10053 trace will also help.

4. The plan of the query using the utlxpls/utlxplp scripts to understand the predicate application.
e. Potential Solutions: If the user reports a change in plan of query then the following can be used to solve/diagnose the issue:
i. Check if hinting the query is possible. If yes, then you can try with combinations. The usual hints used are:

1. INDEX
2. FULL
3. PARALLEL
4. NO_PARALLEL
5. NO_UNNEST

ii. See if we have the old stats for the relevant tables and indexes anywhere. We can restore them to the current system and check if the plan reverts back.

iii. If we know the good plan and using some init.ora parameters we can force the plan then we try to store Outlines (SQL Profiles 10g) for the query.

2) Application is slow:

a. Most of points discussed in point 1 are relevant here.

b. It is important to understand kind of application it is:
i. Which language is it coded in.
ii. What is the method of connection to the database : JDBC, ODBC, OCI etc
iii. What is the middleware technologies used.
iv. Is there any webserver or application server acting as an intermediatary.
v. Is there any connection pool of any kind.
vi. Is it using a dedicated session (at database level) that is being used by the application user.
vii. How are we sure that we are not spending much time in any other layer other than the database
1. Have we eliminated the network contention?

2. Any problem at the webserver or middleware.

3. Any settings changed in the recent past

c. We generally go about diagnosing these kinds of performance issues by tracing the session at the database level with the 10046 trace. But the following may be an exception to this case:
i. There is a connection pool at the middleware and we are not sure which database session the application is going to use
ii. There is shared server configuration being used at the database level so we are not sure what server session is going to be used.
iii. The application uses multiple slave sessions to accomplish its work.
d. We have to rely on application logging of some kind to understand where exactly we are spending time.

e. If the application is able to identify any particular module or sql query where we are spending time then we can at any point of time query v$sqlarea or v$session and find the session executing the query and trace that session.

3) Database is slow : This is the most common scenario

a. The first question in this case is also, “what has changed”
i. Any patches applied
ii. Stats regathering
iii. Different dataload
iv. Any init.ora changes
v. Any hardware changes (Disk, CPU etc)
vi. Any database version change
vii. Any schema changes (index add, rebuild, constraints)
b. The next is to gather statspack snapshots or awr (in 10g) at regular intervals of 30 mins each.
i. Generate a statspack report/AWR between multiple snapshots that were taken no more than 60 minutes apart from each other during the performance problem.

ii. But to effectively and quickly diagnose the issue always check if the customer has a statspack snapshot of the time when the database was running fine under similar workloads.

1. This works well as this helps in identifying what is different in this statspack compared to the old one.

2. The top event need not be culprit always. It is possible a new query or wait has cropped up that is causing this entire imbalance.

3. We can compare the workload to see if they actually match.

iii. If the old statspack is not available then check if the performance problem is occurring in any one of these circumstances:

1. A certain period of time during the day
2. A particular application runs
3. A particular user logs on into the system
4. When we run backup
5. When the second rac instance comes up
6. When we run any other application at the OS level
7. We have any cron job scheduled at that time.
c. Important to gather the alert log of the database to see if there are any messages reported here.

4) Database is hung:

a. To see if this is an actual hang or not.
i. Take some systemstate dumps at regular intervals and see if the sessions are moving forward
ii. Check if oracle is using very high CPU. This could be a spin situation and a truss of the process may be very helpful in this circumstances.
b. If this a hang then if we can establish a sys connection with the database then do the following:
i. Check the events for which the database sessions are waiting for in the view v$session_wait
ii. Take 2-3 systemstate dumps at an interval of 2-3 mins.
iii. Take a hanganalyze reports.
c. If this a hang of only some database sessions then we can do the following:
i. Set 10046 traces on the session.
ii. Take processstate dump
iii. The section b output above will also be helpful.

5) Query Giving Wrong results:

a. This is most likely a Bug.

b. Verify if what according to user is expected result is actually the expected result based on the query.

c. It is always useful to get a test case in such scenarios.

d. Wrong results usually happen due the application of query predicates at wrong places. So it is very useful to get the predicate information from utlxpls script.



For a more comprehensive analysis, refer to master notes below

Master Note: How to diagnose Database Performance - FAQ (Doc ID 402983.1)
Master Note: Query Tuning Overview (Doc ID 199083.1)

REFERENCES

NOTE:199083.1 - * Master Note: SQL Query Performance Overview
NOTE:210014.1 - How to Log a Good Performance Service Request
NOTE:314422.1 - Remote Diagnostic Agent (RDA) 4 - Getting Started
NOTE:402983.1 - * Master Note: Database Performance Overview
NOTE:61552.1 - Questions and Diagnostics to help Investigate Database Hanging Issues

=================================================================

Note---> This informationmation taken from oracle metalink. all copy rights oracle only.


Comments

  1. Great piece of information . Could you please share a doc on how to read hang analyze and system state dump .

    ReplyDelete

Post a Comment

Oracle DBA Information