|Performance Issues While Booting Multiple Scheduler Domains
1. Error Messages
3. Different Scenarios
|1. Error Messages
It has been reported in many cases that when multiple App Server or Process Scheduler domains run on a single machine, the Process Scheduler sometimes does not boot.
Users may see errors similar to these after starting several domains on the same machine:
085736.PEOPLESOFT!JSH.4040: JOLT_CAT:1008: ERROR: Could not establish listening address on network
085736.PEOPLESOFT!JSL.3760: JOLT_CAT:1079: ERROR: Error starting minimum number of handlers
122142.PEOPLESOFT!PSPUBHND.3198: LIBTUX_CAT:250: ERROR: tpsvrinit() failed
122142.PEOPLESOFT!tmboot.1596: 12212002: TUXEDO Version 6.5 32-bit Windows.
122142.PEOPLESOFT!tmboot.1596: tmboot: CMDTUX_CAT:827: ERROR: Fatal error encountered; initiating user error handler.
074704.sifd!PSRCCBL.29548: LIBTUX_CAT:271: ERROR: System lock semop failure, key = 54116 (errno = 28)
074704.sifd!PSRCCBL.29548: LIBTUX_CAT:248: ERROR: System init function failed, Uunixerr = : semop: No space left on device
125432.elerpfa11!BBL.8918.1078784256.0: LIBTUX_CAT:271: ERROR: System lock semop failure, key = 217338 (errno = 22)
125432.elerpfa11!BBL.8918.1078784256.0: LIBTUX_CAT:268: ERROR: Failed to stop serving
These performance issues occur with Process Schedulers running on both Windows and UNIX operating systems.
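When many domains log at once, it can help to pull the catalog name, message number and errno out of each line. The following is a minimal Python sketch, not a delivered tool. The line layout it assumes (time, host, process, pid, optional tool prefix, catalog, number, severity) is inferred only from the sample lines above, and the function name parse_ulog_line is hypothetical.

```python
import re

# Assumed layout, inferred from the sample lines in this section:
#   HHMMSS.host!process.pid[.thread.ctx]: [tool: ]CATALOG:number: SEVERITY: text
ULOG_LINE = re.compile(
    r"^(?P<time>\d{6})\.(?P<host>[^!]+)!(?P<process>[A-Za-z_]+)\.(?P<pid>\d+)(?:\.\d+)*: "
    r"(?:\w+: )?(?P<catalog>[A-Z_]+_CAT):(?P<number>\d+): "
    r"(?P<severity>[A-Z]+): (?P<text>.*)$"
)
ERRNO_RE = re.compile(r"errno = (\d+)")


def parse_ulog_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = ULOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["pid"] = int(entry["pid"])
    entry["number"] = int(entry["number"])
    # Semaphore failures carry the operating system errno in the message text.
    errno_match = ERRNO_RE.search(entry["text"])
    entry["errno"] = int(errno_match.group(1)) if errno_match else None
    return entry
```

Lines that carry no message catalog, such as the TUXEDO version banner, return None and can simply be skipped.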
On Windows, fixing this problem requires changing certain values in the Registry and restarting the machine.
1. Run the Registry Editor (regedit).
2. Under the HKEY_LOCAL_MACHINE subtree, go to the following subkey:
The default data for this value will look something like the following (all on one line):
SharedSection=1024,3072 Windows=On SubSystemType=Windows
Make the following change to this value:
Scan along the line to the part that defines the SharedSection values and add ",1024" after the second number.
The value should now look something like the following:
SharedSection=1024,3072,1024 Windows=On SubSystemType=Windows
Some machines may already have a third value and sometimes a fourth, like this:
Only the third value, counting from the left, needs to be increased; in the example above, increase the 512 next to 3072 to 1024.
After making this change, close regedit and restart the server.
If this does not resolve the issue, increase the value to 2048, restart the server, and try again.
Keep increasing the value in steps of 1024, up to 10240, until all domains boot successfully.
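The SharedSection edit described above is a small string manipulation, sketched here in Python for illustration. This is only a sketch under stated assumptions: it works on a copy of the value text and does not touch the Registry, and the helper names set_third_shared_section and candidate_sizes are hypothetical.

```python
import re


def set_third_shared_section(value, new_size=1024):
    """Return the value text with the third SharedSection number set to new_size.

    If only two numbers are present, new_size is appended as the third one.
    Any fourth number is left unchanged.
    """
    match = re.search(r"SharedSection=([\d,]+)", value)
    if match is None:
        raise ValueError("no SharedSection entry found")
    numbers = match.group(1).split(",")
    if len(numbers) < 2:
        raise ValueError("expected at least two SharedSection numbers")
    if len(numbers) == 2:
        numbers.append(str(new_size))
    else:
        numbers[2] = str(new_size)
    # Splice the new number list back in, keeping the rest of the line intact.
    return value[:match.start(1)] + ",".join(numbers) + value[match.end(1):]


def candidate_sizes():
    """Sizes to try in turn: 1024, 2048, ... up to 10240, as described above."""
    return list(range(1024, 10240 + 1, 1024))
```

Each candidate size still requires a restart before the domains can be retested.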
In one instance where these values were set to 1024, 4096, 4096, a significant performance delay occurred at the Jolt layer.
The delay was from 7 to 30 seconds.
There was only one Appserver running on the server.
Reducing the parameters to 1024, 4096, 1024 resolved the problem.
In a different scenario, the Windows (registry) sub-key did not exist at all.
Once it was added, the problem was resolved.
On UNIX, these error messages appear when the machine is running out of resources, particularly at the kernel parameter level.
Check the parameters SEMMNS, SEMMNI, SEMMNU, SEMMAP and SEMUME with the System Administrator; one of those limits may have been reached.
The values of these parameters need to be increased according to the settings below to resolve the issue.
Reboot the machine after making these changes.
SEMMNS: Maximum number of semaphores in the system.
Each structure uses 16 bytes.
This parameter should be set to semmni x semmsl.
The default is 60; the maximum is 2 GB.
SEMMNI: Maximum number of system-wide semaphore sets.
Each control structure consumes 84 bytes.
The default is 10; the maximum is 65536 (64 KB).
SEMMNU: Maximum number of undo structures in the system.
This should be set to semmni so that each control structure has an undo structure.
The default is 30; the maximum is 2 GB.
SEMMAP: The number of entries in the semaphore map.
This should never be greater than semmni.
If the number of semaphores per semaphore set used by the application is "n", then set semmap = ((semmni + n - 1)/n)+1 or more.
Alternatively, semmap can be set to semmni x semmsl.
An undersized semmap leads to "WARNING: rmfree map overflow" errors.
The default is 10.
SEMUME: Maximum number of undo structures per process.
This should be set to semopm times the number of processes that will be using semaphores at any one time.
The default is 10; the maximum is 2 GB.
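The sizing rules above can be collected into one calculation. The Python sketch below is illustrative only. It assumes integer division in the semmap formula, and it takes semmsl and semopm (not defined in this note) in their usual System V meanings of maximum semaphores per set and maximum operations per semop call. The function name is hypothetical.

```python
def suggest_semaphore_settings(semmni, semmsl, semopm, processes, sems_per_set):
    """Apply the sizing rules from this section to a set of inputs.

    semmni       -- number of semaphore sets
    semmsl       -- assumed: maximum semaphores per set
    semopm       -- assumed: maximum operations per semop call
    processes    -- processes using semaphores at any one time
    sems_per_set -- "n", the semaphores per set used by the application
    """
    return {
        # semmns = semmni x semmsl
        "semmns": semmni * semmsl,
        # one undo structure per control structure
        "semmnu": semmni,
        # semmap = ((semmni + n - 1)/n)+1, integer division assumed
        "semmap": (semmni + sems_per_set - 1) // sems_per_set + 1,
        # semume = semopm x number of processes
        "semume": semopm * processes,
    }
```

The results are starting points to review with the System Administrator, not values to apply blindly.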
|3. Different Scenarios
1. Message LIBTUX_CAT:271
This message will occur if a process was about to perform a semaphore operation when the IPC resources were removed.
2. Message LIBTUX_CAT:666 ERROR: Message operation failed because the queue was removed
3. Message LIBTUX_CAT:669
This message will occur when a process attempts to send a message to a queue that has already been removed.
This generally means that a server was processing a client request when the client's queue was removed and the server then tried to send a reply to that queue.
(One can get the same message without removing IPC resources if a client times out waiting for a long-running service to complete and exits the system, while the server sends the reply after the client has exited and its queue has been removed.)
This message also could occur if a client attempts to send a new request to a queue that has been removed.
In these situations, check the message queues or contact BEA TUXEDO Technical Support.
See Also intro(2), msgsnd(2), msgrcv(2), msgctl(2) in UNIX system reference manuals.
|Performance Issues At Tuxedo Level
In certain scenarios, a shortage of available resources at the Tuxedo level leads to performance issues at the Scheduler level.
In these situations, increase the number of resources available to Tuxedo.
First, open the BEA Tuxedo administration panel: Start > Settings > Control Panel > BEA Tuxedo8.1 Administration > IPC Resources tab.
- Uncheck Use Default IPC Settings
- Click the small box next to the IPC Resources label. When the cursor is over the box, it shows New (Insert)
- Type the word Custom and press Enter
- The fields on the right change from greyed out to white
- Change Maximum Number of Message Queues to 1024
- Change Maximum Number of Processes Using IPC to 512
- Change Maximum Number of Semaphores to 2048
- Change Maximum Number of Semaphore Sets to 2048
- Change Maximum Number of Semaphore Undo Structures to 2048
- Click Apply
The Tuxedo IPC Helper Service will then need to be restarted.
Make sure to shut down all domains and schedulers that are maintained by Tuxedo before making this change.
The equivalent navigation paths for Tuxedo 9.1 and Tuxedo 10.3 are:
Start > Settings > Control Panel > BEA Tuxedo9.1 Administration > IPC Resources tab
Start > Settings > Control Panel > Oracle Tuxedo Administration 10gR3 with VS 2005 > IPC Resources tab
|Steps to free up abandoned IPC resources
Tuxedo 8.1, 9.1 and 10.3
Follow these steps to cleanly and completely free up abandoned IPC resources.
Perform these steps for each Application server/Process Scheduler domain configured on the machine, not just for the domains that are up and running.
1. Issue a normal shutdown for each domain by selecting Administer Domain/Domain Shutdown Menu/Normal Shutdown.
If the shutdown process hangs, then attempt a forced shutdown.
2. At a UNIX prompt issue this command, (minus the quotes): 'ps -ef | grep BBL'.
This command will show the number of Tuxedo Domains that are up and running and under which Unix ID.
The '-U' parameter of the BBL will show which Application server configuration file, PSTUXCFG, was read in when booting this domain.
If this information is not truncated on your display, you may be able to determine which domain that BBL is booted for.
Note the UNIX user ID and group name that started the BBL processes.
3. If any BBL or other domain server processes are still running, terminate them with the UNIX 'kill' command.
Terminate the BBL first, then any remaining processes.
Use 'kill -15 pid' first and, if the process does not stop, use 'kill -9 pid'.
4. Tools 7.01 and higher deliver a script called $PS_HOME/appserv/ipcrmall.sh, which is used to free up abandoned IPC resources.
If there is a partially started domain, a non-responding domain, or you have had to manually kill a Tuxedo domain or process, you may have abandoned IPC resources that will need to be released.
Running this script will allow you to free up all IPC resources that were allocated to a particular UNIX ID/group ID combination.
1. Ensure the COBOL License Manager is not running under the same UNIX ID as any of the domains you are trying to stop.
If it is, stop the License Manager before running the ipcrmall.sh script.
2. Also shut down any Process Schedulers on this machine that are started as Tuxedo domains under the same UNIX account as the application servers before running the scripts below.
Example ipcrmall.sh syntax, where the UNIX ID is 'psoft' and the UNIX group is 'psgroup' for the ID that booted the Appservers:
$. ./ipcrmall.sh psoft psgroup
The output of this script will be a new script called 'killipc.sh'.
Execute killipc.sh, to actually free up these abandoned resources.
Both ipcrmall.sh and killipc.sh are built to find the Korn shell at /bin/ksh.
If this is not the location of your Korn shell, then you may have to edit the first line of each of these scripts.
PeopleTools 8.50 introduced an option within the PSADMIN menu to clean IPC resources.
This option is available under the "PeopleSoft Process Scheduler Administration" menu:
10) Clean IPC resources of a Process Scheduler Domain
Use the Clean IPC Resources of this domain option to clear the interprocess communication (IPC) resources utilized by a domain.
When a domain shuts down normally, the IPC resources it was using are released as part of the shutdown process.
However, if a domain terminates abnormally, in many cases the IPC resources are still assigned to the previous domain instance.
This option enables you to clean any orphaned IPC resources assigned to a domain.
5. Issue 'ipcs | grep UNIX_ID', where UNIX_ID is the ID used to boot the Appservers (for example, psoft). The resulting list should now be empty.
6. Issue 'ps -ef | grep UNIX_ID', substituting the same UNIX ID that you used in step 4 for UNIX_ID.
This displays the processes started by this ID.
The processes relevant to the Appservers are:
BBL, WSL, WSH, JSL, JSH, JREPSRV, PSAUTH, PSAPPSRV, PSSAMSRV, PSQCKSRV, PSAPISRV and PSWATCHSRV.
None of these processes should be listed as running under that UID/GID combination.
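The check in step 6 can be automated by scanning captured 'ps -ef' output for the process names listed above. The Python sketch below is a hedged illustration. It assumes the common 'ps -ef' column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD) with a single-token STIME, which does not hold on every platform or for long-running processes. The function name is hypothetical.

```python
APPSERVER_PROCESSES = {
    "BBL", "WSL", "WSH", "JSL", "JSH", "JREPSRV", "PSAUTH",
    "PSAPPSRV", "PSSAMSRV", "PSQCKSRV", "PSAPISRV", "PSWATCHSRV",
}


def leftover_domain_processes(ps_output, unix_id):
    """Return (pid, name) pairs for domain processes still owned by unix_id.

    Any BBL entries are placed first, since the BBL should be terminated
    before the remaining processes.
    """
    leftovers = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD [args...]
        if len(fields) < 8 or fields[0] != unix_id or not fields[1].isdigit():
            continue
        # Strip any directory prefix from the command name.
        name = fields[7].rsplit("/", 1)[-1]
        if name in APPSERVER_PROCESSES:
            leftovers.append((int(fields[1]), name))
    # Stable sort: BBL entries (key False) come before everything else.
    leftovers.sort(key=lambda item: item[1] != "BBL")
    return leftovers
```

An empty result corresponds to the expected state after the cleanup: no domain processes remain under that ID.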
7. Attempt to reboot the Appservers/Schedulers, one at a time.
If an Appserver/Process Scheduler fails, examine the TUXLOG.mmddyy and/or APPSRV.LOG for more information.
1. On Windows, kill the BBL, then any other server running for the domain, and then stop and restart the 'Tuxedo IPC Helper' control panel service.
Leave the service stopped for approximately a minute before restarting it.
|Performance Issues While Running Different Process Types Via Process Scheduler
In some scenarios, processes are reported to remain in "Queued" status on the "Process Monitor" page for a long time.
In these situations, try the suggestions below.
1. Make the following changes, shut down your scheduler, clear the cache, and reconfigure the scheduler before restarting.
In the Server definition:
Max API Aware: 7
App Engine Max Concurrent: 3
In the psprcs.cfg:
PSAESRV Max Instances: 5
Each Scheduler server has a Max API Aware number that caps its concurrent tasks.
Each Process Type has its own Max Concurrent value.
For example, suppose Max API Aware = 5.
If you launch 3 processes of each of three process types (COBOL, AE and SQR), 9 processes are requested.
The Scheduler never runs more than Max API Aware processes at a time, and it fills the available slots with as many processes as fit within each process type's Max Concurrent value. So perhaps 3 COBOL processes and 2 AE processes start, while the remaining AE process and the 3 SQR processes stay in Queued status until a slot is freed.
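The interaction between the two limits can be illustrated with a small simulation. The Python sketch below is a simplified model, not the Scheduler's actual algorithm: it assumes requests are considered strictly in arrival order, ignores priorities, and assumes a Max Concurrent of 3 for every process type. The function name is hypothetical.

```python
from collections import Counter


def schedule(requests, max_api_aware, max_concurrent):
    """Split requests into those that start and those that stay Queued.

    requests       -- process type names, in arrival order
    max_api_aware  -- server-wide cap on concurrently running processes
    max_concurrent -- dict mapping process type to its own cap
    """
    running, queued = [], []
    per_type = Counter()
    for process_type in requests:
        has_server_slot = len(running) < max_api_aware
        has_type_slot = per_type[process_type] < max_concurrent[process_type]
        if has_server_slot and has_type_slot:
            running.append(process_type)
            per_type[process_type] += 1
        else:
            queued.append(process_type)
    return running, queued


# The example from the text: 3 requests of each type, Max API Aware = 5.
requests = ["COBOL"] * 3 + ["AE"] * 3 + ["SQR"] * 3
limits = {"COBOL": 3, "AE": 3, "SQR": 3}  # assumed per-type values
running, queued = schedule(requests, 5, limits)
```

With these inputs, running holds 3 COBOL and 2 AE entries and queued holds 1 AE and 3 SQR entries, matching the scenario described above.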
2. Check the PSPRCSRQST and PSPRCSQUE to ensure they contain the same number of rows.
This can be found by running these SQL statements against the database:
SELECT COUNT (*) FROM PSPRCSRQST
SELECT COUNT (*) FROM PSPRCSQUE
SELECT COUNT(*) FROM PS_MESSAGE_LOG
If the count comes back out of sync, run these SQL statements to determine the rows that are out of sync (Oracle):
SELECT PRCSINSTANCE FROM PSPRCSRQST WHERE PRCSINSTANCE NOT IN
(SELECT PRCSINSTANCE FROM PSPRCSQUE)
SELECT PRCSINSTANCE FROM PSPRCSQUE WHERE PRCSINSTANCE NOT IN
(SELECT PRCSINSTANCE FROM PSPRCSRQST)
Delete any rows that are out of sync from the PSPRCSRQST and PSPRCSQUE tables.
Delete any orphaned older instances that are in the runstatus of "Processing", "Queued", "Initiated" etc.
Please see the attached document "Run Status.Doc", for the numeric values associated with different Run statuses.
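To see what the two out-of-sync queries return, they can be tried against throwaway data first. The Python sketch below uses an in-memory SQLite database purely as a stand-in for the real database. The single-column tables and the instance numbers are invented for illustration, and the function names are hypothetical.

```python
import sqlite3


def demo_connection():
    """Build tiny stand-in tables; the real tables have many more columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE PSPRCSRQST (PRCSINSTANCE INTEGER)")
    conn.execute("CREATE TABLE PSPRCSQUE (PRCSINSTANCE INTEGER)")
    # Invented sample data: 101 exists only in PSPRCSRQST, 104 only in PSPRCSQUE.
    conn.executemany("INSERT INTO PSPRCSRQST VALUES (?)", [(101,), (102,), (103,)])
    conn.executemany("INSERT INTO PSPRCSQUE VALUES (?)", [(102,), (103,), (104,)])
    return conn


def find_out_of_sync(conn):
    """Return (instances only in PSPRCSRQST, instances only in PSPRCSQUE)."""
    only_rqst = [row[0] for row in conn.execute(
        "SELECT PRCSINSTANCE FROM PSPRCSRQST WHERE PRCSINSTANCE NOT IN "
        "(SELECT PRCSINSTANCE FROM PSPRCSQUE) ORDER BY PRCSINSTANCE")]
    only_que = [row[0] for row in conn.execute(
        "SELECT PRCSINSTANCE FROM PSPRCSQUE WHERE PRCSINSTANCE NOT IN "
        "(SELECT PRCSINSTANCE FROM PSPRCSRQST) ORDER BY PRCSINSTANCE")]
    return only_rqst, only_que
```

The instance numbers returned by each query are the candidates for the cleanup described in this step.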
3. Check if the Log_output folder has enough space to enable the next process to save any associated logs to the directory.
Purge the LOGS and log_output directories for all Process Schedulers down to a reasonable level.
The directory's contents should be small enough that the Windows file manager does not display the flashlight when navigating to the directory, or is not delayed significantly accessing the directory.
4. Check whether any anti-virus software is running on the Scheduler machine.
If yes, please turn it off and then test.
5. Delete Process Scheduler cache and reboot the domain.
6. There may be insufficient Java Heap for the Process Schedulers.
Tune the psprcs.cfg configuration file for all Schedulers to include the following line, first ensuring there are enough resources available on the server.
After making this change the process scheduler must be re-configured using the Psadmin utility.
JavaVM Options=-Xmx256m -Xms128m
See this article for additional information related to JVM options:
E-BI: Getting sporadic ETIMEOUT errors during credit card or echeck processing (Doc ID 662367.1).
7. There could be a corruption in one of the Scheduler's components, (psprcs.cfg, PSTUXCFG, or other file), which could be resolved by recreating the scheduler domain.
When recreating the Scheduler domain, recreate and configure the psprcs.cfg file from scratch using the PSADMIN utility.
Do not copy this file from an old domain as it is possible to apply the corruption to the new domain.
8. Check whether CPU and Memory Utilization have been configured under:
PeopleTools > Process Scheduler > Server definition.
If these values have not been configured, this can explain periods of inactivity for the Process Scheduler server and Distribution Server agents.
PSPRCSRV.3604 (0) [06/27/10 02:28:30 PSAPPSERV@MACHINENAME](3) CPU Threshold Setting : 75 percent
PSPRCSRV.3604 (0) [06/27/10 02:28:30 PSAPPSERV@MACHINENAME](3) Memory Threshold Setting : 75 percent
PSPRCSRV.1872 (0) [06/27/10 02:24:18](3) Server: PSNT1 checking status...
PSPRCSRV.1872 (0) [06/27/10 02:24:18](0) Server: PSNT1 processing is suspended
PSPRCSRV.1872 (0) [06/27/10 02:24:18](3) Server action mode: Suspending/Suspended
PSPRCSRV.1872 (0) [06/27/10 02:24:18](3) HeartBeat alarm on. Checking server state...
In these scenarios, this error message can appear in the Tuxedo log for the problem Scheduler domain.
LIBTUX_CAT:577: ERROR: Unable to register because the slot is already owned by another process