Category: WebMethods

The curious case of the MIA work orders?

F5 - Redirect users to a maintenance page

Working in IT, we deal with strange issues all the time. However, every once in a while, something comes up that leaves us scratching our heads for days. One such issue happened to us a few years back. It came back to me recently, and this time I thought I should note it down.

The issue was first reported when users raised a ticket about missing work orders in TechnologyOne, the Finance Management System used by our client. Without work orders created in TechOne, users cannot report actual labour time or other costs, so this was treated as a high-priority issue.

F5 maintenance page for integration should not have HTTP 200 OK status

TechOne is integrated with Maximo using WebMethods, an enterprise integration platform. Compared with direct point-to-point integration, these problems are usually easy to deal with when an enterprise integration tool is involved: we look at the transaction log, identify the failed transactions and what caused them, fix the issue, and resubmit the message. All good integration tools have these fundamental capabilities.

In this case, we looked at WebMethods’ transaction history and couldn’t find any trace of the missing work orders. We also spent quite some time digging through the log files of each server in the cluster but couldn’t find anything relevant. That made sense in a way: if there had been an error, it would have been picked up, and the system would have raised alarms and email notifications through the several overlapping monitoring channels we had set up for this client.

On the other hand, when we looked at Maximo’s Message Tracking and log files, everything looked normal, with work orders published to WebMethods correctly and without interruption. In other words, Maximo said it had sent the message, while WebMethods said it never received anything. This left us in limbo for a few days. And of course, when we had no clue, we did what we application people do best: we blamed the network guys.

The network team couldn’t find anything strange in their logs either. So the issue slipped for a few days without any real progress. During this time, users kept reporting new missing work orders, not knowing that I wasn’t really doing any troubleshooting work; I was staring at the screen mindlessly all day long. Then, of course, when you stare at something long enough, the problem reveals itself. With enough work orders reported, it became clear that the updates only went missing between 9 and 11 PM, regardless of the type of work order or the data entered. When this pattern was mentioned, it didn’t take long for someone to point out that this is usually the time when IT does its Windows patching.

When a server is being updated, IT sets the F5 load balancer to redirect any user request to a “Site Under Maintenance” page, which makes sense for a normal user accessing a service via the browser. The problem is that when Maximo published an integration message to WebMethods, it received the same web page. That in itself is fine, as Maximo doesn’t process the response body. However, the status of the response was HTTP 200, which is not fine in this case. Since it was an HTTP 200 OK status, Maximo assumed the message had been accepted by WebMethods and marked it as a successful delivery. WebMethods, on the other hand, never received the message.

The recommendation in this case is to set the status of the maintenance page to something other than HTTP 2xx. When Maximo receives a status other than 2xx, it marks the message as a delivery failure. This means the administrator will be notified if monitoring is set up, and the failed message will be listed as an error and can be resubmitted using the Message Reprocessing app.

Due to the complex communication chain involved, I never heard back from the F5 team about what exactly was done to rectify the issue. However, from a quick search, it looks like it can be achieved easily by updating a rule in F5.
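For illustration, if the maintenance page is served from an iRule, the fix could be as simple as responding with a 503 instead of a 200. A minimal sketch, assuming an iRule-based maintenance switch (this is not the client’s actual rule):

when HTTP_REQUEST {
    # Respond with 503 Service Unavailable instead of 200 OK so that
    # integration clients such as Maximo treat the delivery as a failure.
    HTTP::respond 503 content {<html><body>Site under maintenance</body></html>} \
        "Content-Type" "text/html" "Retry-After" "300"
}

A 503 with a Retry-After header also tells well-behaved clients when it is worth trying again.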

This same issue recently came back to me, so I had to add it to my list of common issues with load balancers. I think it is also fun enough to deserve a separate post. If you made it this far and think it’s not fun, I hope it will at least be useful to you at some point.

Implement “Sleep” or “Wait” in WebMethods flow

I needed to send an external system a file import request. The external system would take some time to process the file before the import result could be queried. Making a status query immediately after the import request would always return an “import process is still running” response, so it’s best to wait a few seconds before making the first attempt to query the import status.

It took quite a bit of time searching the web for a “wait” or “sleep” function. Some posts suggested using a Java service; some recommended complex processes or an external library.

The easiest method I finally settled on is to use Repeat as follows:
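In outline, the flow is just a Repeat step wrapping a debugLog call, followed by the main logic. Roughly (step properties from memory):

REPEAT (Count = 1, Repeat interval = 5 seconds, Repeat on = SUCCESS)
    INVOKE pub.flow:debugLog   "Waiting 5 seconds before the first status query..."
MAP (Main Mapping)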

Essentially, the flow repeats one time after 5 seconds before getting to the next step (Main Mapping). The repeat loop does nothing other than write a line to the server log to make troubleshooting a bit easier.
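For completeness, the Java service alternative mentioned above is essentially a wrapper around Thread.sleep. A minimal sketch; the service name and the millis input are my own, not from any standard package:

// webMethods Java service "sleep"; input: millis (String)
// Shared imports: com.wm.data.*, com.wm.app.b2b.server.ServiceException
public static final void sleep(IData pipeline) throws ServiceException {
    IDataCursor cursor = pipeline.getCursor();
    String millis = IDataUtil.getString(cursor, "millis");
    cursor.destroy();
    try {
        // Block the current service thread for the requested duration
        Thread.sleep(Long.parseLong(millis));
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // restore the interrupt flag
        throw new ServiceException(e);
    }
}

The Repeat approach has the advantage of needing no Java code at all and keeping the wait visible in the flow itself.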

The fun (and pain) of Kronos Integration

One of our clients undertook a massive IT transformation program which involved switching to a new financial management system and upgrading and rebuilding a plethora of interfaces among several systems, both internal and external to the business. Kronos was chosen to replace an old timesheet application, and it needed to be integrated with other systems such as Maximo and TechnologyOne. WebMethods was used as the integration tool for this IT ecosystem. This was my first experience with Kronos. The project took almost two years to finish, and as always when dealing with something new, I had quite a bit of fun (and pain) along the way. As the project is approaching its final stage, I think I should write down what I’ve learnt. Hopefully, it will be useful for people out there doing a similar task.

REST API: Kronos provides a pretty good reference source for the REST API at this link. A REST API theoretically offers the advantage of supporting real-time integration and enabling seamless workflows. However, we didn’t have such a requirement in this project, and the REST API comes with two major limitations.

One is the API throttling limit, which essentially restricts the number of calls you can make based on the license purchased.

The other limitation is that the API was clearly built for the application’s internal use; it is not meant for external integration. No one told us this when we first started, and as a result, we hit several major obstacles along the way.

For example, most API calls are operation-specific. Cost Center requests must be either Create New, Update, or Move; there is no Sync or Merge operation. The Update and Move requests accept Kronos’ internal ID only, so before sending an update or move request, we need to send another request to retrieve the internal ID of the record.

Cost Center is a simple master data structure with a few fields. However, to get it to work, we had to build some complex logic to query Kronos and determine whether the record exists (and whether its parent exists) in order to send the appropriate Create New, Update, or Move call. The decision logic is sketched below.
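In outline, the decision looks something like the following. All types and method names here are hypothetical stand-ins, not the actual Kronos API, and the parent-existence check is omitted for brevity:

// Simplified create/update/move decision for a cost centre record.
// KronosClient and CostCenter are hypothetical stand-ins.
public final class CostCenterSync {
    record CostCenter(String id, String name, String parentId) {}

    interface KronosClient {
        CostCenter findByName(String name);   // extra lookup to resolve the internal ID
        void create(CostCenter cc);
        void update(String internalId, CostCenter cc);
        void move(String internalId, String newParentId);
    }

    static void sync(KronosClient kronos, CostCenter incoming) {
        CostCenter existing = kronos.findByName(incoming.name());
        if (existing == null) {
            kronos.create(incoming);                           // no record yet -> Create New
        } else if (!java.util.Objects.equals(existing.parentId(), incoming.parentId())) {
            kronos.move(existing.id(), incoming.parentId());   // parent changed -> Move
        } else {
            kronos.update(existing.id(), incoming);            // same parent -> Update
        }
    }
}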

The complexity was taken to another level when we needed to build a caching mechanism to pre-load and refresh the data at suitable times so that the number of requests sent to Kronos is kept to a minimum.

For a more complex data structure such as the Employee master data, the REST API makes it impossible to build an interface robust enough for a large-scale, high-volume environment. I felt like we had to build a whole application layer in WebMethods to handle all sorts of logic, scenarios, and possible exceptions. The process of creating a new employee record can result in more than a dozen different requests to check existing data and look up the internal IDs of different profiles (security, schedule, timesheet, holiday, pay calculation profiles, etc.), then send the Create New/Update requests in the correct order, handle exceptions properly, and roll back if one request fails for any of various reasons.

REST API reference guide (This page shows the API Throttling limits)

Report API: Kronos has a REST API to execute reports (both out-of-the-box and customized). This is useful to alleviate some of the problems with the API throttling limit. For example, we have an interface to send organisation hierarchy (departments and job positions) to Kronos as Cost Centers.

The source system would periodically export its whole data set to a CSV file, and we needed to query Kronos to determine whether each record exists (to send either a Create or an Update request) and whether it has new changes (to send update/move requests). We used the Report API to retrieve the whole set of Cost Center data in one single call rather than making thousands of individual cost centre detail requests.

Import API: This turned out to be the best way to send data to Kronos, and we learnt it the hard way. The Import API still has a few minor limitations, such as some endpoints using a description to identify a record instead of an ID, and documentation that is sometimes inaccurate. Overall, though, it provides bulk upload, automatic ID translation, and a “merge” operation (i.e. it automatically decides whether to create or update depending on whether the record already exists).

Since the import is an asynchronous operation, and the time it takes to process inbound data depends on the volume, we needed to build a custom response handler that queries (and retries) Kronos later to retrieve the status of the import job and handle success or failure. This custom response handling takes some extra effort to build, but it can be reused for different import endpoints.
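At its core, the response handler is a poll-with-retry loop. A generic sketch in Java; KronosClient and ImportStatus are hypothetical stand-ins, not the actual Kronos client API:

import java.util.concurrent.TimeoutException;

public final class ImportPoller {
    // Hypothetical stand-ins for the real client and status payload
    interface KronosClient { ImportStatus getImportStatus(String jobId); }
    record ImportStatus(boolean complete, boolean failed) {}

    static ImportStatus waitForImport(KronosClient client, String jobId,
                                      int maxAttempts, long pollIntervalMs)
            throws InterruptedException, TimeoutException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Thread.sleep(pollIntervalMs);           // wait before each status check
            ImportStatus status = client.getImportStatus(jobId);
            if (status.complete() || status.failed()) {
                return status;                      // hand off to success/failure handling
            }
        }
        throw new TimeoutException("Import job " + jobId + " did not finish in time");
    }
}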

With the Employee interface mentioned earlier, at one point it became way too complex and a maintenance nightmare. We had to rebuild it from scratch using the Import API, and we’re glad we did. It was greatly simplified, and we are now very confident of its robustness.

List of Import APIs, which can be seen after logging in to Kronos

To conclude: if I had to build a new Kronos interface now, for retrieving data from Kronos and sending it to an external system I would start with reports to identify new changes, then use the REST API to retrieve details of individual records where necessary. To send data to Kronos, I would look at the Import API first and only go for the REST API if the Import API cannot do what I need, and even then only if the request is very simple and low volume.

WebMethods: Evaluate String IN and CONTAINS operator

In WebMethods, the most basic way to write a string “IN” operator is to use Branch as follows:

Another way to reduce the number of lines of code is by combining the conditions using “OR”:

These approaches work well if the number of options is small or the variable name is short. If there are more than a few options, or the variable name is very long, the code becomes messy and difficult to maintain. For example:
$work_order.maximo/ns:PublishZZWORKORDER/ns:ZZWORKORDERSet/ns:WORKORDER[0]/ns:STATUS/*body

Using regular expressions can simplify the code:

To implement the string “IN” logic, we can use /^value$/. For example, the code below evaluates to true if $input is exactly “one”, “two”, or “three”:


To implement the string “CONTAINS” logic, we can use /value/. For example, the below would return true if the $input string contains “one”:

With regex, it is also simple to check if the variable contains one of several values:
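Putting these together, the Branch step can look roughly like this (labels are regular expressions; branches are evaluated top to bottom and the first match wins):

BRANCH on '$input'
    SEQUENCE /^(one|two|three)$/    <- "IN": $input is exactly one, two, or three
    SEQUENCE /one/                  <- "CONTAINS": $input contains "one"
    SEQUENCE /one|two|three/        <- $input contains one of several values
    SEQUENCE $default               <- fallback when nothing matches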

My favourite Martin Fowler quote is: “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” This trick helps me keep most of my code fitting on a single screen.

Happy coding.

Setting up alarms for integration

When writing a piece of software, we are in total control of the quality of the product. With integration, many elements are not under our control. The network and firewall are usually managed by IT. With external systems, we usually don’t know how they work, and many times we are not given access. Yet any change to these elements can cause our interfaces to fail.

 

For synchronous interfaces, the user receives instant feedback after each action (e.g. Maximo-GIS integration), so we don’t usually need to set up alarms. Asynchronous interfaces usually run in the background and give no instant feedback, so when a failure occurs, it usually goes unnoticed. In many cases, we only find out about failures after they have caused some major damage.

A good interface must provide an adequate mechanism to handle failures, and in the case of async integration, proper alarms and reports should be set up so that failures are captured and handled proactively by IT and application administrators.

On the one hand, it is bad to have no monitoring. On the other hand, it is even worse to have so many alarms that people completely ignore everything, including the critical issues. This is usually seen in larger organisations. Many readers of this blog probably won’t be surprised when they open the Message Reprocessing app of the Maximo system they manage and find thousands of unprocessed errors in there. It’s likely that those issues have accumulated, not dealt with, for years.

 

It is hard to create a perfect design from day one and build an interface that works smoothly after the first release. There are many kinds of problems an external system can throw at us, and it is not easy to envision all possible failure modes. As such, we should expect and plan for an intensive monitoring and stabilizing period of a few days to one or two weeks after the first release.

As a rule of thumb, an interface should always be monitored and raise alarms when a failure occurs. It should also provide a mechanism to resubmit/reprocess a failed message. More importantly, there shouldn’t be more than a few alarms raised per day on average from each interface, no matter how critical or high volume the integration is. Any more than that and it becomes too noisy, and people start ignoring the alarms. If an interface raises more than a few alarms a day, there must be some recurring patterns, and each of them must be treated as a systemic issue: the software should be rebuilt or updated to handle it.

It is easier said than done, and every interface is a continuous learning and improvement process for me. Below are some examples of interfaces I built or dealt with recently. I hope you find them entertaining to read.

 

Case #1: Integration of Intelligent Transport System to Maximo

An infrastructure construction company built and is now operating a freeway in Sydney. Maximo is used to manage maintenance activities, mainly on civil infrastructure. Toll point equipment and a traffic monitoring system were provided by an external provider (Kapsch). Device status and maintenance work from this system are exported daily as CSV files and sent to Maximo via SFTP.  On the Maximo side, the CSV files are imported using a few automation scripts triggered by a cron task.

The main goal of the interface is to maintain a consolidated database of all assets and maintenance activities in Maximo. It is a non-critical integration: even if it stops working for a day or two, it won’t cause a business disruption. However, occasionally Kapsch would stop exporting CSV files for various reasons, and the problem would only be discovered after a while, such as when an end-of-month report is produced or when someone tries to look up the status of a work order that was never created via the interface. Since we don’t have any access or visibility into the traffic monitoring system managed by Kapsch, we needed to build the monitoring and alarms in Maximo.

The difficulty is that when the interface on Kapsch’s side fails, it doesn’t send Maximo anything; there is no import, and thus no error or fault visible to Maximo to raise an alarm about. The solution we came up with is a custom logging table in which each import is written as an entry with some basic statistics, including import start time, end time, total records processed, and the number of records that failed. The statistics are displayed on the Start Center.

For alarms, since this integration is non-critical, an escalation monitors whether there has been no new import within the last 24 hours; if so, Maximo sends an email to me and the people involved (see the sketch below). There are actually a few different interfaces in this integration, such as the device list and preventive maintenance work coming from TrafficCom, and corrective work on faults coming from JIRA. Thus, when a system stopped running for planned or unplanned reasons, I would sometimes receive multiple emails for a couple of days in a row, which is too much. So I tweaked it further to send only one email on the first day one or more interfaces stop working, and a reminder a week later if the issue has not been rectified. After the initial fine-tuning period, the support teams on the Kapsch and Maximo sides were added to the recipient list, and after almost two years, the integration has been running satisfactorily. In other words, there have been a few occasions when files were not received on the Maximo side, and the support people involved were always informed and able to take corrective action before the end users could notice.
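As an illustration, the “no import in the last 24 hours” check can be written as an escalation condition on the custom logging table. A sketch with hypothetical table and column names (Oracle syntax):

-- Fires on the newest import log entry when it is more than 24 hours old
importend = (select max(importend) from zzimportlog)
and importend < (sysdate - 1)

The escalation’s action then sends the notification email to the recipient list.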

 

Case #2: Integration of CRM and Maximo

A water utility in Queensland uses Maximo for managing infrastructure assets and for tracking and dispatching work to field crews. When a customer calls to request a new connection or report a problem, the details are entered into a CRM system by the company’s call centre. The request is then sent to Maximo as a new SR and turned into work orders. When the work order is scheduled and a crew has been dispatched, these status updates are sent back to CRM. At any time, if the customer calls to check on the status of the request, the call centre should be able to answer by looking up the ticket in CRM alone. Certain types of problems have high priority, such as major leaks or water quality issues, and some have SLAs with response times measured in minutes. As such, this integration is highly critical.

WebMethods is used as the middleware for this integration, and as part of sending a new SR from CRM to Maximo, the service address also needs to be cross-checked with ArcGIS for verification and standardization. As you can see, there are multiple points of failure in this integration.

This integration was built several years ago, and there was some level of alarming set up in CRM at a few points with a high risk of failure, such as when a Service Order is created but not picked up by WebMethods, or picked up but not sent to Maximo. Despite this, the interface would have issues every few weeks, and thus it needed to be rebuilt. In addition to the existing alarms coming from CRM, several new alarm points were added in Maximo and WebMethods:

  • When WM couldn’t talk with CRM to retrieve a new Service Order
  • When WM couldn’t send a status update back to CRM
  • When WM couldn’t talk to Maximo
  • When Maximo couldn’t publish messages to WM

These apply to individual messages coming in and out of Maximo and CRM and any failure would result in an email sent to the developer and the support team.

In the first few days after this new interface was released to production, the team received a few hundred alarms each day. My capacity to troubleshoot was about a dozen of those alarms a day. Thus, instead of trying to solve them one by one, we tried to identify all the recurring patterns and address them by modifying the interface design and business process, or by fixing bad data. A great deal of time was also spent improving the alarms themselves: for each type of issue, a detailed error message, and in many cases the content of the XML message itself, is attached to the email alarm. A “fix patch” was released to production about two weeks after the first release, and after that, the integration only produced a few alarms per month. In most cases, the support person can immediately tell the cause of the problem just by looking at the email, before even logging in to the client’s environment. After almost a year, all of the possible failure points we envisioned, no matter how unlikely, have failed and raised alarms at least once, and the support team has always been on top of them. I’m glad we put all of that monitoring in place; as a result, I haven’t heard of any issue that wasn’t fixed before the end users became aware of it.

 

Case #3: Interface with medium criticality/frequency

Of the two examples above, one is low frequency and low criticality; the other is high frequency and highly critical. Most interfaces are somewhere in the middle. Interfaces that are highly critical but don’t run frequently or don’t need a short response time can also be put in this category. In such cases, we might not need to send individual alarms in real time. Although I like to think of myself as pretty good at troubleshooting, I don’t think I can handle more than a few issues per day. As such, my rule of thumb is that if I receive more than a few alarms per day, it is too much. And if we as developers can’t handle more than a few alarms a day, we shouldn’t do that to the support team either. For the utility company mentioned above, when WebMethods was first deployed, the WM developer configured a twice-daily report listing all failed transactions that occurred in the last 12 hours. Thus, for most interfaces, we don’t need to set up any specific alarms: if there are a few failures, they show up in the report and are looked at by technical support at noon and at the end of the day. This appears to work really well, even for some very critical interfaces such as bank transfer orders or invoice payments.

 

Case #4: Recurring failures resulting in too many alarms

For the integrations mentioned in #1 and #2, the key to getting them to work satisfactorily was to spend some time after the first release monitoring the interfaces and fine-tuning both the interface itself and the alarms. It is important to have alarms raised when failures occur, but it is also important to ensure there aren’t too many of them. Not only will people ignore alarms if they receive too many, it also makes it hard to tell the critical issues apart from the noisy, less important ones. From my experience, dealing with those noisy alarms is usually quite easy. Most of the time, they come from a few recurring failures that have been ignored. When people first look at the backlog, they can easily be overwhelmed by the sheer number of issues and feel reluctant to deal with it. In many cases, I simply deal with each alarm or failure one by one and carefully document the error message or symptom, and the solution, in an Excel spreadsheet. Usually, after I’ve gone through a few issues, they all come back to some recurring patterns that can easily be dealt with.

Example: a water utility uses an external asset register system, and the asset data is synchronized to Maximo in near real time. The interface produced almost 1 GB of SystemOut.log each day, rendering the logging system useless. I looked at the errors and documented them one by one. After about two hours, it was clear that 80% of them came from missing locations which had not been synchronized; when the integration tried to create new assets under these locations, it wrote a bunch of errors to SystemOut.log. I did a quick scan, wrote down all of the missing locations, and quickly added them to Maximo using MXLoader. After that, the number of errors was greatly reduced. By occasionally checking the log files over the following few days, I was able to list all 30+ missing locations and remove all of the related errors. The remaining errors in the log files were easily handled separately. Some critical issues only came onto the business’s radar after that.

 

 

String Concatenation in WebMethods

Manipulating strings is probably the most frequent operation we need to do when transforming data. Thus, I’d like to talk a bit about string concatenation in WebMethods. The most basic way to add two strings together is the pub.string:concat service in the WmPublic package, as shown below:

Image 01. pub.string:concat service

However, this approach is way too limited. For example, to combine a unique ID string consisting of a prefix, an auto-ID number, and a suffix with separators in between, such as WO-10012-CM, we need at minimum four lines of code and a number of temporary variables. That’s crazy.

An alternative approach is to use variable substitution when assigning values to a variable, as shown below. In this case, we can build a new string from an unlimited number of variables in just one line of code.

Image 02. Using variable substitution

This approach doesn’t work if one of the input variables can have a null value. In that case, it gives us an unwanted result, as shown below:

Image 03. Bad result when input is Null
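To illustrate, assigning the string below with variable substitution enabled behaves roughly as follows; if I remember the behaviour correctly, an unresolved token is left in the output as-is:

Assigned value:  %prefix%-%autoid%-%suffix%
prefix = "WO", autoid = "10012", suffix = "CM"   ->  WO-10012-CM
suffix is null                                   ->  WO-10012-%suffix%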

With the Tundra library, we have the much more flexible tundra.string:concatenate service. It allows us to add an unlimited number of strings in one line and, at the same time, offers other capabilities such as adding a separator.

Image 04. Use tundra.string:concatenate

In the example above, I can achieve the same result in one line of code, and the Tundra service is smart enough not to add the second separator when the $suffix variable is null.

However, beware of an annoying WebMethods bug: there is no way to set the order of input variables. When you update one input, for example changing the input variable str2 to map to a different variable as in the example below, it moves to the bottom of the list and messes up the final output.

Image 05. Last updated variable is moved to the bottom

A workaround is to update the other variables manually to correct the order, but from my experience, anything with more than three inputs becomes a maintenance nightmare.

Image 06. A “quick” update messed up the order of concatenation

The more robust approach is to write our own concatenate service as part of a common service package. It can take a bit of time, but on a big project, it’s worth the effort to build a common service library that is reused across the entire environment. Ten minutes spent building this service can save the whole team countless hours of development and maintenance.
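A sketch of what such a service could look like as a webMethods Java service: it joins a string list with a separator and skips null or empty entries. The service and pipeline variable names here are my own:

// Java service "concatenate"; inputs: strings (String List), separator (String);
// output: result (String). Shared imports: com.wm.data.*, com.wm.app.b2b.server.ServiceException
public static final void concatenate(IData pipeline) throws ServiceException {
    IDataCursor cursor = pipeline.getCursor();
    String[] strings = IDataUtil.getStringArray(cursor, "strings");
    String separator = IDataUtil.getString(cursor, "separator");
    StringBuilder result = new StringBuilder();
    if (strings != null) {
        for (String s : strings) {
            if (s == null || s.isEmpty()) continue;   // skip nulls: no dangling separators
            if (result.length() > 0 && separator != null) result.append(separator);
            result.append(s);
        }
    }
    IDataUtil.put(cursor, "result", result.toString());
    cursor.destroy();
}

Taking a String List rather than str1, str2, str3... as input also sidesteps the input-ordering bug described above.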

Edit: After reading this post, Lachlan Dowding, the Tundra library’s author, told me we can use a String List to work around the WebMethods bug with input order when using Tundra’s concatenate.

Implement If-Then-Else Logic in WebMethods

Conditional logic is the most important building block of any software development tool. WebMethods is not a programming language, but since we use it to build integration interfaces, which are also software, we are programming with it all the same. Writing a simple “If-Then-Else” condition in WebMethods is way too verbose for my taste, though. The official tutorial on SoftwareAG teaches us to implement if-then-else logic using the Branch, Sequence, and Map nodes, as depicted in the sample below:

As we can see, one simple If-Then-Else condition requires at least five lines to achieve. A more complex Case condition with multiple branches can take up half a screen’s real estate, which makes the code harder to read.
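For reference, the tutorial-style structure looks roughly like this (the variable and values are illustrative):

BRANCH on '$var1'
    SEQUENCE 'A'               <- the "then" branch
        MAP  set $output = "Approved"
    SEQUENCE $default          <- the "else" branch
        MAP  set $output = "Rejected"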

My preferred alternative approach is to use the “Copy Condition” when mapping variables as shown below:

If the condition specified in this “Copy Condition” is true, the value is mapped from the left-side variable to the right-side variable. In this case, the value of the $valuelist/Approved variable is mapped to the $output variable if the condition $var1 = ‘A’ is true.

To achieve the default (Else) branch, we use the NOT (!) operator such as: 

! ($var1 = ‘A’ OR $var1 = ‘W’ OR $var1 = ‘C’)

By doing this, no matter how many conditions we have, we can fit everything in one line.

 

 
