Analyzing API Performance Hour-of-Day Statistics

A recent post analyzed API performance by hour of day over a one-week period. The average performance of calls to the API was fairly consistent, except for calls made in the last hour of the day (the hour before midnight, Universal Time). This plot presents the analysis results:

The question is: why was average performance so much worse during the last hour of the day? Does this plot imply that customers were regularly seeing very slow performance when accessing your product during that time period?

The raw data used to create this plot is a series of hourly performance statistics gathered from the API Science Performance Report API. The requested data is returned in JSON format and read into a Python program that bins the data by hour of day and produces the plot shown above.

To gain deeper insight into the characteristics of the data, we’ll compute statistics (minimum, maximum, mean, and standard deviation) for the raw data in each hour-of-day bin. First, we initialize several Python NumPy arrays:

# initialize arrays
import numpy as np

hourly_values = np.zeros((24, 7), dtype=float)  # up to 7 daily values per hour of day
hourly_count = np.zeros(24, dtype=int)          # number of values stored per hour
mins = np.zeros(24, dtype=float)
maxes = np.zeros(24, dtype=float)
means = np.zeros(24, dtype=float)
stds = np.zeros(24, dtype=float)

The hourly_values array holds the performance values for each hour of day. The hourly_count array is needed for the case in which there was no data on one or more days for a particular hour of day. These arrays are filled as follows:

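for i in range(n_results):
    # characters 11-12 of the timestamp hold the two-digit hour
    this_hour = int(perf['data'][i]['startPeriod'][11:13])
    avgTot = perf['data'][i]['averageTotal']
    # averageTotal is null (None) when no data was recorded
    if avgTot is not None:
        hourly_values[this_hour, hourly_count[this_hour]] = float(avgTot)
        hourly_count[this_hour] += 1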

We extract the hour of day (this_hour) from the API Science performance JSON by slicing the characters at zero-based positions 11 and 12 from the startPeriod JSON element, which is a timestamp (for example, "startPeriod":"2019-02-15T17:39:24.478Z", which yields hour 17). The averageTotal will have a numeric value if data was recorded for that time period, or a value of null if no data was available. If data was available, it is stored in the next free slot of the hourly_values array for that hour, and the hourly count is incremented.
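As a quick check of the slicing described above, here is a minimal sketch that parses a small JSON sample. The two records and the hour_of_day helper are hypothetical; only the startPeriod and averageTotal field names and the first timestamp come from the text above.

```python
import json

# Hypothetical two-record sample shaped like the fields used in this post.
sample = json.loads('''
{"data": [
  {"startPeriod": "2019-02-15T17:39:24.478Z", "averageTotal": 231.5},
  {"startPeriod": "2019-02-15T23:39:24.478Z", "averageTotal": null}
]}
''')

def hour_of_day(start_period):
    """Return the hour (0-23) from a timestamp like 2019-02-15T17:39:24.478Z."""
    return int(start_period[11:13])

# One hour value per record, plus a flag for whether data was recorded.
hours = [hour_of_day(rec["startPeriod"]) for rec in sample["data"]]
has_data = [rec["averageTotal"] is not None for rec in sample["data"]]

print(hours)
print(has_data)
```

The JSON null becomes Python's None after parsing, so the second record would be skipped when filling the hourly bins.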

The statistical arrays, which store the computed statistics for each hour of day, are filled like this:

# compute performance statistics for each hour of day
for i in range(24):
    if hourly_count[i] > 0:
        vals = hourly_values[i][0:hourly_count[i]]
        mins[i] = vals.min()
        maxes[i] = vals.max()
        means[i] = vals.mean()
        stds[i] = vals.std()

The vals variable is a slice of the hourly values (up to seven) recorded for that hour of day. The NumPy array methods min(), max(), mean(), and std() compute the statistics for each slice. Hours with no recorded values keep their initial statistics of zero.
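The same per-hour statistics can also be computed without an explicit loop. The following is only a sketch of one alternative, not the approach used in this series: it assumes unfilled slots are marked with NaN instead of zero, and the sample readings are made up for illustration.

```python
import numpy as np

# Hypothetical sample: unfilled slots stay NaN instead of zero.
hourly_values = np.full((24, 7), np.nan)
hourly_values[0, :2] = [100.0, 120.0]
hourly_values[23, :3] = [200.0, 300.0, 400.0]

# Only reduce rows that hold at least one reading, so the NaN-aware
# functions never see an all-NaN row.
has_data = ~np.isnan(hourly_values).all(axis=1)
filled = hourly_values[has_data]

mins = np.full(24, np.nan)
maxes = np.full(24, np.nan)
means = np.full(24, np.nan)
stds = np.full(24, np.nan)

mins[has_data] = np.nanmin(filled, axis=1)
maxes[has_data] = np.nanmax(filled, axis=1)
means[has_data] = np.nanmean(filled, axis=1)
stds[has_data] = np.nanstd(filled, axis=1)  # ddof=0, like ndarray.std()

print(mins[23], maxes[23], means[23])
```

One difference from the loop version: hours with no readings end up as NaN rather than 0, which makes missing data easier to tell apart from a genuine zero.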

At this point, we have the statistics for each hour of day across the week. To visualize this information, we’ll feed it into a Python Matplotlib errorbar plot:

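# create stacked errorbars
import matplotlib.pyplot as plt

hours = np.arange(24)
# thick black bars: one standard deviation above and below the mean
plt.errorbar(hours, means, stds, fmt='_k', lw=3)
# thin red bars: full range from minimum to maximum
plt.errorbar(hours, means, [means - mins, maxes - means],
        fmt='.k', ecolor='red', lw=1)
plt.xlim(-1, 24)
plt.ylabel('Total Milliseconds')
plt.xlabel('Hour of Day')
title = 'Monitor 1572022 Past Week Hour of Day Performance Stats'
plt.title(title)
plt.show()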

Here’s the resultant plot for the performance data we’re examining:

The Hour 23 statistics immediately stand out. Meanwhile, the performance for the first 23 hours of the day (Hours 0-22) is fairly consistent.

But what is this plot actually telling us from a statistical point of view? The horizontal line in each statistical column is the mean. These means are an alternative representation of the heights of the bars in the chart presented at the beginning of this post. That’s all that the bar chart can tell us.

Here, though, we’ve visualized important additional information. The wide black bar extending upward and downward from the mean represents one standard deviation of the data set on either side of the mean. The thin red line represents the range of the data (minimum and maximum values).

Studying the Hour 23 statistics, we see that the minimum performance timing is similar to the performance timings for other hours of the day. But the maximum time for an Hour 23 API call was above 3000 milliseconds, far higher than the maximum time during any other hour of the day.

So, something was awry in Hour 23 during that particular week. To analyze this further, we look at the individual performance data points for Hour 23:

Hour 23: [  209.36  3139.9    220.36   242.15   267.07   210.71   202.65]

One value stands out: 3139.9 milliseconds for a particular call to the API. All the other values lie within the normal range for other hours of the day.

The conclusion, then, is that the Hour 23 performance for calling the API is not normally different from the performance at other times of day. Rather, on a particular day of the week under study, the Hour 23 performance was very slow.
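To see how strongly that single reading drives the Hour 23 mean, the short sketch below recomputes the statistics from the seven values listed above. The variable names are illustrative; the numbers are the Hour 23 data points from this post.

```python
import numpy as np

# The seven Hour 23 values reported above, in milliseconds.
hour_23 = np.array([209.36, 3139.9, 220.36, 242.15, 267.07, 210.71, 202.65])

mean_all = hour_23.mean()
median_all = np.median(hour_23)
# Drop the single largest reading and average the remaining six.
mean_without_max = np.delete(hour_23, hour_23.argmax()).mean()

print(f"mean of all seven values:   {mean_all:.2f} ms")
print(f"median of all seven values: {median_all:.2f} ms")
print(f"mean without the maximum:   {mean_without_max:.2f} ms")
```

With all seven values the mean is roughly 642 ms, while the median is 220.36 ms and the mean of the other six values is roughly 225 ms. The median is barely affected by the one slow call, which is why it can be a useful companion to the mean for small samples like this.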

There are many ways a development team might cope with this situation. A time limit could be instituted, whereby if the API response is not received within a certain number of milliseconds, the software times out the call and moves on, producing a partial result for the users. The benefit here is that the user is not left staring at an unresponsive screen (and thinking: “I wonder if some other company has a better app?”). Users receive the parts of their request that are readily available, then they move on.
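A time limit of this kind might look something like the following sketch, which uses Python's standard concurrent.futures module. The call_api function is a hypothetical stand-in that simply sleeps, and call_with_timeout is an illustrative helper, not part of any API discussed in this series.

```python
import concurrent.futures
import time

def call_api(delay_s):
    """Hypothetical stand-in for an API call; sleeps, then returns data."""
    time.sleep(delay_s)
    return {"status": "ok"}

def call_with_timeout(func, timeout_s, *args):
    """Return func(*args), or None if it takes longer than timeout_s seconds.

    The worker thread is not cancelled on timeout; it finishes in the
    background while the caller moves on.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Do not block the caller waiting for a slow call to finish.
        pool.shutdown(wait=False)

fast = call_with_timeout(call_api, 1.0, 0.01)   # finishes in time
slow = call_with_timeout(call_api, 0.05, 0.3)   # exceeds the limit
print(fast, slow)
```

A None result signals that the caller should fall back to whatever partial data it already has. In practice, many HTTP clients accept a timeout directly (for example, the timeout argument of urllib.request.urlopen), which is usually simpler than wrapping the call in a thread.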

This blog series has illustrated the type of analysis a product development team can perform using API monitoring data accessed from the API Science API, and how Python can be used to represent and visualize API performance data.

–Kevin Farnham