Efficiently Calculating Averages of Real-Time Data
Average calculation gets slower as data increases? Let's solve it this way!

In this post, I want to briefly explain how to efficiently calculate averages of rapidly accumulating real-time data. The need to average real-time data comes up quite often - for example, calculating the average response time from a server engine's access logs, or averaging the values coming from a sensor.
Since this data can accumulate at intervals as fast as 1ms in some cases, performance for processing data quickly as soon as it’s received is extremely important.
When we need to calculate averages from this data in real-time, the typical average formula that comes to mind would be sum of all data / length of data array. The average value calculated like this is called the arithmetic mean.
However, the typical method we use to calculate arithmetic mean doesn’t really fit systems needing to process data quickly in real-time. Why is that?
Problems with Typical Arithmetic Mean Formula
The formula that typically comes to mind when calculating the arithmetic mean is:

$$\bar{x}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$$

While the formula looks a bit complex, ultimately it's initializing $k$ to 1, iterating until $k$ becomes $n$, adding everything from $x_1$ to $x_n$, then finally multiplying by $\frac{1}{n}$ - basically the typical average algorithm we learned in school. This method of collecting all the data and calculating at once is called a batch expression.
The batch expression's disadvantage is exactly that its time complexity is $O(n)$. While this isn't a big problem when the data quantity is small, it becomes a fatal disadvantage in systems that need to quickly process continuously accumulating real-time data.
Let’s simply check the execution time of a batch expression average algorithm repeatedly calculating the average from 1 to 100,000.
```javascript
let avg = 0;
const numbers = [];

function average(numbers = []) {
  const sum = numbers.reduce((prev, current) => prev + current, 0);
  return sum / numbers.length;
}

console.time('avg1');
for (let k = 1; k < 100001; k++) {
  numbers.push(k);
  avg = average(numbers);
}
console.timeEnd('avg1');
console.log(`So the average is? -> ${avg}`);
```

I wrote simple code mimicking real-time data processing. New data is repeatedly pushed into the numbers array, and a new average is calculated each time. Since the for loop only exists to simulate the real-time data input environment, exclude it from the time complexity and look only inside the average function: you can see that each time an average is calculated, it traverses the entire accumulated data array from the beginning to find the sum.
So ultimately, the time this algorithm took to calculate the average of 100,000 data points…
```
$ node average.js
avg1: 8146.261ms
So the average is? -> 50000.5
```

About 8146ms? Roughly 8 seconds - an extremely long time. In a real-time data processing system, if processing some value takes 8 seconds, that's just ruined. (Actually, even exceeding 1 second means ruined.)
Plus, due to the nature of real-time data, the data inevitably keeps accumulating and increasing - each individual average calculation takes time proportional to the data accumulated so far, so the benchmark above, which recalculates on every insertion, grows quadratically: if 100,000 points take 8 seconds, 1,000,000 points would take closer to 800 seconds.
Actually, the batch expression's real disadvantage isn't simply that it takes long, but that, like this, its execution time keeps increasing as the data increases.
Also, when calculating the average of the data up to the $n$-th point, you can't reuse the previous average calculation result at all, so you must store every data point that has come in so far, wasting memory unnecessarily. So, is there an average algorithm whose execution speed stays constant regardless of how much the data grows?
Average Filter Algorithm
There is. It's the average filter algorithm. When the $n$-th data point comes in, this algorithm reuses the average of the first $n-1$ points to calculate the average of all $n$ points. This kind of formulation is called a recurrence expression. (By the way, a recurrence expression is a more comprehensive concept than a recursive function.)
Using the recurrence-expression average filter algorithm, you don't need to store any previous data, and no matter how many data points accumulate, it absolutely guarantees $O(1)$ time complexity. That's why this algorithm sees heavy use in IoT and embedded fields, which need to process data coming from sensors at short intervals in real time. (I first learned this from a friend working in that field.)
But diving straight into math isn’t fun, so to aid understanding, this time I’ll first check code and execution time before explaining this algorithm.
```javascript
let avg = 0;

function cumulativeAverage(prevAvg, newNumber, listLength) {
  const oldWeight = (listLength - 1) / listLength;
  const newWeight = 1 / listLength;
  return (prevAvg * oldWeight) + (newNumber * newWeight);
}

console.time('avg2');
for (let k = 1; k < 100001; k++) {
  avg = cumulativeAverage(avg, k, k);
}
console.timeEnd('avg2');
console.log(`So the average is? -> ${avg}`);
```

The cumulativeAverage function takes 3 arguments: the average up to the previous data point, the newly entered data, and the total data count. It then calculates the new average immediately through one simple formula. The biggest difference from the average function written above is that there is no iteration inside the function.
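As a side note (my addition, not in the original post), the weighted formula above can be rewritten in an algebraically equivalent incremental form that needs only a single division:

```javascript
// Equivalent incremental form of the average filter:
// newAvg = prevAvg + (newNumber - prevAvg) / n
// Expanding it gives prevAvg * (n-1)/n + newNumber * 1/n, the same weighted sum.
function cumulativeAverageIncremental(prevAvg, newNumber, listLength) {
  return prevAvg + (newNumber - prevAvg) / listLength;
}

let avg = 0;
for (let k = 1; k < 100001; k++) {
  avg = cumulativeAverageIncremental(avg, k, k);
}
console.log(avg); // -> 50000.5
```

This is just another common way to write the same recurrence; both versions do constant work per data point.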
The batch expression average algorithm we saw above took 8 seconds to calculate the average of 100,000 data points in real-time, but this guy is on another level.
```
$ node average.js
avg2: 4.631ms
So the average is? -> 50000.5
```

Ta-da, it went from 8000ms to 4ms. Well, this is actually only natural: you don't need to traverse all the data to calculate the average - you just repeat one simple operation. Even increasing the data count to 1,000,000, the execution time is about 10ms, barely changing.
And the reason the time increased at all when going from 100,000 to 1,000,000 data points in this code is simply that the for loop runs longer. A real real-time processing system doesn't feed data in with a loop like this; data arrives asynchronously and continuously, and the average function is executed through event listeners, so in actual use cases you can consider each update to run at a uniform speed at all times.
So what principle makes this algorithm work?
What Principle Makes This Work?
Googling "average filter algorithm" shows that many people have already posted the process of deriving the recurrence expression from the batch expression. But those posts tend to read like the writer's personal archive, so there aren't really any friendly explanations.
Besides, for people who aren't comfortable with math, a formula derivation can't be an intuitive explanation. So rather than deriving the recurrence expression from the batch expression, I'll try to explain it in a more intuitive, simple way. Let's first look at the average filter algorithm's formula:

$$\bar{x}_n = \bar{x}_{n-1} \cdot \frac{n-1}{n} + x_n \cdot \frac{1}{n}$$

In this expression, $n$ is the total data length so far, $\bar{x}_{n-1}$ is the average up to the previous data point, and $x_n$ is the newly entered data. So what does this expression mean, such that it can calculate averages?
The important keyword in this expression is weight. The condition for abandoning the batch expression in favor of the recurrence expression was exactly "being able to reuse the previous average value." To do that, we need to quantify how much influence the previous average should have on the new average.
Looking at the expression closely, the previous average value $\bar{x}_{n-1}$ is multiplied by $\frac{n-1}{n}$. This factor $\frac{n-1}{n}$ is exactly the weight. Likewise, the newly entered data $x_n$ is multiplied by $\frac{1}{n}$, which is the new data's weight.
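For readers who do want one line of justification (my addition, not in the original post): multiplying the previous average by its weight simply rescales the previous sum by the new length,

```latex
\bar{x}_{n-1} \cdot \frac{n-1}{n}
  = \left( \frac{1}{n-1} \sum_{k=1}^{n-1} x_k \right) \cdot \frac{n-1}{n}
  = \frac{1}{n} \sum_{k=1}^{n-1} x_k
```

so adding the new data's weighted term $x_n \cdot \frac{1}{n}$ yields exactly $\frac{1}{n} \sum_{k=1}^{n} x_k$, the batch mean.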
For faster understanding, I’ll explain while looking at a simple example.
```javascript
const prevData = [1, 2, 3, 4];
const prevAverage = (1 + 2 + 3 + 4) / 4; // 2.5
```

Here's a simple dataset of length 4. I just hardcoded the batch expression for prevAverage so you can follow it more intuitively; the assigned value is 2.5, the average of 1, 2, 3, 4.
When the value 5 newly enters this dataset, the array length will become 5. When calculating using batch expression, you’d add all from 1 to 5 then divide by array length 5, but using recurrence expression, you just multiply the existing 4 elements’ average 2.5 by weight 4/5, multiply newly entered 5 by weight 1/5, then add them together.
Let’s execute the code below in console and check the results.
```javascript
const batch = (1 + 2 + 3 + 4 + 5) / 5;
const recurrenced = (2.5 * (4 / 5)) + (5 * (1 / 5));
console.log(batch, recurrenced);
// 3 3
```

Both results are printed identically as 3. In other words, simply put, we calculate what proportion the previous average and the new data each contribute to the new average, and add them together.
I think you probably roughly understand now. If you’re curious about the process of deriving recurrence expression from batch expression, Google “average filter” and you’ll find many posts written by others, so refer to those. I personally think understanding the abstract concept of how this algorithm works is more important than such formula derivation processes, so I won’t write separate derivation processes in this post.
That’s all for this post on efficiently calculating averages of real-time data.