Daniel Gustaw - Benford's Law for the Fibonacci Sequence in Java, Rust, and Node JS

It was 1992. In the town of Wayne (Arizona USA), a verdict was reached for James Nelson - the chief accountant and manager of Arizona State Treasurer. His false checks, through which he embezzled almost 2 million dollars, were detected because the distribution of the first digits in the swindled amounts deviated from Benford’s distribution.

On the first positions imagined by the accountant, the values 7, 8, and 9 were too often present - typical values perceived by us as “more” random than 1, 2, or 3.

From this entry, you will learn what Benford’s Distribution is and why it is observed in many data sets. Later, we will discuss the Fibonacci sequence and its basic properties. Finally, we will write a program to check if Benford’s Distribution applies to the Fibonacci sequence. The program will be written in three languages:

Java
Rust
Node JS

We will compare the results of its performance.

Benford’s Distribution

Benford’s Distribution is a probability distribution of the occurrence of specific numbers in the leading positions in many observed data sets. For it to occur, the following conditions must be met:

the set of values should span many orders of magnitude
the probability should be invariant with respect to scale and base

An example of a size distribution where the first digit approximately follows Benford’s law. The exponential decay of the distribution is evident as the value axis densifies.

Distribution of sizes encompassing one order of magnitude. Usually, the first digits do not follow Benford’s distribution if the initial distribution is not sufficiently wide.

A great formal derivation of Benford’s distribution was presented by Arno Berger and Theodore P. Hill in the publication: “A basic theory of Benford’s Law”

This is a more than 100-page publication that extensively discusses the topic and I recommend it to everyone who loves mathematics. A shorter and simpler derivation worth noting was written by Victor Romero-Rochin

Examples of distributions following Benford’s law are clearly shown at the link:

Testing Benford’s Law

An intuitive reason for the higher representation of lower digits is the higher probability of occurrence of many smaller values, which, when overlapping with the stepwise variable density of digits as the order of magnitude increases, causes a shift towards higher representation of lower digits in the first positions.

Since in this article, Benford’s distribution is merely a pretext for comparing the performance of programs written in various languages and not the main topic, I will allow myself to limit its description to showing the best publications, the derived formula, and a few examples.

The formula for the probability of the digit d occurring in the first position is:

Examples I will show come from the website [deltami.edu.pl](http://www.deltami.edu.pl/temat/matematyka/zastosowania/2016/03/21/Fenomen_rozkladu_Benforda/)

Uniform distribution of uniform distribution

From the set of natural numbers ranging from 1 to 9999, we randomly draw a number p, using a random number generator with a uniform distribution. Next, from the range of natural numbers from 1 to p, we randomly draw a number r, also using the uniform distribution.

Atomic mass of elements from the periodic table

Let’s take a look at the periodic table of chemical elements, more specifically, one of the parameters of each element - atomic mass.

Surface area of world countries in km²

The last example is related to geography - let’s take a look at the surface area of all the countries in the world in km².

Benford’s Law

Benford’s discrete distribution for the decimal system also known as the law of first (significant) digits.

As we see, all these sets of numbers have the same property – invariance with respect to scale, base, and extension by several orders of magnitude.

Fibonacci Sequence

The Fibonacci sequence is a sequence of natural numbers with a recursive definition:

where

Its initial values are:

1,1,2,3,5,8,13,21,34,55,89

This is a sequence that we can often observe in nature: in water vortices, in the shape of tornadoes, in the arrangement of flowers, in the branching of plants, and in the division of insect bodies. Its prevalence fascinates researchers of this phenomenon. Just like the prevalence of exponential or quadratic functions, it results from the simplicity of the formula and being a good approximation for much more complex systems observed in reality.

The ratios of successive values of the sequence converge to the golden ratio. The proof follows directly from the definition.

Java

To do this in Java, the import of the java.math.BigInteger module is required.

import java.math.BigInteger;

In the file Benford.java in the class Benford, we will create a function generateFibonacci that will allow us to prepare the sequence.

public class Benford {
    private static BigInteger[] generateFibonacci(int n) {
        BigInteger[] fib = new BigInteger[n];
        fib[0] = BigInteger.ONE;
        if(n == 1) return fib;
        fib[1] = BigInteger.ONE;
        for (int i = 2; i < n; i++)
            fib[i] = fib[i - 1].add(fib[i - 2]);
        return fib;
    }

It is worth noting that instead of 1 we use BigInteger.ONE to maintain type compatibility. Similarly, instead of classic addition by +, we use the add method defined on BigInteger objects.

In the main method, we prepare the Fibonacci sequence.

    public static void main(String[] args) {
        BigInteger[] numbers = generateFibonacci(
            args.length > 0 ? Integer.parseInt(args[0]) : 1000
        );

Thanks to args, we can use the argument entered by the user. If it is not provided, the default value is 1000.

Next, the digits array is filled with the counts of the digits.

        int[] digits = new int[10];

        for (BigInteger number : numbers)
            digits[Integer.valueOf(number.toString().substring(0, 1))]++;

At the end, we display a table comparing the results with the theoretical predictions.

        System.out.print("N   Ben        Fib\n");
        for (int i = 1; i < digits.length; i++)
            System.out.printf("%d %10.6f %10.6f\n",
                    i,
                    (double) digits[i] / numbers.length,
                    Math.log10(1.0 + 1.0 / i)
            );
    }
}

We execute the code by typing java Benford.java and get a result confirming our theory:

Rust

We start projects in Rust with the command

cargo new benford

a file Cargo.toml is created in the benford directory with the content

[package]
name = "b"
version = "0.1.0"
edition = "2018"

[dependencies]

and the file src/main.rs with the content

fn main() {
    println!("Hello, world!");
}

It’s very nice that Rust welcomes us in such a pleasant way, making it easier to start working with this language.

To compile the program, we execute the command.

cargo build

It can then be started using the command

./target/debug/benford

To compile and run the program simultaneously, we will use the command

cargo run

While in Java we used one package for handling large integers, in Rust we need two: num-bigint and num-traits. We will add them to the project by writing lines

num-bigint = "0.4.0"
num-traits = "0.2.14"

under the [dependencies] key in the Cargo.toml file. The versions of the packages will be automatically suggested by our IDE. Their use in the src/main.rs file requires writing

use num_bigint::BigUint;
use num_traits::{Zero, One};
use std::env;

Where Uint comes from unsigned integer, which are whole numbers that do not spare one bit for the sign, because they are always positive. The function generating the Fibonacci sequence will be similar to the one in Java.

fn generate_fibonacci(n: usize) -> Vec<BigUint> {
    let mut fib = vec![Zero::zero(); n];
    fib[0] = One::one();
    if n == 1 { return fib; }
    fib[1] = One::one();
    for i in 2..n {
        fib[i] = &fib[i - 1] + &fib[i - 2];
    }
    return fib;
}

We see that the main difference lies in naming the types. In the main function, we generate the sequence in the same way by saving it to an array.

fn main() {
    let args: Vec<String> = env::args().collect();

    let numbers = generate_fibonacci(
        if args.len() > 1 { (&args[1]).trim().parse().unwrap() }
        else { 100 }
    );

This time, the argument array starts with the program name and the value passed from the command line has an index of 1.

we prepare an array with the count of digits in the first positions

let mut digits = vec![0; 10];

An analogous record to that in Java allows us to count the digits and store the number of their occurrences in an array.

for n in numbers.iter() {
    digits[n.to_string()[..1].parse::<usize>().unwrap()] += 1;
}

At the end, we display the results in the console using the following loop.

    println!("N   Fib        Ben");
    for i in 1..digits.len() {
        println!("{:} {:10.6} {:10.6}",
                 i,
                 digits[i] as f64 / numbers.len() as f64,
                 (1.0 + 1.0 / i as f64).log10()
        );
    }
}

Node JS

A unique feature of the presented program is that, like few other projects in node js, it does not contain a list of required packages. We do not need to import any modules responsible for handling large numbers. Constants of type BigInt are created by adding the letter n after the number. As a result, the function for generating the Fibonacci sequence takes the form:

const generate_fibonacci = (n) => {
    let fib = [];
    fib[0] = 1n;
    if(n === 1) return fib;
    fib[1] = 1n;
    for (let i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    return fib;
};

However, we can easily imagine that someone writing code does not know the difference between 1n and 1 or simply forgot that they are working with large numbers and wrote it like this:

const generate_fibonacci = (n) => {
    let fib = [];
    fib[0] = 1;
    if(n === 1) return fib;
    fib[1] = 1;
    for (let i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    return fib;
};

To simulate both cases, let’s write a universal function controlled by the --cheat flag.

const generate_fibonacci = (n) => {
    let fib = [];
    fib[0] = process.argv[3] === '--cheat' ? 1 : 1n;
    if(n === 1) return fib;
    fib[1] = process.argv[3] === '--cheat' ? 1 : 1n;
    for (let i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    return fib;
};

In the further part, it will become apparent how colossal differences in performance and correctness of the program are made by this one symbol n. When writing software, it is important to understand what value ranges the program operates on and to correctly handle their boundaries.

In this regard, node requires particular responsibility from the programmer, as trying to save the program by throwing an error leads to compromises that can sometimes be brilliant, but can also be very deceptive.

We will use the generate_fibonacci function in the main function as follows

const main = () => {
    const numbers = generate_fibonacci(
       parseInt(process.argv[2]) || 1000
    );

Of course, in node we are not obliged to define a main function, but I consider it good practice for the program to have a clearly defined starting point and well delineated boundaries between declaring functions and procedures and using them.

By the way, you probably noticed that argv is indexed completely differently again. As you can see, each language has its own convention here, and this time the first two arguments are the directory and the program name.

An array of ten zeros, which will contain counts of the recorded first digits, can be declared as follows.

const digits = [...new Array(10)].map(() => 0);

Counting itself is just as simple as in other languages.

numbers.forEach(n =>
    digits[n.toString().substr(0, 1)]++
)

On the other hand, printing results instead of using a template where we input values as arguments directly uses template strings.

    process.stdout.write("N   Ben        Fib\n");
    for (let i = 1; i < digits.length; i++) {
        const ben = digits[i] / numbers.length;
        const fib = Math.log10(1 + 1 / i);
        process.stdout.write(
            `${i}   ${ben.toFixed(6)}   ${fib.toFixed(6)}\n`
        )
    }
}

At the end, we activate our program by calling the main function.

main();

Performance Comparison of Programs

By performance of programs, I mean the performance of compiled programs without counting the compilation time. Therefore, in the case of Java, we need to compile using the command

javac Benford.java

As a result of this command, a file Benford.class will be created.

For Rust, the compilation performed by cargo build creates a developer version that is not optimized. To create an optimized version, you need to add the release flag.

cargo build --release

For example, for n=1000, each program displays the same result, but the computation times vary.

Rust crushes the competition. Node.js shows the same results and very similar, even good time, regardless of whether we started from 1 or 1n. Java, despite significant cpu usage, takes so long to start that it performs the worst in this test.

For n=10000, the result of Java only increases by 10 times, even though Rust performs calculations two orders of magnitude longer, and Node 24 times longer.

Do not be misled by the fact that n has increased “only” 10 times. The values processed by the program grow at a geometric pace, quickly reaching gigantic values. For example, for n=10000, the value of the sequence is:

The difference in performance increase comes from the fact that Java has the heaviest startup process. Node, although quite lightweight, still requires loading the entire interpreter, which is why Rust, having the fastest startup, demonstrated how much computational complexity has actually increased.

Since the main burden here is adding larger and larger numbers, whose length increases linearly, we can expect a complexity of O(n^2), which Rust presents.

The last conclusion is that a program written in Node JS with the --cheat flag “did not notice” that it is running incorrectly. Its results show that despite the fast execution, it did not accurately count the leading digits. Knowing the limitations of the Number type in Node, we know that it cannot exceed the value of Number.MAX_VALUE, which is 1.7976931348623157e+308, while Log10[Fibonacci[1000]] equals 208.638, but Log10[Fibonacci[10000]] is already 2089.53. Therefore, the numbers that the program in Node adds are Infinity.

Of course, Infinity + Infinity = Infinity, which significantly reduces computation time, but the first “digit” of infinity for Node is I because we calculate it with the command.

n.toString().substr(0, 1)

If I stopped at the comparison of the pair of results for three programs, I wouldn’t be myself. Curiosity compels me to look deeper and prepare a chart showing how computation time increased with the length of the sequence.

I will also show the measurement point 50,000.

However, discussing each one individually is not as valuable as conducting a whole series of measurements and overlaying them on a common graph.

Measuring program performance depending on the argument

To effectively measure the performance of programs, we need to solve several problems

separate the program’s output streams from the performance measurement
choose a set of values for which we will conduct the measurement
draw the graphs

Separating the program stream from the time measurement stream

In bash, programs communicate by redirecting data streams. The output of one program can become the input of another, which may want to save the processed information to a file.

For a simple execution:

java Benford 10

result in the form of:

N   Ben        Fib
1   0.300000   0.301030
2   0.200000   0.176091
3   0.200000   0.124939
4   0.000000   0.096910
5   0.200000   0.079181
6   0.000000   0.066947
7   0.000000   0.057992
8   0.100000   0.051153
9   0.000000   0.045757

will be displayed in the terminal because the terminal is the default output for the data stream produced by this program. The data produced by the program defaults to being sent out through standard output. We can redirect it elsewhere using 1> or simply > and omit the 1, which is default.

Executing java Benford 10 > out will not show anything but will create a file with data from standard output.

However, when we precede the program with the command time, that is, we write

time java Benford 10

it will turn out that we will receive in the terminal

N   Ben        Fib
1   0.300000   0.301030
2   0.200000   0.176091
3   0.200000   0.124939
4   0.000000   0.096910
5   0.200000   0.079181
6   0.000000   0.066947
7   0.000000   0.057992
8   0.100000   0.051153
9   0.000000   0.045757
java Benford 10  0.12s user 0.02s system 153% cpu 0.091 total

however, attempting to capture the execution time to a file as before using > will end with displaying the line

java Benford 10  0.12s user 0.02s system 153% cpu 0.091 total

in the terminal, and all other output will be redirected to the file. This is because time does not mix its data with the data from the standard stream. Instead, it uses the error stream 2>.

Our goal is to hide the data from the standard stream. We can do this by redirecting it to /dev/null. This means

time java Benford 10 > /dev/null

However, the error stream is impossible for us to process unless we redirect it to the main stream. We will achieve this with the command

(time java Benford 10 > /dev/null) 2>&1

The result of these two looks the same, but the key difference is that in the second case, we can process the stream by redirecting it to awk.

For example, a command that involves data processing:

(time java Benford 10 > /dev/null) 2>&1 | awk '{print $1,10,$6,$10,$12}'

will only return on standard output

java 10 0.11s 154% 0.090

to clean these results of the s and % sign we can add

| tr -d "s%"

If we want to view this result while saving it to a file, tee comes to our aid - the third of my favorite tools alongside Kafka and Express.

Just add at the end:

| tee -a logs

and the shown line will be appended at the end of the logs file. Now let’s assume that we want to wrap the recently generated command in a loop iterating over the sequence:

for i in $(seq 5 5 25); do echo $i; done;

The sequence will display to us

But if we naively pasted $i into print in awk like this:

for i in $(seq 5 5 25); do (time java Benford $i > /dev/null) 2>&1 | awk '{print $1,$i,$6,$10,$12}' | tr -d "s%" | tee -a logs; done;

we would get a repeatedly repeated line

java java Benford $i > /dev/null  0.12s user 0.02s system 152% cpu 0.091 total 0.12 152 0.091

It is like this because i does not exist inside print unless we put it there. Therefore, $i is equal to $0, which corresponds to the entire line, not a selected column. To use variables inside the print context in awk, we can use the -v flag. The correct syntax of the command is:

for i in $(seq 5 5 25); do (time java Benford $i > /dev/null) 2>&1 | awk -v i=$i '{print $1,i,$6,$10,$12}' | tr -d "s%" | tee -a logs; done;

and its result is simultaneously writing to the logs file and displaying the line on the screen:

java 5 0.11 150 0.090
java 10 0.12 153 0.089
java 15 0.11 152 0.088
java 20 0.10 154 0.087
java 25 0.11 153 0.089

If the topic of streams in bash interests you, I recommend the introduction of Justin Albano.

Preparation of a series of values `n` for performance analysis

By dividing the measurement range into parts, one should increase the density of measurements where their cost is low (short program execution time) and variability and interesting behaviors are expected. For us, this is the change in the ratio of computation time to startup time (typical for small values of n). Thus, we have two reasons not to divide the measurement range into equal pieces and not to use seq. Instead, we can generate a series whose density decreases as n increases. For example, a module in Mathematica:

Module[{steps = 100, minY = 1, maxY = 50000, pow = 3},
   Table[maxY (minY + maxY (n)^pow)/(minY + maxY), {n, 0, 1,
     1/(steps - 1)}]] // Ceiling // DeleteDuplicates

will result in a series with the following distribution

We save it to the file n_values with the command

Export["~/exp/benford/n_values.csv", %]

Preparing charts comparing program performance

We will save the performance measuring code in a file measure.sh

#!/usr/bin/zsh

while IFS= read -r i
do
 (time node benford.js "$i" > /dev/null) 2>&1 | awk -v i="$i" '{print $1,i,$6,$10,$12}' | tee -a logs;
 (time ./target/release/benford "$i" > /dev/null) 2>&1 | awk -v i="$i" '{print "rust",i,$5,$9,$11}' | tee -a logs;
 (time java Benford "$i" > /dev/null) 2>&1 | awk -v i="$i" '{print $1,i,$6,$10,$12}' | tee -a logs;
done;

We replaced the for loop with while. Using cat n_values.csv is permissible but not recommended

koalaman/shellcheck Wiki

It’s also worth enclosing $i in quotes. When we fetched data from the sequence, it didn’t matter, and it won’t affect the program now, but it’s good practice to use quotes because if the variables contain values with spaces, the words separated by spaces may be treated as arguments in subsequent positions instead of one value.

koalaman/shellcheck Wiki

We measure by entering

time zsh measure.sh

Uploading the created file

logs = Import["/home/daniel/exp/benford/logs", "Data"];

and we draw a graph

ListLogPlot[
 Table[{#[[1]],
     PadLeft[ToExpression /@ StringSplit[ToString[#[[2]]], ":"],
        2]*{60, 1} // Total} & /@
   GroupBy[logs, First][i][[All, {2, 5}]], {i, {"java", "rut",
    "node"}}],
 PlotLegends -> {"Java", "Rust", "Node"}, ImageSize -> Full,
 Frame -> True,
 FrameLabel -> {"Fibonaccin sequence length", "Total time"},
 LabelStyle -> Directive[FontSize -> 16]]

Summary:

the long startup time of the Java virtual machine prevented it from taking off in the early phase, making it perform the worst for small values of n.
surprisingly well managed Node, which although not recommended for CPU-intensive tasks, has a really well-optimized implementation of BigInt
unbeatable for low n turned out to be Rust, which, as it is not burdened by any runtime environment or interpreter, however succumbed to Java for large n, whose team has been improving the performance of Java in successive versions for years.

I realize that these programs can be optimized, for example in terms of memory usage - not holding entire arrays with strings. I tried to write them in a way that is as similar and simple as possible in all languages. If you noticed a mistake in them, I would be very grateful for bringing it to my attention in the comments.

Update: Implementations of large numbers in Rust

DK13 - a user of the wykop service pointed out that in Rust we have different implementations of large numbers and which one we choose significantly affects the final result.

Write once, debug everywhere.

https://github.com/tczajka/bigint-benchmark-rs#results

I will check this soon and update the content of this post.

Benford's Law for the Fibonacci Sequence in Java, Rust, and Node JS

Benford’s Distribution

Fibonacci Sequence

Java

Rust

Node JS

Performance Comparison of Programs

Measuring program performance depending on the argument

Separating the program stream from the time measurement stream

Preparation of a series of values `n` for performance analysis

Preparing charts comparing program performance

Update: Implementations of large numbers in Rust

Other articles

Data scraping in Perl

How the war for compatibility shaped the frontend?

Selected syntax in JavaScript ES2020, ES2021 and ES2022

Benford’s Distribution

Fibonacci Sequence

Java

Rust

Node JS

Performance Comparison of Programs

Measuring program performance depending on the argument

Separating the program stream from the time measurement stream

Preparation of a series of values n for performance analysis

Preparing charts comparing program performance

Update: Implementations of large numbers in Rust

Other articles

Data scraping in Perl

How the war for compatibility shaped the frontend?

Selected syntax in JavaScript ES2020, ES2021 and ES2022

Preparation of a series of values `n` for performance analysis