One of the most basic concepts in programming is the loop. Yet, Python is bad at making fast loops.

Loops are especially important in Machine Learning where most algorithms are supposed to be used on big datasets.

Slow code is one of the biggest flaws of Python and fixing this can boost the speed of your algorithms and, even better, your productivity.

Today we are going to see two optimization techniques that can be leveraged to make loops in numeric code really fast:

# What is Array Broadcasting?

It is a simple technique that you already use every day when you write

```
import numpy as np
a = np.arange(100)
b = a * 2
```

Here, NumPy understood that when you write `a * 2`

, you actually want to
multiply *every* element of `a`

by 2.

Array broadcasting allows more complex behaviors, see this example:

```
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> b = np.array([1,2])
>>> b
array([1, 2])
>>> a * b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (2,)
>>> b = b.reshape(2, 1)
>>> b
array([[1],
[2]])
>>> a * b
array([[ 0, 1, 2],
[ 6, 8, 10]])
```

What the hell just happened?

On the first attempt, we tried the multiplication between two arrays with different numbers of dimensions (2 and 1).

The second time, we gave `b`

a second dimension. When a dimension has
only 1 component, it can be automatically extended. Here, just before
doing the multiplication, it is as if we wrote:

```
>>> b = b.repeat(3, axis=1)
>>> b
array([[1, 1, 1],
[2, 2, 2]])
```

Broadcasting can also extend multiple components as we are going to see in the examples.

Note that it is possible to activate the broadcasting without calling
`b.reshape`

:

```
>>> a = np.arange(6).reshape(2, 3)
>>> b = np.array([1,2])
>>> b[:, np.newaxis].shape
(2, 1)
>>> b[:, np.newaxis]
array([[1],
[2]])
>>> a * b[:, np.newaxis]
array([[ 0, 1, 2],
[ 6, 8, 10]])
```

You can remember that `np.newaxis`

inserts a $1$ (a new axis) in the
shape of an array.

`np.newaxis`

is generally prefered to `reshape`

because you don’t need
to write the array sizes and it is more efficient as the array is not
copied in memory.

# What is Numba?

Numba is *just* a compiler that takes a subset of the Python language
and compiles it to a native function.

Let’s take the simplest example: a function that adds two objects.

```
from numba import njit, prange
@njit
def f(a, b):
return a + b
```

As you can see, Numba applies a
decorator to
`f`

.

Readers already familiar with Numba will be surprised I did not use
`jit`

decorator. `njit`

is equivalent to using `jit(nopython=True)`

. The
`nopython`

mode ensures that the code is really compiled without using
the interpreter anymore. It means that you have to restrict yourself a
bit more but if you don’t, you get an error instead of a slow code
(which is much easier to correct).

I recommend to always use `njit`

by default and use `jit`

only if you
get errors and know you are going to get a slower code.

Now, basic Python data types are going to be converted to Numba data types. All numeric types and arrays of NumPy are supported so you shouldn’t even notice it.

However, you can get surprises like this:

```
>>> f(2**63-1, 2**63-1)
-2
```

The problem is that Python’s `int`

is automatically converted to a
`int64`

hence you get the common pitfalls of bounded ints.

Let’s take a look at NumPy datatypes and check the signatures that Numba used:

```
>>> f(np.array([1., 2., 3.]), np.array([4., 5., 6.]))
array([5., 7., 9.])
>>> f.signatures
[(int64, int64), (array(float64, 1d, C), array(float64, 1d, C))]
```

The function `f`

has been called and successfully compiled with two
different data types: first with two `int64`

, then with a 1-dimensional
array of `float64`

(the `C`

stands for C-style array order but you can
ignore it).

But adding two integers or arrays is not very impressive. What makes Numba shine are really loops like in the example.

Note: don’t reimplement linear algebra computations (like `np.dot`

for
matrices) in Numba, the Numpy implementation is very optimized and can
be called in Numba.

# Pros and cons of each method

**Array Broadcasting**’s pros:

- you only write Python code
- it is very fast for any array size
- there is no compilation overhead
- you don’t need to install anything, it is just a feature of NumPy
- it works similarly in PyTorch where the speedup is even more consequent when you apply Autograd.

**Array Broadcasting**’s cons:

- only works for specific loops (that are hopefully very frequent)
- can use more memory compared to a loop-based approach
- might be hard to understand the first time

**Numba**’s pros:

- the speed is amazing, within a 2-5x factor from C++
- you only write Python code
- it works for almost any numeric code you would like to write
- Numba comes with automatic parallelization — Yes, that’s right, Numba can exploit all cores in your processor to get even more amazing speedups!
- there is now support for GPU-based computations

**Numba**’s cons:

- can only compile a subset of Python. For example, you cannot use any Python object.
- the code needs to be more verbose to take full benefits from Numba
- requires a C compiler compatible with your Python installation for the compatibility requirements, but most people should be fine
- there is a little compilation overhead (that can be overcome with
the option
`cache=True`

) - you must use a widespread processor architecture: see here
- cannot be applied to PyTorch code (but PyTorch come with its own JIT system)

# A concrete example: Closest centroid

Suppose we have $n + k$ vectors of dimension $d$. The $n$ vectors are data points an the $k$ are centroids. We want to assign to every data point the closest centroid. This example is actually the assignment step of Lloyd’s algorithm for K-means clustering.

Let’s implement this in Python with loops:

```
def lloyd_simple(data, centroids):
k, _ = centroids.shape
return np.array(
[
min(range(k), key=lambda i: np.sum((point - centroids[i]) ** 2))
for point in data
]
)
```

What if we could remplace the `min`

by some `np.argmin`

?

```
def distance_to_centroids(point, centroids):
return np.sum((point[np.newaxis, :] - centroids) ** 2, axis=1)
def lloyd_broadcast_1(data, centroids):
return np.array(
[
np.argmin(distance_to_centroids(point, centroids))
for point in data
]
)
```

`point[np.newaxis, :]`

tells that we will need to create multiple
instances of the point along the first axis to be able to compare them
to the centroids. `point[np.newaxis, :] - centroids`

creates an array of
shape `(k, d)`

.

Now, what if we can replace even the `for`

loop with a broadcast? We are
first going to compute the distance matrix for all data points and all
centroids, then do an `argmin`

over the dimension of the centroids.

```
def all_distances_to_centroids(data, centroids):
return np.sum((data[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2, axis=2)
def lloyd_broadcast_2(data, centroids):
return np.argmin(all_distances_to_centroids(data, centroids), axis=1)
```

`data[:, np.newaxis, :]`

is an array of shape`(n, 1, p)`

.`centroids[:, np.newaxis, :]`

is an array of shape`(1, k, p)`

- Because of the broadcasting, their substraction has shape
`(n, k, p)`

- Squaring doesn’t change shapes
`sum(axis=2)`

wipes the last dimension- Finally,
`all_distances_to_centroids`

returns an array of shape`(n, k)`

which is the distance matrix

We just saw it was possible to simultaneously broadcast over 2 dimensions!

Note that this function uses much more memory as it build an array of
shape `(n, k, p)`

.

Finally, the Numba implementation:

```
@njit
def lloyd_numba(data, centroids):
n, d = data.shape
k, _ = centroids.shape
answer = np.empty(n, dtype=np.int64)
for i in range(n):
min_dist_i = np.finfo(np.float64).max
for j in range(k):
dist_i_j = 0
for u in range(d):
dist_i_j += (data[i, u] - centroids[j, u]) ** 2
if dist_i_j < min_dist_i:
min_dist_i = dist_i_j
answer[i] = j
return answer
```

and the version parallelized over the data points:

```
@njit(parallel=True)
def lloyd_numba_parallel(data, centroids):
n, d = data.shape
k, _ = centroids.shape
answer = np.empty(n, dtype=np.int64)
for i in prange(n):
min_dist_i = np.finfo(np.float64).max
for j in range(k):
dist_i_j = 0
for u in range(d):
dist_i_j += (data[i, u] - centroids[j, u]) ** 2
if dist_i_j < min_dist_i:
min_dist_i = dist_i_j
answer[i] = j
return answer
```

We changed only two things:

- we added the parallel option in
`njit(parallel=True)`

- we replaced
`range(n)`

with`prange(n)`

to tell what should be parallelized

# Benchmark

We generate some random data with:

```
n_small = 20000
n_big = 200000
k = 50
d = 100
data_small = np.random.uniform(size=(n_small, d))
data_big = np.random.uniform(size=(n_big, d))
centroids = np.random.uniform(size=(k, d))
```

Here are the results of the benchmark. I did them on a 2-threaded colab instance which you can find here with all the code from this article:

Function | Compilation | Small | Big |
---|---|---|---|

`lloyd_simple` |
0 s | 5.9 s | 59 s |

`lloyd_broadcast_1` |
0 s | 370 ms | 3.6 s |

`lloyd_broadcast_2` |
0 s | 725 ms | 7 s |

`lloyd_numba` |
58 ms | 137 ms | 1.4 s |

`lloyd_numba_parallel` |
57 ms | 84 ms | 921 ms |

We can see a few things:

- Basic broadcasting makes the code a lot faster, actually 16x faster.
`lloyd_broadcast_2`

is slower than`lloyd_broadcast_1`

because it builds a huge matrix of shape`(n, k, p)`

and memory allocation is not free. Be careful with your RAM: for the big dataset this amounts to $200000*50*100 = 10^9$ floating-point numbers that each take 8 bytes, totalling $10^9 / 2^{30} * 8 = 7,45$ GB.- Numba makes the code another 2.5x faster and does not use additional memory.
- Parallelization makes the code even faster and much more importantly scales with the number of processors: it will be about 4 times faster when you get 8 processors instead of 2.

# Conclusion

**Broadcasting** is a powerful technique that makes your code orders of
magnitude faster. While it allows greater speeds, it comes at the
expense of a greater memory usage which you should be careful about —
for example you should probably not broadcast over the size of your
dataset like we did. It is also the most mathematical and concise way
to describe your operations.

**Numba** is definitely the most flexible and fastest way to implement
loops and almost any numeric function in Python. Although it makes the
code a bit more complex, it is much easier to deploy than a C extension
or Cython. Furthermore (maybe the greatest advantage) it allows to
parallelize functions almost instantly and augment the speed even more
with multiple processors.