Improving performance in Python

Daniel Molina published on July 15, 2012

6 min, 1069 words

All the source code of this post is available at github.

In the previous post, I recognized my predilection for Python. For me, it is a great language for create prototypes in many areas. For my research work, I usually creates/designs algorithms for continuous optimization using evolutionary algorithms. For these algorithms, languages like C/C++ or Java are widely used, specially for its good performance (to publish, it is usual to have to make many comparisons between algorithms, so the performance could be critical. However, for testing new ideas, many authors uses other tools like Mathlab that reduces the developer time at the cost of a higher computing time.

I agree that Mathlab is great for numerical algorithms, but I still prefer Python over Mathlab, because I'm more confortable with it, and have many libraries, and it's more simpler to call code in other languages, written in C or Java. That allow us to increase the performance, and I like to test how much it could be improved.

Several months ago, I start writing my most succesful algorithm, Memetic Algorithms based on LS Chaining, in Python. I had several doubts about the performance, so I start writing one element, an Steady-State Genetic Algorithm, in Python.

Calling C/C++ code from python

The first challenge I had to tackle was to allow my python program to use the same benchmark functions than other implementations, CEC'2005 benchmark. This benchmark define the functions to optimize, thus its main funtionality is evaluate my solutions, when each solution is a vector of real numbers, with a real fitness value. The benchmark code was implemented (by its authors) in C/C++. So, my python code have to call C++ code.

For doing that, I used the library boost::python, that is, in my opinion, the simpler way to call C/C++ code, specially when we uses numpy package.

In my case, it is very simple, because I need few functions:

#include <boost/python.hpp>
#include <boost/python/numeric.hpp>
#include <boost/python/list.hpp>
#include <iostream>
#include "cec2005/cec2005.h"
#include "cec2005/srandom.h"

using namespace boost::python;

Random r(new SRandom(12345679));

void set_function(int fun, int dim) {
    init_cec2005(&r, fun, dim);
}

double evalua(const numeric::array &el) {
   const tuple &shape = extract<tuple>(el.attr("shape"));
   unsigned n = boost::python::extract<unsigned>(shape[0]);
   double *tmp = new double[n];
  for(unsigned int i = 0; i < n; i++)
    {
      tmp[i] = boost::python::extract<double>(el[i]);
    }
  double result = eval_cec2005(tmp, n);
  delete tmp;
  return result;
}
...

BOOST_PYTHON_MODULE(libpycec2005)
{
    using namespace boost::python;
    numeric::array::set_module_and_type( "numpy", "ndarray");
    def("config", &set_function);
    def("evaluate", &evalua);
    ...
}

More info in the good boost::python documentation.

One we can call C/C++ code, we have implemented the algorithm in python code. The test code was the following:

from ssga import SSGA
from readargs import ArgsCEC05
import libpycec2005 as cec2005
import numpy

def check_dimension(option, opt, value):
    if value not in [2, 10, 30, 50]:
        raise OptionValueError(
            "option %s: invalid dimensionality value: %r" % (opt, value))

def main():
    """
    Main program
    """
    args = ArgsCEC05()

    if  args.hasError:
        args.print_help_exit()

    fun = args.function
    dim = args.dimension

    print "Function: %d" %fun
    print "Dimension: %d" %dim
    cec2005.config(fun, dim)
    domain = cec2005.domain(fun)
    print "Domain: ", domain
    ea = SSGA(domain=domain, size=60, dim=dim, fitness=cec2005.evaluate)

    for x in xrange(25):
        ea.run(maxeval=dim*10000)
        [bestsol, bestfit] = ea.getBest()
        print "BestSol: ", bestsol
        print "BestFitness: %e" %bestfit
        ea.reset()

if __name__ == "__main__":
    main()

This source code run the algorithm 25 times, and in each run the algorithm stops when they are created 10000*dim solutions. These conditions are indicated in the benchmark specification. The only parameter was the function (-f, used function 1 by default) and dimension (-d) from 10, 30, 50.

Profiling the computing time

How much time it takes? I have changed xrange(25) for xrange(1) and we have run its current version. The final time was 7 minutes for dimension 10, and 21 minutes for dimension 30 (for only one function).

Because I like to make more interesting things, that only waiting computing time, I use the profile, only one run for the function, to detect the functions/method more expensive in computing time.

python -m cProfile runcec.py -f 1 -d 10

The output was the following:

        2943600 function calls (2943531 primitive calls) in 31.031 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
....
      1    0.001    0.001    0.126    0.126 ssga.py:1(<module>)
    99940    0.561    0.000   17.463    0.000 ssga.py:109(cross)
        1    0.000    0.000    0.000    0.000 ssga.py:123(reset)
        1    5.559    5.559   51.129   51.129 ssga.py:126(run)
        1    0.000    0.000    0.000    0.000 ssga.py:14(__init__)
        1    0.000    0.000    0.000    0.000 ssga.py:158(getBest)
        1    0.000    0.000    0.000    0.000 ssga.py:31(set_mutation_rate)
    99940    0.730    0.000    1.885    0.000 ssga.py:45(mutation)
    12438    0.286    0.000    0.758    0.000 ssga.py:50(mutationBGA)
        1    0.002    0.002    0.002    0.002 ssga.py:77(initPopulation)
   105883    1.101    0.000    5.604    0.000 ssga.py:89(updateWorst)
        1    0.000    0.000    0.000    0.000 ssga.py:9(SSGA)
    99940    1.049    0.000   20.617    0.000 ssga.py:97(getParents)
...

With the profile we can observe the most expensive methods in our code: getParents (20 seconds), crossover operator (17 seconds), and updateWorst (5 seconds). These methods are the 85% of the computing time, and the first two methods the 74% of the computing time.

Optimising the code

That proves the majority of computing time is due to a minority of the code, only three methods. If we can optimize these methods, our code could be improved a lot.

We can uses again the boost::python package, but it's a bit tedious to use it. So, we have used the cython package. With cython we can optimize the source code adding information about the types.

For instead, Instead of the following code:

import numpy as np

def distance(ind1,ind2):
    """
    Euclidean distance
    ind1 -- first array to compare
    ind2 -- second array to compare

    Return euclidean distance between the individuals

    >>> from numpy.random import rand
    >>> import numpy as np
    >>> dim = 30
    >>> sol = rand(dim)
    >>> distance(sol,sol)
    0.0
    >>> ref=np.zeros(dim)
    >>> dist=distance(sol,ref)
    >>> dist > 0
    True
    >>> dist2 = distance(sol*2,ref)
    >>> 2*dist == dist2
    True
    """
    dif = ind1-ind2
    sum = (dif*dif).sum()
    return math.sqrt(sum)

we can write:

cimport numpy as np
cimport cython
DTYPE = np.double
ctypedef np.double_t DTYPE_t
ctypedef np.int_t BTYPE_t

def distance(np.ndarray[DTYPE_t, ndim=1]ind1, np.ndarray[DTYPE_t, ndim=1] ind2):
    """
    Euclidean distance
    ind1 -- first array to compare
    ind2 -- second array to compare

    ....
    """
    cdef np.ndarray[DTYPE_t, ndim=1] dif = ind1-ind2
    cdef double sum = (dif*dif).sum()
    return math.sqrt(sum)

We can see that is still very readable. we only have put information about the type and dimension in the vector parameters and about the variables, using the keyword cdef.

Let's see as an example the first method, the crossover operator, implemented in the crossBLX method:

import numpy as np
import math

def crossBLX(mother,parent,domain,alpha):
    """
    crossover operator BLX-alpha

    mother -- mother (first individual)
    parent -- parent (second individual)
    domain -- domain to check
    alpha  -- parameter alpha

    Returns the new children following the expression children = random(x-alpha*dif, y+alpha*dif),
                where dif=abs(x,y) and x=lower(mother,parents), y=upper(mother,parents)

    >>> import numpy as np
    >>> low=-5
    >>> upper = 5
    >>> dim=30
    >>> sol = np.array([1,2,3,2,1])
    >>> crossBLX(sol,sol,[low,upper],0)
    array([ 1.,  2.,  3.,  2.,  1.])
    """
    diff = abs(mother-parent)
    dim = mother.size
    I=diff*alpha
    points = np.array([mother,parent])
    A=np.amin(points,axis=0)-I
    B=np.amax(points,axis=0)+I
    children = np.random.uniform(A,B,dim)
    [low,high]=domain
    return np.clip(children, low, high)

We can see that it is very simple to implement using numpy, but it is still very slow. With cython I have defined directly implement the many operations, the following code:

def crossBLX(np.ndarray[DTYPE_t, ndim=1] mother,np.ndarray[DTYPE_t, ndim=1] parent,list domain, double alpha):
    """
    ...
    """
    cdef np.ndarray[DTYPE_t, ndim=1] C, r
    cdef int low, high, dim
    cdef double x, y
    cdef double I, A, B
    cdef unsigned i
    [low,high]=domain
    dim = mother.shape[0]
    C = np.zeros(dim)
    r = random.randreal(0,1,dim)

    for i in range(dim):
        if mother[i] < parent[i]:
           (x,y) = (mother[i],parent[i])
        else:
           (y,x) = (mother[i],parent[i])

        I = alpha*(y-x)
        A=x-I
        B=y+I

        if (A < low):
            A = low
        if (B > high):
            B = high

        C[i] = A+r[i]*(B-A)

    return C

It's true that the source code is more complicated, but it is still very readable. I have compared the two implementation to make sure both return the same values.

Measuring the improvement

How much these small changes in the code? I have profile the source code again and it gives me:

         1020045 function calls (1019976 primitive calls) in 18.003 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
....
        1    0.001    0.001    0.127    0.127 ssga.py:1(<module>)
    99940    0.425    0.000    2.432    0.000 ssga.py:109(cross)
        1    0.000    0.000    0.000    0.000 ssga.py:123(reset)
        1    5.415    5.415   17.864   17.864 ssga.py:126(run)
        1    0.000    0.000    0.000    0.000 ssga.py:14(__init__)
        1    0.000    0.000    0.000    0.000 ssga.py:158(getBest)
        1    0.000    0.000    0.000    0.000 ssga.py:31(set_mutation_rate)
    99940    0.699    0.000    2.006    0.000 ssga.py:45(mutation)
    12544    0.338    0.000    0.929    0.000 ssga.py:50(mutationBGA)
        1    0.002    0.002    0.002    0.002 ssga.py:77(initPopulation)
   105959    0.775    0.000    1.343    0.000 ssga.py:89(updateWorst)
        1    0.000    0.000    0.000    0.000 ssga.py:9(SSGA)
    99940    0.940    0.000    6.665    0.000 ssga.py:97(getParents)
....

We can see the improvement obtained.

Method	Python	Cython
cross :	17.4	2.4
getParents :	20.6	6.6
updateWorst :	5.6	1.3
Total	43.6	10.3

The new code takes only a 23% of the computing time of the previous code. With these changes, we have reduced the total time from 51 seconds to 18 code.

In perspective

Now, I run the source code without the profile, and test the source code obtaining the following time:

Method	dim=10	dim=30	dim=50
Python	44s	3240s (54m)	--
Cython	10s	28s	48s
Improvement	77%	99%	---

In the following table, we test the maximum time for one and 25 runs (the time depends on the function used).

#functions	dim=10	dim=30	dim=50
1	10s/18s	28s/40s	48s/1m
25	3/7m	15/21m	38m/

So, the total computing time is 7 minutes for dimension 10, and 21 minutes for dimension 30. These numbers are very acceptable, specially because we can test in parallel the different functions in a cluster of computers. Unfortunately, an implementation in Mathlab not only take more time, but also, for licensing reasons, it could not run in parallel without limit.

In resume, we can uses python code, not only to create experimental prototypes, but also to create functional prototypes.

And about the possible testing problem, I've been working on it, but I think it is enough for a post, didn't it? :-)

All the code refered in the post, both in python and cython, is available at github, if you want to see it.