Tuesday, February 21, 2017

The Black Magic of Deep Learning - Tips and Tricks for the practitioner

I've been using Deep Learning and Deep Belief Networks since 2013.
I was involved in a green field project and I was in charge of deciding the core Machine Learning algorithms to be used in a computer vision platform.

Nothing worked well enough, and when something did it wouldn't generalize: it required fiddling all the time, and when introduced to similar datasets it wouldn't converge. I was lost. Then I caught wind from academia that the new hype, Deep Learning, was here and it would solve everything.

I was skeptical, so I read the papers, the books and the notes. I then went and put everything I had learned to work.
Surprisingly, it was no hype: Deep Learning works, and it works well. However, it is such a new concept (even though the foundations were laid in the 70's) that a lot of anecdotal tips and tricks have started coming out on how to make the most of it (Alex Krizhevsky covered a lot of them, and in some ways pre-discovered batch normalization).

Anyway, to sum up, these are my tricks (that I learned the hard way) to make a DNN tick.
  • Always shuffle. Never allow your network to go through exactly the same minibatch twice. If your framework allows it, shuffle at every epoch.
  • Expand your dataset. DNNs need a lot of data, and the models can easily overfit a small dataset, so I strongly suggest expanding your original dataset. If it is a vision task, add noise, apply whitening, drop pixels, rotate, color-shift, blur, and everything in between. There is a catch though: if the expansion is too big you will be training mostly with the same data. I solved this by creating a layer that applies random transformations on the fly, so no sample is ever exactly the same. If you are going through voice data, shift and distort it.
  • This tip is from Karpathy: before training on the whole dataset, try to overfit on a very small subset of it; that way you know your network can converge.
  • Always use dropout to minimize the chance of overfitting. Use it after large (> 256 units) fully connected or convolutional layers. There is an excellent paper about this (Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning).
  • Avoid LRN pooling, prefer the much faster MAX pooling.
  • Avoid Sigmoid and TanH gates: they are expensive, get saturated, and may stop backpropagation. In fact, the deeper your network, the less attractive Sigmoids and TanHs are. Use the much cheaper and more effective ReLUs and PReLUs instead. As mentioned in Deep Sparse Rectifier Neural Networks, they promote sparsity and their backpropagation is much more robust.
  • Don't apply ReLU or PReLU gates before max pooling; apply them after, to save computation.
  • Don't use plain ReLUs; they are so 2012. Yes, they are a very useful non-linearity that solved a lot of problems, but try fine-tuning a new model and watch nothing happen because of bad initialization, with ReLUs blocking backpropagation. Instead use PReLUs with a very small multiplier, usually 0.1. They converge faster and will not get stuck like ReLUs during the initial stages (see Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification). ELUs are still good but expensive.
  • ALWAYS use Batch Normalization (check the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). It works and it is great: it allows much faster convergence and smaller datasets. You will save time and resources.
  • I don't like removing the mean as many do; I prefer squeezing the input data to [-1, +1]. This is more of a training-and-deployment trick than a performance trick.
  • Always go for the smaller models. If you are building and deploying deep learning models like me, you quickly understand the pain of pushing gigabytes of models to your users or to a server on the other side of the world. Go for the smaller models even if you lose some accuracy.
  • If you use the smaller models, try ensembles. You can usually boost your accuracy by ~3% with an ensemble of 5 networks.
  • Use Xavier initialization as much as possible, but only on large fully connected layers; avoid it on CNN layers. (An-explanation-of-xavier-initialization)
  • If your input data has a spatial dimension, try to go for CNNs end to end. Read and understand SqueezeNet; it is a new approach that works wonders, and try applying the tips above to it.
  • Modify your models to use 1x1 convolution layers where possible; the locality is great for performance.
  • Don't even try to train anything without a high end GPU.
  • If you are making templates out of models, or writing your own layers, parameterize everything; otherwise you will be rebuilding your binaries all the time. You know you will.
  • And last but not least, understand what you are doing. Deep Learning is the neutron bomb of Machine Learning: it is not to be used everywhere and always. Understand the architecture you are using and what you are trying to achieve; don't mindlessly copy models.
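Two of the tips above (reshuffling at every epoch and a layer that applies random transformations so no sample is ever the same) can be sketched in a few lines. This is only a minimal NumPy illustration under my own assumptions, not code from any framework; `augment` and `minibatches` are hypothetical helper names, and the transformation set (noise plus a random flip) is just an example of the wider set you would actually use:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(batch):
    """Apply random transformations so no minibatch is ever seen twice."""
    batch = batch + rng.normal(0.0, 0.05, batch.shape)  # additive noise
    if rng.random() < 0.5:                              # random horizontal flip
        batch = batch[:, :, ::-1]
    return batch

def minibatches(images, batch_size):
    """Reshuffle at every epoch and augment each minibatch on the fly."""
    order = rng.permutation(len(images))                # fresh shuffle per epoch
    for start in range(0, len(images), batch_size):
        yield augment(images[order[start:start + batch_size]])

# toy dataset: 8 "images" of 4x4 pixels, squeezed to [-1, +1] as suggested above
images = rng.uniform(-1.0, 1.0, (8, 4, 4))
batches = list(minibatches(images, batch_size=4))
print(len(batches), batches[0].shape)
```

In a real pipeline the same generator would also rotate, blur and color-shift, per the dataset-expansion tip.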
To get the math behind DL, read Deep Learning (Adaptive Computation and Machine Learning series).
It is an excellent book and really clears things up. There is a free PDF on the net, but buy it to support the authors for their great work.
For a history lesson and a great introduction, read Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing).
If you really want to start implementing from scratch, check out Deep Belief Nets in C++ and CUDA C, Vol. 1: Restricted Boltzmann Machines and Supervised Feedforward Networks.

Tuesday, August 9, 2016

YEAH, Yodigram is OUT

Coming back to this (rarely updated) blog to note another landmark moment for me. The project I've been working on for so long is finally at version 1, and we are lining up customers and interested parties. Yodigram is now a reality; I am very proud and very tired.

It's been kind of a turbocharged year: going through the ups and downs of a startup, designing a cutting-edge system from scratch, watching it run and being sold, Yodiwo winning the MITEF competition, taking my own side consulting jobs.
Yodigram is delivering super awesome results: products and brands are detected and classified automatically.

I believe I'm growing as a professional; it's the pressure, it either breaks you or makes you. I'm also growing as a machine learning engineer. I study hard, deeply and constantly; it's almost ridiculous.

We are now going through the deep learning revolution, and now that I have almost 2 years of practical experience with it I believe I can catch some of its hype wave. It is an exciting time for technologists.

I haven't kept to my studying schedule - maybe it was too ambitious - and I've found that my real interest lies in the Data Science / Machine Learning / Optimization domain rather than Data Engineering.

Tuesday, October 6, 2015

New beginning

It's been a long time since I last updated my blog.
This has been more of a journal for my thoughts and coming back to it seems strange.

For the last 4 years I've dabbled in the dirty world of software engineering.
When I say 'dirty' I use it in the 'time is tight, the product must ship now, we will fix it later' kind of way. It is a word I don't use lightly. I've seen projects go to hell, and I have spent countless hours debugging, because of these practices.

Nobody has ever called me a perfectionist in my life, quite the opposite, but I've grown to be diligent when tackling projects. It seems I have an eye for what can go wrong in big systems and for removing complexity. I wouldn't call it talent, I will just say I screwed up so many times I would be an idiot to not see them coming by now.

I've learned so much and tackled small and huge projects with various levels of success.
I now feel much more confident in my engineering skills. Confident in a way you can only be by measuring yourself not against an ideal but against fellow engineers.

During these years I honed my skills in Linux, embedded development, messaging, networking and much, much more. My mind, however, was constantly on how to get back to my real interest, which is math and machine learning. Keeping up with that area - a LOT happened in the field during these 4 years - while doing my real job has been really exhausting. Thankfully I was given the chance to use my skills on 2 projects, so there has been some overlap.

Now it is time for a new beginning, so I plan to start posting again. I've quit my job and I'll be working full time on a startup project with my friends and co-programmers @ www.yodiwo.com, doing machine learning and computer vision.

I made a very tight schedule for the next six months in order to get my skills and knowledge up to speed.

- C++, Erlang
- Signal Processing, Compressive Sensing
- Machine Learning, Computer Vision
- Big Data Tools (Spark, Hadoop, Scala)

Sunday, March 25, 2012

Cubic spline interpolation

It's been a long time since I actually coded any interpolation method.
Matlab is notorious for making you lazy: it's so easy to get things done that you tend to stop looking under the hood. A friend asked me for help on a cubic interpolation problem and, since that was too easy, I expanded it so I can use it in my projects.

The math behind cubic splines is really simple. You piecewise-fit cubic polynomials using 4 data values (two points and two tangents) in order to create a smooth spline that passes through all given points. The wikipedia sources are really good so I won't dive into the math. Instead I'll provide some matlab code for doing the dirty deed. Matlab (as always) has a command for this (spline) but we won't be using it because I like getting my hands dirty.
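Concretely, each segment is a cubic p(m) = a + b*m + c*m^2 + d*m^3 on the local parameter m in [0, 1]. Imposing the two endpoint values and the two endpoint tangents gives a small linear system, and its inverse is exactly the 4x4 matrix that appears in the code below:

```latex
p(0) = a = P_i, \quad
p'(0) = b = P'_i, \quad
p(1) = a + b + c + d = P_{i+1}, \quad
p'(1) = b + 2c + 3d = P'_{i+1}
\;\Longrightarrow\;
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
0 & 1 & 2 & 3
\end{bmatrix}^{-1}
\begin{bmatrix} P_i \\ P'_i \\ P_{i+1} \\ P'_{i+1} \end{bmatrix}
```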
function [ yy, xx] = cubicSpline(N,points,gradients)
% CUBICSPLINE - returns N interpolation points using cubic spline
%   interpolation method, this method cannot be used for space curves
% Input
% N         : number of interpolation points 
% points    : given points [x y]
% gradients : gradient at first and last point 
% Output    
% xx        : uniform spaced function interpolation at N points
% yy        : uniform spaced N points
    %% Validate input arguments
    if isempty(N) || N < 1
        error('N must be >= 1');
    end
    if isempty(points) || size(points,1) < 1
        error('points must have >= 1 rows');
    end
    if isempty(points) || size(points,2) ~= 2
        error('points must have 2 columns');
    end
    if isempty(gradients) || numel(gradients) ~= 2
        error('gradients must have 2 elements');
    end
    %% coefficient calculation part
    % get number of points
    [rows, ~] = size(points);
    % compute the inverse of the constraint matrix to be used
    matrix = inv([1 0 0 0 ; 0 1 0 0 ; 1 1 1 1; 0 1 2 3]);
    % initialize coefficients structure
    coefficients = zeros(rows-1,4);
    % given n points we must calculate n-1 polynomials
    for i = 2 : rows
        % calculate gradients using finite central differences,
        % falling back to the user-supplied values at the two ends
        if (i-1) == 1
            pStart = gradients(1);
        else
            pStart = (points(i,2) - points(i-2,2))/2;
        end
        if i == rows
            pEnd = gradients(2);
        else
            pEnd = (points(i+1,2) - points(i-1,2))/2;
        end
        % create vector [Pi P'i Pi+1 P'i+1]'
        vector = [points(i-1,2); pStart; points(i,2); pEnd];
        % calculate polynomial coefficients
        coefficients(i-1,:) = (matrix * vector)';
    end
    %% interpolation part
    % get max X and min X and interval
    minX = points(1,1);
    maxX = points(end,1);
    intervalX = (maxX - minX) / (N - 1);
    xx = minX : intervalX : maxX;
    % interpolate at given locations
    yy = zeros(1,N);
    splineIndex = 1;
    for i = 2 : N-1
        x = xx(i);
        % find the index of the spline segment containing x
        for j = splineIndex : rows-1
            if x >= points(j,1) && x < points(j+1,1)
                splineIndex = j;
                break;
            end
        end
        splineCoeffs = coefficients(splineIndex,:);
        % map x to the local parameter m in [0,1]
        m = (xx(i) - points(splineIndex,1))/...
            (points(splineIndex+1,1) - points(splineIndex,1));
        % compute value with given spline and m
        yy(i) = splineCoeffs(1) + splineCoeffs(2) * m + ...
            splineCoeffs(3) * m^2 + splineCoeffs(4) * m^3;
    end
    yy(1) = points(1,2);
    yy(end) = points(end,2);
end

This code can be used to interpolate y=f(x) functions. For example :

%% Demonstration of cubic splines
N = 100;
x = 0:1:10;
y = sin(x);
xOriginal = 0:0.1:10;
yOriginal = sin(xOriginal);
gradient = [0 0];
[yy, xx] = cubicSpline(N, [x' y'], gradient);
figure; hold on;
plot(xOriginal, yOriginal);
plot(xx, yy, '--');
plot(x, y, 'o');
legend('Original function','Interpolation spline','Given points');

gives us the following graph :

Any errors at the beginning and the end are due to the fact that I entered zero gradients at those points; provided the correct gradients, the result should be much more precise.

To make it somewhat useful in my projects, I use this function as a basis for calculating space curves. This excellent source explains that space curves are functions of a parameter u, such as y = f(u) and x = g(u).

function [ yy,xx ] = cubicSpline2d(N, points, gradients )
% CUBICSPLINE2D - returns N interpolation points using cubic spline
%   interpolation method; this method can be used for space curves
% Input
% N         : number of interpolation points 
% points    : given points [x y]
% gradients : gradient at first and last point 
% Output    
% xx        : uniform spaced function interpolation at N points
% yy        : uniform spaced N points
    %% Validate input arguments
    if isempty(N) || N < 1
        error('N must be >= 1');
    end
    if isempty(points) || size(points,1) < 1
        error('points must have >= 1 rows');
    end
    if isempty(points) || size(points,2) ~= 2
        error('points must have 2 columns');
    end
    if isempty(gradients) || numel(gradients) ~= 4
        error('gradients must have 4 elements');
    end
    % get number of points
    [rows, ~] = size(points);
    % parameterize the curve by the point index u
    u = (1 : rows)';
    x = points(:,1);
    y = points(:,2);
    % interpolate x(u) and y(u) independently
    [xx,~] = cubicSpline(N, [u x], gradients(:,1));
    [yy,~] = cubicSpline(N, [u y], gradients(:,2));
end

and the test script

%% Demonstration of cubic splines 2d
u = 0:0.5:2*pi;
N = numel(u)*10;
y = sin(u);
x = sin(u) + cos(u);
gradient = [0 0; 0 0];
[yy, xx] = cubicSpline2d(N, [x' y'], gradient);
figure; hold on;
plot(xx, yy);
plot(x, y, 'o');
legend('Interpolation spline','Given points');
title('Cubic space curve interpolation')

gives us the following plot :

You can download the code here.

Friday, March 2, 2012

Latest results

As part of my hand tracking project I'm posting these last videos. I believe I have reached the performance ceiling of the particle filter algorithm.

Unfortunately I want even better results, so I should move to more complex algorithms. The problem is that real-time performance is going to be much more difficult to achieve.

Contour Refinement

In my hand tracking project I've used color segmentation to get the hand contour.
The problem is that the contour is usually not perfect: there are parts missing and/or there is some noise. For that reason I coded this general-purpose contour refining function, which fits the contour to a feature map by moving each point along its normal.

Download file here

        /// <summary>
        /// Find a better position for each point of the contour by searching along
        /// the edge normals on the feature map
        /// </summary>
        /// <param name="_objectContour">The contour to be refined</param>
        /// <param name="_featureMap">The feature map to be refined unto</param>
        /// <param name="_normalOffset">The maximum number of pixels to offset</param>
        /// <param name="_featureThreshold">The minimum feature value acceptable</param>
        /// <returns>Refined Contour</returns>
        public static Seq<Point> ContourRefine(
            Seq<Point> _objectContour,
            Image<Gray, float> _featureMap,
            int _normalOffset = 5,
            float _featureThreshold = float.MaxValue,
            float _inertiaCoeff = 1.0f,
            float _multiplierCoeff = -1.0f)
        {
            List<Point> pointsFitted = new List<Point>();
            Point[] pointsArray = _objectContour.ToArray();
            for (int i = 0; i < pointsArray.Length; i++)
            {
                int noPoints = pointsArray.Length,
                    ki = (i + 1) % noPoints,
                    ik = (noPoints + i - 1) % noPoints;
                Point pointCurrent = pointsArray[i],
                      pointNext = pointsArray[ki],
                      pointPrev = pointsArray[ik];
                // get normals pointing in and out
                PointF pointNormalOut = NormalAtPoint(pointPrev, pointCurrent, pointNext, false),
                    pointNormalIn = NormalAtPoint(pointPrev, pointCurrent, pointNext, true);
                // get points away from the current point along each normal
                Point pointOut = new Point(
                        (int)Math.Round(pointNormalOut.X * _normalOffset) + pointCurrent.X,
                        (int)Math.Round(pointNormalOut.Y * _normalOffset) + pointCurrent.Y),
                    pointIn = new Point(
                        (int)Math.Round(pointNormalIn.X * _normalOffset) + pointCurrent.X,
                        (int)Math.Round(pointNormalIn.Y * _normalOffset) + pointCurrent.Y);
                LineSegment2D lineOut = new LineSegment2D(pointCurrent, pointOut),
                    lineIn = new LineSegment2D(pointCurrent, pointIn);

                // sample along the normals
                float[,] sampleIn = _featureMap.Sample(lineIn);
                float[,] sampleOut = _featureMap.Sample(lineOut);
                float maxByte = 0.0f, sample = 0.0f;
                int j = 0;
                bool inOut = false;
                // run through the normal pointing out to find the best fit
                for (int k = 0; k < sampleOut.Length; k++)
                {
                    sample = sampleOut[k, 0] + _multiplierCoeff * (float)Math.Pow(_inertiaCoeff, k);
                    if (sample > maxByte)
                    {
                        maxByte = sample;
                        j = k;
                        inOut = false;
                    }
                }
                // run through the normal pointing in to find the best fit
                for (int k = 0; k < sampleIn.Length; k++)
                {
                    sample = sampleIn[k, 0] + _multiplierCoeff * (float)Math.Pow(_inertiaCoeff, k);
                    if (sample > maxByte)
                    {
                        maxByte = sample;
                        j = k;
                        inOut = true;
                    }
                }

                // if the best feature value found exceeds the threshold, offset the
                // point along the winning normal and add it to the refined contour
                if (maxByte >= _featureThreshold)
                {
                    int x, y;
                    if (!inOut)
                    {
                        x = (int)Math.Round((float)j / (float)sampleOut.Length * pointNormalOut.X * _normalOffset);
                        y = (int)Math.Round((float)j / (float)sampleOut.Length * pointNormalOut.Y * _normalOffset);
                    }
                    else
                    {
                        x = (int)Math.Round((float)j / (float)sampleIn.Length * pointNormalIn.X * _normalOffset);
                        y = (int)Math.Round((float)j / (float)sampleIn.Length * pointNormalIn.Y * _normalOffset);
                    }
                    pointsFitted.Add(new Point(pointCurrent.X + x, pointCurrent.Y + y));
                }
            }
            _objectContour.PushMulti(pointsFitted.ToArray(), BACK_OR_FRONT.BACK);
            return _objectContour;
        }

        /// <summary>
        /// Calculate the normal at a given point
        /// </summary>
        /// <param name="_prevPoint">Previous point</param>
        /// <param name="_currentPoint">Current point</param>
        /// <param name="_nextPoint">Next point</param>
        /// <param name="_inOut">In or out flag</param>
        /// <returns>Normal at point</returns>
        public static PointF NormalAtPoint(
            Point _prevPoint,
            Point _currentPoint,
            Point _nextPoint,
            bool _inOut = true)
        {
            PointF normal;
            float dx1 = _currentPoint.X - _prevPoint.X,
                  dx2 = _nextPoint.X - _currentPoint.X,
                  dy1 = _currentPoint.Y - _prevPoint.Y,
                  dy2 = _nextPoint.Y - _currentPoint.Y;
            if (_inOut)
            {
                normal = new PointF((dy1 + dy2) * 0.5f, -(dx1 + dx2) * 0.5f);
            }
            else
            {
                normal = new PointF(-(dy1 + dy2) * 0.5f, (dx1 + dx2) * 0.5f);
            }
            return NormalizePoint(normal);
        }

        /// <summary>
        /// Normalize a given point so that its length equals one
        /// </summary>
        /// <param name="_point">Point to normalize</param>
        /// <returns>Normalized point</returns>
        public static PointF NormalizePoint(PointF _point)
        {
            float length = (float)Math.Sqrt(_point.X * _point.X + _point.Y * _point.Y);
            if (length > 0.0f)
            {
                return new PointF(_point.X / length, _point.Y / length);
            }
            return new PointF(0.0f, 0.0f);
        }

Monday, February 27, 2012

HAAR xml file

Because many people have asked for it, and because I believe it will make your life easier, I'm giving you my trained hand HAAR cascade xml file.
It's trained on about 20k positives and 20k negatives and works at any orientation.
Watch out for high false positive rates. It also works with the CUDA version of OpenCV.

It will help you but it won't make you happy.

The xml download. In a later post I will show you how to make haar cascade perform even better.
