Tuesday, May 8, 2018

Analytics Hubris

The word hubris is ancient Greek and describes a man claiming godlike powers: in our case, the power to foresee, predict and control the future. Hubris is often followed by Nemesis, who punishes the men who have stepped too far.

Economic and business growth happens on the verge of hubris. Risky behavior (hubris) materializes as wealth if the venture succeeds, or as bankruptcy and pain if it fails. For this reason, and to keep the human spirit alive and vibrant, some amount of hubris is necessary.

When modeling any real world process, as we strip away complexity to make it computationally tractable, it is important to never lose sight of the abstractions we made.

We rely on the abstractions to get results BUT the abstractions are not the real thing.
As more and more of our work happens virtually and data analytics tools spread through the ranks of less experienced engineers, we will see more of this phenomenon, which I call Analytics Hubris.

Older statesmen and senior scientists were familiar with this behavior: younger men, in their craze to climb the ladder, mistake aptitude with a tool for wisdom.

I found this excerpt from the book Signals: The Breakdown of the Social Contract and the Rise of Geopolitics, which warns about the limitations of analytics and modelling (econometrics in this case) and the deeper knowledge you need in order to make decisions that have systemic impact.




Saturday, May 5, 2018

Akin's Laws of Spacecraft Design* (or anything engineering-based)

  1. Engineering is done with numbers. Analysis without numbers is only an opinion.
  2. To design a spacecraft right takes an infinite amount of effort. This is why it's a good idea to design them to operate when some things are wrong.
  3. Design is an iterative process. The necessary number of iterations is one more than the number you have currently done. This is true at any point in time.
  4. Your best design efforts will inevitably wind up being useless in the final design. Learn to live with the disappointment.
  5. (Miller's Law) Three points determine a curve.
  6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker.
  7. At the start of any design effort, the person who most wants to be team leader is least likely to be capable of it.
  8. In nature, the optimum is almost always in the middle somewhere. Distrust assertions that the optimum is at an extreme point.
  9. Not having all the information you need is never a satisfactory excuse for not starting the analysis.
  10. When in doubt, estimate. In an emergency, guess. But be sure to go back and clean up the mess when the real numbers come along.
  11. Sometimes, the fastest way to get to the end is to throw everything out and start over.
  12. There is never a single right solution. There are always multiple wrong ones, though.
  13. Design is based on requirements. There's no justification for designing something one bit "better" than the requirements dictate.
  14. (Edison's Law) "Better" is the enemy of "good".
  15. (Shea's Law) The ability to improve a design occurs primarily at the interfaces. This is also the prime location for screwing it up.
  16. The previous people who did a similar analysis did not have a direct pipeline to the wisdom of the ages. There is therefore no reason to believe their analysis over yours. There is especially no reason to present their analysis as yours.
  17. The fact that an analysis appears in print has no relationship to the likelihood of its being correct.
  18. Past experience is excellent for providing a reality check. Too much reality can doom an otherwise worthwhile design, though.
  19. The odds are greatly against you being immensely smarter than everyone else in the field. If your analysis says your terminal velocity is twice the speed of light, you may have invented warp drive, but the chances are a lot better that you've screwed up.
  20. A bad design with a good presentation is doomed eventually. A good design with a bad presentation is doomed immediately.
  21. (Larrabee's Law) Half of everything you hear in a classroom is crap. Education is figuring out which half is which.
  22. When in doubt, document. (Documentation requirements will reach a maximum shortly after the termination of a program.)
  23. The schedule you develop will seem like a complete work of fiction up until the time your customer fires you for not meeting it.
  24. It's called a "Work Breakdown Structure" because the Work remaining will grow until you have a Breakdown, unless you enforce some Structure on it.
  25. (Bowden's Law) Following a testing failure, it's always possible to refine the analysis to show that you really had negative margins all along.
  26. (Montemerlo's Law) Don't do nuthin' dumb.
  27. (Varsi's Law) Schedules only move in one direction.
  28. (Ranger's Law) There ain't no such thing as a free launch.
  29. (von Tiesenhausen's Law of Program Management) To get an accurate estimate of final program requirements, multiply the initial time estimates by pi, and slide the decimal point on the cost estimates one place to the right.
  30. (von Tiesenhausen's Law of Engineering Design) If you want to have a maximum effect on the design of a new engineering system, learn to draw. Engineers always wind up designing the vehicle to look like the initial artist's concept.
  31. (Mo's Law of Evolutionary Development) You can't get to the moon by climbing successively taller trees.
  32. (Atkin's Law of Demonstrations) When the hardware is working perfectly, the really important visitors don't show up.
  33. (Patton's Law of Program Planning) A good plan violently executed now is better than a perfect plan next week.
  34. (Roosevelt's Law of Task Planning) Do what you can, where you are, with what you have.
  35. (de Saint-Exupery's Law of Design) A designer knows that he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.
  36. Any run-of-the-mill engineer can design something which is elegant. A good engineer designs systems to be efficient. A great engineer designs them to be effective.
  37. (Henshaw's Law) One key to success in a mission is establishing clear lines of blame.
  38. Capabilities drive requirements, regardless of what the systems engineering textbooks say.
  39. Any exploration program which "just happens" to include a new launch vehicle is, de facto, a launch vehicle program.
  40. (alternate formulation) The three keys to keeping a new human space program affordable and on schedule:
    1. No new launch vehicles.
    2. No new launch vehicles.
    3. Whatever you do, don't develop any new launch vehicles.
  41. (McBryan's Law) You can't make it better until you make it work.
  42. There's never enough time to do it right, but somehow, there's always enough time to do it over.
  43. Space is a completely unforgiving environment. If you screw up the engineering, somebody dies (and there's no partial credit because most of the analysis was right...)

Akin's Laws of Spacecraft Design

Saturday, April 21, 2018

The "No Free Lunch" Theorem

The "No Free Lunch" theorem was first published by  David Wolpert and William Macready in their 1996 paper "No Free Lunch Theorems for Optimization".

In computational complexity and optimization the no free lunch theorem is a result that states that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. No solution therefore offers a 'short cut'. 

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the "No Free Lunch Theorem" (NFL).

NFL states that no model is a priori guaranteed to work better. The only way to know for sure which model is the best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models.
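To make this concrete, here is a minimal sketch of "evaluate a few reasonable models" using scikit-learn cross-validation. The synthetic dataset, the two candidate models and the 5-fold split are my own illustrative assumptions, not something from Wolpert's paper.

# Compare a couple of reasonable candidate models on the same data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# NFL says neither model is a priori better; we have to measure.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))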




Sunday, April 15, 2018

Review : Focal Loss for Dense Object Detection

The paper Focal Loss for Dense Object Detection introduces a new self-balancing loss function that aims to address the huge class imbalance between foreground and background objects found in one-stage object detection networks.

y : binary ground-truth class in {+1, -1}
p : the model's estimated probability for the positive class (y = +1)

Given Cross Entropy (CE) loss for binary classification:
CE(p, y) =
-log(p) ,  if y = 1
-log(1 - p), if y = -1

The paper introduces the Focal Loss (FL) term as follows
FL(p,y) =
-(1-p)^gamma * log(p), if y = +1
-(p)^gamma * log(1-p), if y = -1

With gamma values typically ranging from 0 (which disables the focal term, recovering plain CE) to 2.
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss.
Easy examples are those where p is close to 1 for positives, or close to 0 for negatives.

Example 1
gamma = 2.0
p = 0.9
y = +1
FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.00045 
CE(0.9, +1) = - log(0.9) = 0.0457

Example 2
gamma = 2.0
p = 0.99
y = +1
FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436
CE(0.99, +1) = - log(0.99) = 0.00436

That means a near certainty (a very easy example) will have a very small FL compared to its cross-entropy loss, while an ambiguous result (p close to 0.5) will retain much more of its loss contribution.

In practice the authors use an α-balanced variant of FL:

FL(p, y) =
-α(y) * (1 - p)^gamma * log(p), if y = +1
-α(y) * (p)^gamma * log(1 - p), if y = -1

where α(y) is a class-dependent weighting factor that counteracts the class imbalance. This form yields slightly improved accuracy over the non-α-balanced form.
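To make the formula concrete, here is a minimal NumPy sketch of the per-example α-balanced focal loss. It is my own illustration, not the authors' reference implementation; gamma = 2 and alpha = 0.25 are the defaults recommended in the paper, and it uses natural logarithms while the worked examples above use base-10 logs, so the absolute values differ but the easy/hard ratios are the same.

import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # p     : predicted probability of the positive class (y = +1)
    # y     : ground-truth label in {+1, -1}
    # gamma : focusing parameter (gamma = 0 recovers plain cross entropy)
    # alpha : weight of the positive class (1 - alpha for the negative class)
    p_t = p if y == +1 else 1.0 - p          # probability assigned to the true class
    a_t = alpha if y == +1 else 1.0 - alpha  # class-dependent weight
    return -a_t * (1.0 - p_t) ** gamma * np.log(p_t)

# The two examples above, without alpha-balancing (alpha = 1.0):
print(focal_loss(0.9,  +1, gamma=2.0, alpha=1.0))   # ~1.1e-3
print(focal_loss(0.99, +1, gamma=2.0, alpha=1.0))   # ~1.0e-6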

The authors then build a network, RetinaNet, to show off the capabilities of their loss function. It is a standard Feature Pyramid Network (FPN) backbone with two subnetworks (one for object classification, one for box regression) attached at each feature map. It's a very common design for a one-stage detector, similar to SSD (edit: practically the same as SSD) and YOLO. Slight differentiations are the prior used when initializing the bias of the object classification subnet (a small sketch follows below) and the sparse calculation used when summing the total cost.
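Here is a small sketch of that bias prior: the paper initializes the last layer of the classification subnet so that the network starts out predicting a low foreground probability pi (the paper uses pi = 0.01), which keeps the loss from being dominated by the huge number of background anchors in the first iterations. The sigmoid check below is just for illustration.

import numpy as np

pi = 0.01                                  # desired initial foreground probability
bias_init = -np.log((1.0 - pi) / pi)       # ~ -4.595
print(1.0 / (1.0 + np.exp(-bias_init)))    # sigmoid(bias_init) ~ 0.01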



For a high level understanding of deep learning click here

Thursday, February 8, 2018

Critique on "Deep Learning: A Critical Appraisal "

Deep Learning: A Critical Appraisal 

Gary Marcus argues that deep learning is : 
1. Shallow : Meaning it has limited capacity for transfer 
2. Data Hungry: Requires millions of examples to generalize sufficiently
3. Not transparent enough: It is treated as a black box

I'm not an academic, but I've been reading research papers and I've seen a huge effort on all three fronts (kudos to https://blog.acolyer.org/).

There are new architectures and layers that require far less data and can be reused across several unrelated tasks.
There are also many "opening the black box" approaches, based on anything from MDL to information theory and statistics, for interpreting the weights, layers and results.

It's not all doom and gloom, but huge milestone jumps like the ones we had in the last 5 years in most AI/ML tasks are probably in the past. What we will see is a culling of a lot of bad tech and hype, and the quiet rise of Differentiable Neural Computing.



For a high level understanding of deep learning click here

Monday, January 29, 2018

Peter Thiel's 7 questions on startups

From Peter Thiel's "Zero to One: Notes on Startups".

All excellent questions before you start any venture:
  1. Engineering: Can you create breakthrough technology instead of incremental improvements?
  2. Timing: Is now the right time to start your particular business?
  3. Monopoly: Are you starting with a big share of a small market?
  4. People: Do you have the right team?
  5. Distribution: Do you have a way to not just create but deliver your product?
  6. Durability: Will your market position be defensible 10 and 20 years into the future?
  7. Secret: Have you identified a unique opportunity that others don’t see?


Wednesday, January 10, 2018

Compiling Tensorflow under Debian Linux with GPU support and CPU extensions


Tensorflow is a wonderful tool for Differentiable Neural Computing (DNC) and has enjoyed great success and market share in the Deep Learning arena. We usually use it with Python in a prebuilt fashion, installing from Anaconda or pip repositories. What we miss that way is the chance to enable optimizations that better use our processing capabilities, as well as the ability to do some lower-level computing using C/C++.

The purpose of this post is to be a guide for compiling Tensorflow r1.4 on Linux with CUDA GPU support and the high performance AVX and SSE CPU extensions.

This guide is largely based on the official Tensorflow Guide and this snippet with some bug fixes from my side.

1. Install Python dependencies:


sudo apt-get install python-numpy python-dev python-pip python-wheel python-setuptools

2. Install GPU prerequisites:
  • CUDA developer and drivers
  • CUDNN developer and runtime
  • CUBLAS
Make sure the cuDNN libs are copied inside the cuda/lib64 directory, usually found under /usr/local/cuda.


sudo apt-get install libcupti-dev

3. Install Bazel, Google's custom build tool:


sudo apt-get install openjdk-8-jdk

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list

curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -

sudo apt-get update && sudo apt-get install bazel

sudo apt-get upgrade bazel


4. Configure Tensorflow:


git clone https://github.com/tensorflow/tensorflow

cd tensorflow

git checkout r1.4

## don't use clang for nvcc backend [https://github.com/tensorflow/tensorflow/issues/11807] 
## when asked for the path to the gcc compiler, make sure it points to a version <= 5 
./configure


5. Compile with the SSE and AVX flags and install using pip:


# set locale to en_us [https://github.com/tensorflow/tensorflow/issues/36]

export LC_ALL=en_US.UTF-8

export LANG=en_US.UTF-8

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --incompatible_load_argument_is_label=false --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.1*


If you get a nasm broken-link error:
edit tensorflow/tensorflow/workspace.bzl and add an extra link:

urls = [
          "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",  
          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
      ]


6. Test that everything works:


cd ~/

python

>>> import tensorflow as tf

>>> session = tf.InteractiveSession()
>>> init = tf.global_variables_initializer()
## At this point, if you get a malloc.c assertion failure, it is due to a wrong CUDA configuration (i.e. not using the runtime version)

At this point there should not be any CPU warning and the GPU should be initialized.
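As an optional sanity check, you can ask TensorFlow 1.x to log device placement and run a small matmul; the log should list the ops mapped to a GPU device. The matrix sizes here are arbitrary.

>>> import tensorflow as tf
>>> config = tf.ConfigProto(log_device_placement=True)
>>> sess = tf.Session(config=config)
>>> a = tf.random_normal([1000, 1000])
>>> b = tf.random_normal([1000, 1000])
>>> sess.run(tf.matmul(a, b))
## The device placement log should show the MatMul op assigned to GPU:0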

Thursday, August 31, 2017

soviet iphone

I bought an iPhone for a contract app. When my Android phone died after 4 years of use, I thought I would use it instead. How different can it be?

It sure is shiny, fast, works well under stress and the battery lasts considerably longer BUT what you give up for convenience is flexibility. Try doing anything out of the ordinary, like changing your freaking ringtone to a custom tune.

I checked: it takes 12 steps. Part of that is installing the malware mess called iTunes.

So you DON'T OWN your phone. iTunes owns your phone and it will do whatever the hell it wants with it. The iPhone is your Soviet apartment and iTunes is your assigned commissar. You want something changed? Have fun complaining to him; you may end up with no apartment. I won't even go into the privacy issues because ... you know ... Google.

I'll still keep it because it cost me 600 euros, but only as a dumb media player.


Friday, August 11, 2017

Confucius on settings

For every project there's a configuration.

In pet projects it may be a handful of constants, and in bigger projects it may be a full server configuration.

The simple truth is that configuration keeps building up, so the more successful the project, the bigger the configuration.

It is wiser to structure it from the start than to collect everything midway.
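As a minimal sketch of what "structure it from the start" can look like in a small project, here is a single configuration module; the field names and defaults are purely illustrative.

# config.py - one place for all tunables, instead of constants scattered across modules
DEFAULTS = {
    "db_url": "sqlite:///local.db",
    "batch_size": 32,
    "log_level": "INFO",
}

def load_config(overrides=None):
    # Return the project configuration, optionally overridden per environment.
    config = dict(DEFAULTS)
    config.update(overrides or {})
    return config

# Usage: config = load_config({"log_level": "DEBUG"})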




Saturday, July 29, 2017

Yodiwo joins Endeavor network


Yodiwo has been officially accepted into the Endeavor network.
Entrepreneur: Alex Maniatopoulos
Company: Yodiwo
Description: What isn’t connected to the Internet nowadays? Though more systems are coming online, implementing Internet of Things (IoT) projects is incredibly complex, time-intensive, and costly. Engineers write thousands of lines of code in order to connect expensive smart devices, manage system workflows, and measure results. Yodiwo offers an affordable, code-free IoT application enablement platform plus customized solutions to expedite the process of constructing IoT applications and interconnected networks. Yodiwo’s proprietary three-tiered platform–Wisper, Cyan, and Alcyone–saves clients 90% on application development time, 30% on system operating costs, and 40-50% on capital expenditure.
After a rigorous multi-day interview, our CEO Alex Maniatopoulos convinced the panellists that Yodiwo has all it takes to join their successful network.

This great milestone opens up a brand new world of business opportunities for Yodiwo.


Thursday, July 6, 2017

Deep Learning : Why you should use gradient clipping

One new tip that I got reading Deep Learning is clipping gradients. It's been common knowledge amongst practitioners for years but somehow I missed it.

Problem

The problem with strongly nonlinear objective functions, such as those computed in recurrent or deep networks, is that their derivatives tend to be either very large or very small in magnitude.

The steep regions resemble cliffs and they result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether and undoing much of the work that had been done to reach the current solution.

The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside this tiny region, the cost function may begin to curve back upward. The update must be chosen to be small enough to avoid traversing too much upward curvature. 

Solution

One solution would be to use a very small learning rate. This is problematic, as it slows training and may settle in a sub-optimal region. A much better solution is clipping the gradient (or norm clipping). There are many instantiations of this idea, but the main concept is to cap the gradient at a maximum norm and, if the gradient exceeds that value, rescale its components so they fit within it. This retains the direction but limits the step size.
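A minimal, framework-agnostic sketch of the idea (clipping by the global norm) is below. It is my own illustration; most frameworks offer this built in (e.g. tf.clip_by_global_norm in TensorFlow, or the clip_gradients setting in the Caffe solver).

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale a list of gradient arrays so their global L2 norm is at most max_norm.
    # The update direction is preserved; only the step size shrinks.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        grads = [g * (max_norm / global_norm) for g in grads]
    return grads

# Hypothetical gradients for two parameter tensors (global norm = 13):
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=1.0)   # new global norm = 1.0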

Thoughts

If your job involves training a lot of deep learning models automatically, then you should eliminate any unpredictable steps that require manual labor. We are engineers after all, so whatever can be automated should be automated. For me the problem was the unpredictability of training: for a percentage of initializations the gradient would explode. The reactionary solution was to lower the learning rate, but that costs time and money. I wanted something that always works and thus can be automated. Gradient clipping worked nicely in this regard, and it allowed me to raise the learning rate so that training converged much faster.

Conclusion

Use gradient clipping everywhere; my default is to clip the norm at 1. In Caffe it is a single line in the solver, and if your framework doesn't support it, it is easy to implement yourself. You will save yourself enormous headaches and time.

* From the book "Deep Learning"

For a high level understanding of deep learning click here

Monday, July 3, 2017

Udacity AI nanodegree

I enrolled in Udacity's AI nanodegree 2 months ago and I just learned I was accepted.
I thought it would be a good refresher and maybe fill in some knowledge gaps I have.
The reviews on the net are pretty good, so I'm pretty sure it will be a great experience, especially since AI legends like Peter Norvig will be doing the teaching.

The curriculum consists of five parts:

  1.  Foundations of AI : In this Term, you'll learn the foundations of AI with Sebastian Thrun, Peter Norvig, and Thad Starner. We'll cover Game-Playing, Search, Optimization, Probabilistic AIs, and Hidden Markov Models. 
  2.  Deep Learning and Applications : In this term, you'll learn the cutting edge advancements of AI and Deep Learning. You'll get the chance to apply Deep Learning on a variety of different topics including Computer Vision, Speech, and Natural Language Processing. We'll cover Convolutional Neural Networks, Recurrent Neural Networks, and other advanced models. 
  3. Computer Vision : In this module, you will learn how to build intelligent systems that can see and understand the world using Computer Vision. You'll learn fundamental techniques for tasks like Object Recognition, Face Detection, Video Analysis, etc., and integrate classic methods with more modern Convolutional Neural Networks.
  4. Natural Language Processing : In this module, you will build end-to-end Natural Language Processing pipelines, starting from text processing, to feature extraction and modeling for different tasks such as Sentiment Analysis, Spam Detection and Machine Translation. You'll also learn how to design Recurrent Neural Networks for challenging NLP applications.
  5. Voice User Interfaces : This module will help you get started in the exciting and fast-growing area of designing Voice User Interfaces! You'll learn how to build Conversational Agents for products and services more natural to interact with. You will also dive deeper into the core challenge of Speech Recognition, applying Recurrent Neural Networks to solve it.
In my projects so far I've mostly tackled Computer Vision and Predictive Analytics problems, so it would be a nice change to dive into NLP and Voice processing.
I hope I can fit it into my busy schedule, and I'll try to write some posts describing the experience for any future students.

Wednesday, June 14, 2017

Classic Machine Learning Literature

I'm often asked by software engineers what to read to get into the Machine Learning world.
For that purpose I've compiled a list of Machine Learning and Applied Mathematics books that I've used to gain a deeper understanding.

Machine Learning

We start off with the classic but very dated "Machine Learning" by Tom M. Mitchell.
This was the first one I read on the subject. Low on math, high on intuition, it is a decent introductory book. You can easily implement most of the algorithms described and get a fair understanding of what's going on. For the first couple of years in the business you may use it as a basic reference, but after that you will need the math-heavy books.

Pattern Classification

We continue with my personal favorite, "Pattern Classification" 2nd edition by Richard O. Duda, Peter E. Hart and David G. Stork. This impressive book is heavy on applied math, low on proofs and very readable. It serves beginners as well as experienced machine learning engineers, and it builds very good intuition and understanding. The graphs and figures help a lot. I still use it as a reference on some issues.






Pattern Recognition and Machine Learning

A natural extension of "Pattern Classification" is the excellent "Pattern Recognition and Machine Learning" by Bishop. Somewhat heavy on the math, it provides a clear path to understanding, but it is not for newbies; you should come into this book with some experience. It is still very relevant, with a great introduction to matrix calculus and probability theory.


Probabilistic Graphical Models

Going deeper, I refer to "Probabilistic Graphical Models". This is a subdomain of Machine Learning and it is not for the faint of heart. This massive book is hard, and I mean eyes-glazing, concentrate-and-get-a-headache hard. If you manage to get through it you will have a greater understanding than most mortals. If, however, you are like me, you will just sample some of the parts and leave the rest for the PhDs.

Deep Learning

A new book that has gained classic status very fast is "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I found it very approachable, and it left me with a better understanding of deep learning. Very light on math, it concentrates on intuition and best practices rather than proofs. Highly recommended for all DL practitioners.
For a high level understanding of deep learning click here




Back to basics books

Numerical Recipes

Most books rely heavily on linear algebra, probability theory and algorithm "primitives". If you really want to know what's under the hood, you should check this out.




Statistical Digital Signal Processing and Modeling

Before the Machine Learning and AI hype there was simply DSP.


Artificial Intelligence: A Modern Approach

A general-purpose AI book. Lots of good content, ideas, algorithms and thought process, if a bit dated. I used the second edition; apparently the latest one is a bit better.

Matrix Computations

If you really, really want to reinvent the wheel, and by wheel I mean the super-fast BLAS primitives usually found in LAPACK and its variants, look no further than here.