In this post, we will understand how a CPU executes instructions on a high level by illustrating the instruction cycle using a step-by-step example. We also cover interrupts and how they affect the instruction cycle.

A CPU executes instructions using a cycle of steps known as the instruction cycle.

The instruction cycle consists of three steps: fetching, decoding, and executing instructions. It is, therefore, also known as the fetch-decode-execute cycle.

- The fetch operation retrieves the instruction from memory and moves it to the CPU.
- The decode operation unpacks the instruction so that the CPU can understand what to do.
- The execute operation carries out the instruction.

Sometimes the cycle is described as consisting only of the fetch and execute operations.

This cycle is repeated continuously as long as the computer is running. If it stops, the computer has either been turned off, or it has crashed.

In the blog post on the von Neumann Architecture, we established that the CPU consists of a control unit for processing the instructions sent to the CPU, the arithmetic logic unit for performing the operations specified in the instructions, and registers for storing instructions and data that are immediately required by the CPU.

The fetch-decode-execute cycle makes use of these components in addition to the memory unit.

In the fetch step of the cycle, the instructions are retrieved from the memory unit (RAM) and stored in the registers on the CPU. Next, the control unit decodes the instructions, which are then executed by the arithmetic and logic unit. The results of the instruction execution are sent back to RAM for storage, and the next instruction cycle begins.

The number of instruction cycles a CPU can execute per second is stated as its clock speed and measured in Hertz. If a CPU has a clock speed of 2,700,000,000 Hertz, or 2.7 GHz, it executes 2.7 billion instruction cycles per second.
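As a quick back-of-the-envelope calculation (treating one clock cycle as one instruction cycle, which real pipelined CPUs do not strictly follow), we can work out how long a single cycle takes:

```java
public class ClockSpeed {
    // Nanoseconds per cycle for a given clock speed in Hertz.
    static double nanosPerCycle(double hertz) {
        return 1e9 / hertz;
    }

    public static void main(String[] args) {
        double hz = 2_700_000_000.0; // 2.7 GHz
        System.out.printf("One cycle takes about %.3f ns%n", nanosPerCycle(hz));
    }
}
```

At 2.7 GHz, a single cycle takes roughly a third of a nanosecond.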

In the following section, we will walk through the operations performed during the instruction cycle. Recall that a CPU has several different registers:

- Program counter
- Memory address register
- Memory data register
- Current instruction register
- Accumulator

For an explanation of what these registers do, check out my post on von Neumann architecture.

The instruction cycle begins with the fetch operation. The program counter keeps track of the next instruction to be processed. A fetch operation starts with the memory address of the next instruction held in the program counter. The processor transfers this address from the program counter to the memory address register and subsequently loads the data stored at that memory location into the memory data register. The program counter is automatically incremented to the next memory location unless the current instruction explicitly specifies a different memory location for the next instruction.

Let’s see how that works in practice using a concrete example:

- The program counter initially points to the memory address 001.
- The memory address 001 is loaded into the memory address register by the processor.
- The processor next retrieves the instruction stored at memory address 001 and loads it into the memory data register.
- Since the data contains the instruction “Get 203”, it is forwarded to the current instruction register.
- The program counter is incremented by 1, pointing to 002.
- The instruction in the current instruction register is forwarded to the control unit.
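The register dance above can be sketched as a toy model in Java (the memory contents and the plain-English instruction strings are illustrative; a real CPU stores binary-encoded instructions):

```java
import java.util.Map;

public class FetchCycle {
    int pc = 1;      // program counter, initially pointing at address 001
    int mar;         // memory address register
    String mdr;      // memory data register
    String cir;      // current instruction register

    // Toy memory: addresses mapped to plain-English instructions.
    Map<Integer, String> memory = Map.of(1, "Get 203", 2, "Add 204");

    // One fetch: PC -> MAR, memory[MAR] -> MDR, MDR -> CIR, PC incremented.
    String fetch() {
        mar = pc;              // copy the address into the memory address register
        mdr = memory.get(mar); // load the contents stored at that address
        cir = mdr;             // the contents are an instruction, so forward them to the CIR
        pc++;                  // point at the next instruction
        return cir;            // the CIR contents go on to the control unit
    }

    public static void main(String[] args) {
        FetchCycle cpu = new FetchCycle();
        System.out.println(cpu.fetch()); // prints "Get 203"
        System.out.println("PC now points to " + cpu.pc);
    }
}
```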

After the fetch operation, the instruction cycle continues with the decode and execute portions. During the fetch stage, the control unit has been supplied with the instruction. It now needs to decode the instruction so that the processor can understand what to do next. In our example, I’ve supplied the instruction in plain English, such as “Get 203”, which tells the processor to get the piece of data stored at memory location 203. In memory, the instruction is encoded in binary. For example, with 16-bit instructions, the first 4 bits may encode the operation to be performed, while the remaining 12 bits specify the address from which to load the data.
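Splitting an instruction word like this comes down to a few bit operations. Here is a sketch in Java (the 4-bit/12-bit split and the opcode value for “Get” are assumptions for illustration, not a real instruction set):

```java
public class Decode {
    // Split a 16-bit instruction word into a 4-bit opcode and a 12-bit address.
    static int opcode(int word)  { return (word >> 12) & 0xF; }
    static int address(int word) { return word & 0xFFF; }

    public static void main(String[] args) {
        // Hypothetical encoding: opcode 0001 means "Get", operand address is 203.
        int word = (0b0001 << 12) | 203;
        System.out.println("opcode=" + opcode(word) + ", address=" + address(word));
    }
}
```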

Lastly, the processor will execute the instruction supplied. So if the instruction is to get some other piece of data, the “execute” action will consist of retrieving the data from the supplied memory address and storing it in the appropriate register. If the instruction specifies a calculation such as adding two numbers, the execution of the calculation will be handed off to the arithmetic and logic unit (ALU).

Let’s continue with our concrete example:

- The control unit decodes the instruction and tells the processor to go to memory address 203 and fetch the piece of information stored there.
- The address 203 is stored in the memory address register.
- The data, the number 4, is stored in the memory data register.
- Since the data is a number that will be needed for a future operation and not another instruction, the number is stored in the accumulator.

This concludes the first fetch-decode-execute cycle. The processor starts the next cycle by fetching the instruction at the address stored in the program counter.

The fetching process is the same as in the previous cycle. This time, the instruction tells the processor to add the number stored at memory location 204 to the number currently stored in the accumulator.

After fetching the instruction, the processor retrieves the number stored at memory address 204 and places it in the accumulator while the previously stored number is forwarded to the arithmetic and logic unit (ALU).

Then, the number 3, which was just fetched from address 204, is also forwarded to the arithmetic and logic unit, where the addition specified in the instruction is performed. Finally, the result, 7, is returned to the accumulator, where it will sit until the next instruction is executed.
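The two cycles of our example can be condensed into a small simulator (a sketch; the GET/ADD mnemonics and the memory contents mirror the example above, not any real instruction set):

```java
import java.util.List;
import java.util.Map;

public class TinyCpu {
    int pc = 0;      // index of the next instruction in the program below
    int accumulator; // holds intermediate results

    // The two-instruction program and the data from the example.
    List<String> program = List.of("GET 203", "ADD 204");
    Map<Integer, Integer> data = Map.of(203, 4, 204, 3);

    // Run one full fetch-decode-execute cycle.
    void step() {
        String instruction = program.get(pc);    // fetch
        pc++;
        String[] parts = instruction.split(" "); // decode into opcode and address
        int operand = data.get(Integer.parseInt(parts[1]));
        switch (parts[0]) {                      // execute
            case "GET" -> accumulator = operand;  // load the value into the accumulator
            case "ADD" -> accumulator += operand; // the ALU performs the addition
        }
    }

    public static void main(String[] args) {
        TinyCpu cpu = new TinyCpu();
        cpu.step(); // GET 203: accumulator holds 4
        cpu.step(); // ADD 204: accumulator holds 4 + 3 = 7
        System.out.println("Accumulator: " + cpu.accumulator);
    }
}
```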

In this example, we’ve used two instruction cycles to perform the addition. But modern processors may also load several pieces of data and perform calculations in one cycle.

As the term implies, an interrupt is a mechanism by which the normal course of actions of the processor is interrupted. This may be necessary for a variety of reasons, such as hardware failure or waiting for an I/O operation to complete.

Interrupts are part of a broader class of events known as exceptions. Exceptions essentially handle cases when the CPU encounters conditions that interfere with normal processing.

The main utility of interrupts lies in their ability to improve efficiency. Performing I/O operations is usually orders of magnitude slower than normal processing. If the computer had to communicate with an external device attached via USB, such as a flash drive, without the use of interrupts, the processor would have to wait until the I/O operation completes. The processor would spend thousands of instruction cycles just polling the peripheral device, asking whether it was done, without doing any useful work.

To make processing more efficient, the processor can receive an interrupt signal from the I/O device enabling it to work on something that is unrelated to the I/O operation while that operation is in progress.

Once the I/O device is done with its operations and requires communication with the processor, it sends an interrupt request signal to the processor. The processor then interrupts the execution of its current program and services the I/O device. This is achieved via a special routine known as the interrupt handler. When the processor is finished with the I/O processing, it returns to the original process.

Once an interrupt signal arrives, the processor has to perform a series of steps to handle the interrupt and continue processing:

- The CPU needs to save the current context as it exists in the registers to memory. Some processor architectures push the context onto a stack and then pop it off the stack. That way, they can restore the previous context in reverse order.
- Secondly, the CPU needs to retrieve the instructions of the interrupt handler from memory. The interrupt handler is basically a set of instructions stored in memory. Each type of interrupt has its own associated set of instructions.
- The CPU executes the instructions specified by the interrupt handler
- After concluding the operations specified by the interrupt handler, the CPU needs to restore the context of its previous operations by loading the associated instructions and data into the registers
- The CPU continues the previous flow of operations.
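The save/handle/restore sequence above can be illustrated with a stack (a minimal sketch; the context here is reduced to just the program counter, and the handler address is made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class InterruptSketch {
    int pc = 42;                                    // the interrupted program's position
    Deque<Integer> contextStack = new ArrayDeque<>();

    void handleInterrupt(int handlerAddress) {
        contextStack.push(pc);   // 1. save the current context
        pc = handlerAddress;     // 2.-3. jump to and execute the handler's instructions
        pc = contextStack.pop(); // 4. restore the previous context
        // 5. ...and continue the previous flow of operations
    }

    public static void main(String[] args) {
        InterruptSketch cpu = new InterruptSketch();
        cpu.handleInterrupt(1000);
        System.out.println("Resumed at instruction " + cpu.pc); // back at 42
    }
}
```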

To include the handling of interrupts in the instruction cycle, an additional interrupt cycle is added to it.

If multiple interrupts occur, there are essentially two options.

If an interrupt is currently being handled and a second interrupt occurs, the processor can disable further interrupts, queue the second and all subsequent interrupts, and handle them sequentially once the current handler finishes. This has the disadvantage that we cannot prioritize interrupts. If an I/O device that causes a notoriously long interrupt, like a printer, is currently being serviced, all other interrupts have to wait.

Alternatively, interrupts can be associated with priorities. If an interrupt with a higher priority were to occur while a lower priority interrupt is being handled, the lower priority interrupt would itself be interrupted. The processor then would handle the higher-priority interrupt first before turning back to the lower-priority one. Naturally, the second approach engenders more complexity but is usually more efficient.
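The priority-based option can be modeled with a priority queue (a sketch; the device names and the convention that a lower number means a higher priority are assumptions of this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class InterruptPriorities {
    record Interrupt(String source, int priority) {}

    // Service all pending interrupts, highest priority (lowest number) first.
    static List<String> serviceAll(List<Interrupt> pending) {
        PriorityQueue<Interrupt> queue = new PriorityQueue<>(
                (a, b) -> Integer.compare(a.priority(), b.priority()));
        queue.addAll(pending);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().source()); // run the highest-priority handler next
        }
        return order;
    }

    public static void main(String[] args) {
        List<Interrupt> pending = List.of(
                new Interrupt("printer", 5),
                new Interrupt("disk", 2),
                new Interrupt("power failure", 0));
        System.out.println(serviceAll(pending));
    }
}
```

Even though the printer interrupt arrived first, the power failure is serviced before it.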

So far, we have focused on interrupts as caused by I/O devices. In fact, there are several reasons for a processor to interrupt its course of action leading to different types of interrupt handlers.

A computer relies on electricity. If there is a power outage or a component overheats, the processor needs to be able to handle the resulting hardware failure. The hardware failure handler also kicks in when there is an inconsistency in memory access: for example, if a piece of data read from memory differs from what was stored there, the inconsistency may cause system crashes.

Interrupts may be generated by the processor on a regular basis to perform updates or other functions that may be necessary.

If an error occurs during the execution of a program, the program itself can trigger an interrupt. If you are a programmer, you probably have run into buffer overflow errors or other errors generated when executing your program. These errors are triggered when your program attempts to do something that the processor cannot or will not handle. In that case, the processor will generate an interrupt or an exception. In fact, good programmers anticipate potential modes of failure and handle these through exceptions in their code.

The controller of an I/O device can trigger interrupts as described previously, either to request service from the processor, to signal normal completion of an I/O operation, or to indicate an error condition.

The post How does a CPU Execute Instructions: Understanding Instruction Cycles first appeared on Programmathically.

The von Neumann architecture is a foundational computer hardware architecture that most modern computer systems are built upon. It consists of the control unit, the arithmetic and logic unit, the memory unit, registers, and input and output devices.

The key features of the von Neumann architecture are:

- Data as well as the program operating on that data are stored in a single read-write memory unit.
- All contents stored in memory are addressable by storage location
- Execution of program instructions occurs in a sequential manner

Until the von Neumann architecture emerged, a computer’s ability to execute a program was directly tied to its hardware configuration. In order to change the program, parts of the computer’s hardware had to be rebuilt and reorganized, which entailed a lot of work and effort.

One of the first programmable digital computers, the Colossus, was developed at Bletchley Park in the UK during World War II, where Alan Turing and his colleagues worked on breaking German ciphers. At that time, the British government was looking for a way to crack German radio communications. The Germans and their allies were communicating via wireless radio signals using Morse code. The signals were easy to intercept, but for a long time they could not be read by the Allies because they were encrypted, famously with a device known as the Enigma machine.

The Colossus was designed by Tommy Flowers, an engineer at the General Post Office, and was used to break the Lorenz cipher employed by the German High Command. Turing, for his part, designed the electromechanical Bombe that was used to decipher intercepted Enigma messages.

The Colossus went through several iterations that were kept secret for a long time. Its architecture was not very convenient for general-purpose computing, since it could not store program instructions. Instructions were programmed through hardware manipulation in the form of switches and plugs.

In 1945, John von Neumann, one of the luminaries of 20th-century computer science and mathematics, proposed an architecture that would allow for re-programmability without changes to the computer’s hardware. The basic idea was that not only the data but also the program instructions themselves would be stored in digital memory rather than being expressed through hardware.

The instructions would be communicated to the unit performing the calculations through control signals rather than switches and plugs. This architecture requires an additional module to translate the instructions as received through the input device into control signals. These control signals along with the data would then be transmitted to a general purpose arithmetic and logic unit which would perform the calculations as instructed by the signals.

The instructions are provided to the module in the form of codes, which became known as software. Instead of rebuilding the hardware each time, the programmer just had to change the code sent to the interpreter.

On a high level, a computer system consists of three main components:

- A CPU or Central Processing Unit to perform operations and calculations
- Memory to hold data and instructions
- Input/Output devices to send instructions and data to the computer and receive output generated by the computer.

The central processing unit is the core part of the computer where programs are executed, and data is processed. It consists of several components.

The arithmetic logic unit is responsible for performing the calculations and applying the rules provided by a program to the data. At a fundamental level, the ALU performs basic mathematical operations and boolean logic such as addition, subtraction, and comparisons like greater than, smaller than or equal to.

The control unit processes the program instructions and tells the arithmetic logic unit what to do on the basis of the instructions. It is also responsible for controlling how data moves around the computer system.

Registers are a type of short-term memory that is directly integrated into the CPU. They are used to store instructions and data that are immediately used by the CPU. The CPU usually has the following registers:

- The **Program Counter** (PC) tracks the main memory address of the next program instruction that the CPU requires from the main memory and passes it to the memory address register.
- The **Memory Address Register** (MAR) holds the location in the memory unit that is about to be fetched from or written to. The memory address register allows the CPU to know where to retrieve the input or write the output of the current instruction. It only stores addresses in memory, not data or instructions.
- The **Memory Data Register** (MDR) temporarily stores the contents of main memory at the address held by the memory address register. Contrary to the memory address register, the memory data register stores actual data and instructions. If the contents contain an instruction, they are passed on to the current instruction register.
- The **Current Instruction Register** (CIR) stores the most recent instruction before it is executed by the CPU.
- The **Accumulator** (AC) stores the results produced by the arithmetic logic unit.

The cache is an intermediate memory between RAM and the registers. Fetching data from the separate memory unit takes a long time. If computers fetched data directly from memory into the CPU for every small step of a calculation, the CPU would spend a lot of time doing nothing, waiting for the data to arrive from the memory unit. The cache is built directly into the CPU and is much faster than RAM, but also much more expensive to manufacture. It usually stores only the data and instructions most frequently used by the CPU.

As the name implies, the job of the memory unit is to store all the data and instructions that are required by the computer. Main memory holds Gigabytes worth of information in modern computers. It is also referred to as random access memory because the information stored in it can be accessed in any order. RAM can only be used for temporary storage since all the information is wiped out once the computer is shut off. Programs and data that need to survive for more than one session are therefore stored on secondary storage such as hard drives and are loaded into RAM once the computer needs to use them.

There is a second type of memory that stores information across sessions known as ROM or read-only memory. As the name implies, no new information can be written to ROM. ROM is used to store things like the startup program for your computer that should not be changed but has to be quickly accessible once the machine boots up.

**A Note on Secondary Storage**

Secondary Storage refers to things like hard drives that are used to store information across boot cycles. Strictly speaking, they are external devices and therefore not part of the classical von Neumann architecture. Accordingly, I won’t cover them in this post.

Buses are lanes that allow instructions and data to travel between components.

We distinguish between three types of buses:

- The **control bus** is used to transfer signals from the CPU to the memory and external I/O devices, initiating read and write operations and ensuring that the operations are properly synchronized.
- Via the **address bus**, the CPU sends memory addresses instructing the memory unit or secondary storage where it wants information to be written to or read from.
- The **data bus** carries data between the CPU, the memory, and I/O devices. When performing write operations, the data travels from the CPU to the memory unit and I/O devices. When performing read operations, the data moves from memory and I/O devices to the CPU.

The von Neumann architecture describes a way of building computers that makes it possible to program computers through software without the need to adjust the hardware for every new program.

But how do the programmers of the software know that the hardware can handle their instructions in software?

The instruction set architecture is an abstract machine model that defines the set of instructions that the processor can handle. These definitions include things like data types, types of operations and how they are executed, and execution semantics such as how to deal with interrupts. Thus, if two computers have the same instruction set architecture, the programmer knows that the software that runs on one computer will also run on the other one.

The microarchitecture describes how the instruction set architecture is implemented at a hardware level. It is concerned with the physical design of the processor such as the bus widths, size of the cache as well as conceptual questions like the ordering of executions. The job of the microarchitecture is to optimize performance across metrics such as speed, cost, and energy consumption.

The instruction set architecture tells the programmer whether the computer is compatible with the types of instructions that his or her software wants to execute. The microarchitecture tells the programmer whether the required performance can be achieved.

In von Neumann architectures, the power of the CPU and the size of memory cannot be fully exploited because the data cannot be transported fast enough between memory and the CPU. This limitation is known as the von Neumann bottleneck.

This problem arises because transfer between memory and the CPU is inherently sequential. We can try to increase the clock speed or the bus size to execute data transfer faster and at greater volumes. But that way we will very quickly run into some hard physical limitations.

Modern CPUs can usually process data at a much faster rate than it can be transferred to or from the CPU. As a consequence, the CPU will spend a lot of time idle.

The post What is the Von Neumann Architecture first appeared on Programmathically.

In this post, we will understand the Liskov substitution principle and illustrate how it works with an extended example in Java.

**The Liskov substitution principle states that an object of a superclass should be replaceable with an object of any of its subclasses. It is one of the SOLID design principles in object-oriented software engineering.**

The ability to replace a superclass with any of its subclasses may sound trivial at first to anyone familiar with object-oriented design. After all, an inheritance hierarchy is created to enable subclasses to inherit functionality from superclasses. Doesn’t that imply that subclasses have at least the same functionality as their superclasses and can therefore replace them? Many superclasses are abstract, so you can only use subclasses when the compiler expects an object of the superclass. So why do we need the Liskov substitution principle?

Building a class hierarchy of a superclass and subclasses that implement the superclass’s methods does not mean the abstractions used in the inheritance hierarchy are logically sound. The Liskov substitution principle helps ensure that the abstraction hierarchies you build are logically sound.

Imagine we are building a marketplace website for used vehicles. People can sell and buy all sorts of vehicles, from bicycles to race cars.

To avoid code duplication, we create a superclass called vehicle. All vehicles share some functionality on our site. For example, you can access their age and number of previous owners. Accordingly, we can add these methods to the superclass.

Here is a simple implementation of the vehicle class in pseudocode:

```java
class Vehicle {
    getAge() {
        return age;
    }
    getNoPrevOwners() {
        return noOwners;
    }
}
```

So far, so good. The individual subclasses do not have to have their own implementations of the getAge and getNoPrevOwners methods, which significantly reduces code overhead.

When we want to launch the first version of the platform, we realize that all the individual subclasses for the vehicles we currently have in our system need some method to access the fuel requirements. To reduce code, we decide to add a method for accessing fuel requirements per 100 miles to the vehicle superclass.

```java
class Vehicle {
    getAge() {
        return age;
    }
    getNoPrevOwners() {
        return noOwners;
    }
    get100MileFuelRequirements() {
        return gallonsRequired;
    }
}
```

In our next release cycle, we also add the functionality to purchase bicycles. Bicycles, of course, do not require fuel. We may naively think that that’s not a problem because the bicycle class simply doesn’t need to implement the method for adding fuel.

But that would be a violation of the Liskov substitution principle. Since the bicycle class does not implement the method for checking fuel requirements, it cannot replace an object of its superclass.

**Why is this a problem?**

If we make a class a subclass of another class, clients using the code can expect the subclass objects to behave like the superclass objects and to conform to the contract established by the superclass. In our specific case, a client using a bicycle object may try to call the get100MileFuelRequirements method. He would even be able to do so since the method exists in the superclass.

Let’s use the same example covered above and implement it in Java to see how the Liskov substitution principle works in practice. We start by creating a class for vehicles constituting our superclass.

```java
public abstract class Vehicle {
    public void getAge() {
        System.out.println("Age: ");
    }

    public void getNoPreviousOwners() {
        System.out.println("Previous Owners:");
    }

    public void get100MileFuelRequirements() {
        System.out.println("Required Fuel:");
    }
}
```

Naturally, nobody will purchase a generic vehicle, so we make the class abstract. At runtime, we expect only subclass implementations because everyone on the site will want to purchase a specific car or bike. However, we do want the vehicle class to contain non-abstract methods so that we can implement functionality that is shared across subclasses, saving us some work later on.

The most common type of vehicle is a car, so we implement the car class first.

The methods in the vehicle class already contain some functionality to print a generic string that is not bound to the concrete implementation. To make use of that functionality in the car class, we override the methods of its superclass, call the implementation in the superclass, and extend them with some code specific to that car.

```java
public class Car extends Vehicle {
    @Override
    public void getAge() {
        super.getAge();
        // go into the database and retrieve the age of the concrete car
        System.out.println("2 years");
    }

    @Override
    public void getNoPreviousOwners() {
        super.getNoPreviousOwners();
        // go into the database and retrieve the number of previous owners
        System.out.println("1 prev. owner");
    }

    @Override
    public void get100MileFuelRequirements() {
        super.get100MileFuelRequirements();
        // go into the database and retrieve the 100 mile fuel requirement
        System.out.println("4 gallons");
    }
}
```

When we call these methods, they should print a generic string from the superclass, followed by the specific information for the object at hand. Since I’ve written this code purely for demonstration, I’ve hardcoded the specific information in the subclass. In a real system, you would probably retrieve that information from a database.

Let’s compile and run the code to check our assumptions. To do that, we create an additional class called Control and put the main method in it. We create an object of the car class and call its methods.

```java
public class Control {
    public static void main(String[] args) {
        Vehicle car = new Car();
        car.getAge();
        car.getNoPreviousOwners();
        car.get100MileFuelRequirements();
    }
}
```

For each method, we should get the string printed by its implementation in the vehicle class, followed by the string printed in the overriding subclass.

So far, so good. The car class makes use of all the methods implemented in the vehicle class, and the implementations work as expected.

As a next step, we add the bicycle subclass. A bicycle does not require fuel, so we simply skip the implementation of the get100MileFuelRequirements() method, naively thinking that this is not a problem.

```java
public class Bicycle extends Vehicle {
    @Override
    public void getAge() {
        super.getAge();
    }

    @Override
    public void getNoPreviousOwners() {
        super.getNoPreviousOwners();
    }
}
```

But the method for retrieving fuel requirements has an implementation in the vehicle class. Since the bicycle class extends the vehicle class, clients using the vehicle class can call that implementation on the bicycle class.

```java
public static void main(String[] args) {
    Vehicle bike = new Bicycle();
    bike.get100MileFuelRequirements();
}
```

The code compiles and runs without any problems, but leaves us with a meaningless fragment of text. In this case, the customer will just receive a confusing message, but the results could be worse.

The problem here is that we have chosen the wrong abstraction when implementing the vehicle class.

To conform to the Liskov substitution principle, we have to make sure that all subclasses actually require implementations of the methods we put into the vehicle class. That’s not the case here. There are vehicles, such as bicycles, that do not require fuel, and therefore the get100MileFuelRequirements() method should not be in the vehicle class.

Instead, we probably require an intermediate abstraction such as a “fuelable vehicle” that specifically covers only vehicles requiring fuel.

We remove the get100MileFuelRequirements() method from the vehicle class and move it into a new class FuelableVehicle.

```java
public abstract class Vehicle {
    public void getAge() {
        System.out.println("Age: ");
    }

    public void getNoPreviousOwners() {
        System.out.println("Previous Owners:");
    }
}
```

```java
public abstract class FuelableVehicle extends Vehicle {
    public void get100MileFuelRequirements() {
        System.out.println("Required Fuel:");
    }
}
```

The fuel-able vehicle extends the vehicle class so we can still make use of the methods that are common to all vehicles. Since the fuel-able vehicle still represents an abstraction and not a concrete vehicle that can be purchased, we make the class abstract.

Next, we have the car class conform to the fuel-able vehicle superclass:

```java
public class Car extends FuelableVehicle {
    ...
}
```

The bicycle remains a direct subclass of the vehicle class unless we are able to identify a subgroup of vehicles with specific functionality to which the bicycle class belongs. Now our abstraction hierarchy is logically sound because the vehicle class only contains functionality that all the concrete vehicles implement. The same is true for fuel-able vehicles.

The post What is the Liskov Substitution Principle: An Explanation with Examples in Java first appeared on Programmathically.

The open-closed principle states that classes and modules in a software system should be open for extension but closed for modification. It is one of the SOLID principles for software design.

In this post, we will build a step-by-step understanding of the open/closed principle using examples in Java.

Software systems that are not carefully designed tend to develop an increasing number of unwanted interdependencies. Often these interdependencies are accidental, and they accumulate to such an extent that it becomes increasingly hard to keep track of them. Changes in one part of such a codebase often necessitate changes somewhere else. These changes, in turn, require modifications in another part of the system, leading to a domino effect. If you have to change code in various parts of the codebase to fulfill a new requirement, your system will become unstable and increasingly behave in a non-deterministic manner.

The open/closed principle helps you avoid such a scenario by requiring you to design your system in a manner that makes modifications largely unnecessary. All extensions to the functionality of the system should be achieved by adding new code rather than changing existing code. This way, you are much less likely to run into unwanted ripple effects that necessitate changes in several parts of the system and lead to instability.

This all sounds very abstract. You probably want to know how you can design a software system in a manner that avoids modifications to existing code. Let’s jump into an example to make this concrete.

Suppose a developer is tasked with extending the functionality of a software system for managing a bookstore. The system contains a **BookQuery** class that contains methods to retrieve information about the books currently available in that store. I’ve used the same example in the post on the single responsibility principle. In that post, our developer has had to learn the hard way that he should not put the logic for the accounting team into the BookQuery class. This time, the developer’s job is to add functionality that allows clerks to distinguish between different editions of a book. The BookQuery class contains a method to retrieve the price of a book by title.

```java
getBookPricebyTitle(String title)
```

Having learned his lesson about the single responsibility principle, the developer now thinks to himself: The book query class is responsible only to the clerk. Now the clerk’s requirements have changed, so it makes sense to change the code in the book query class. Different editions or the hardcover and paperback versions of the same book could have different prices. Therefore, I need to change the method getBookPriceByTitle to reflect these differences. The most straightforward way to do this is to simply add two more arguments to the function call:

```java
getBookPricebyTitle(String title, String edition, Enum bookType)
```

The enum book type distinguishes between paperback, hardcover, and ebook versions. The edition is passed as a string. Subsequently, the developer changes the database call inside the function to include the edition and the book type in addition to the title.

He happily pushes the changes into production and informs the frontend development team. The next day, he checks his voicemail. It is full of messages urging him to call back. Book clerks across the country are unable to retrieve book prices; they just receive an error when they try. You can probably imagine what happened here: the clients used by the clerks still rely on the old method. The developer changed it and simply relied on the frontend team to adjust the client to the new method signature.

To avoid the problems described in the previous section, the developer should not touch the original method that retrieves the book price by title. If he does, he will immediately break the code of all systems that rely on this method. Instead, it is much safer to add a new method to the class which has the new signature. To be as expressive as possible and distinguish this new method clearly from the existing method, I call it “getBookPricebyTitleEditionType”.

```java
getBookPricebyTitleEditionType(String title, String edition, Enum bookType)
```

This is the open/closed principle in action. You’ve extended your existing class but did not modify the existing code in that class.

Some people might find this unappealing. Firstly, you have a really long and ugly name, and secondly, you are cluttering your class with additional code. If you do this several times, you could easily end up with a huge monster class that has dozens or hundreds of methods that are only marginally distinct from each other.

Regarding the first issue, the method name should clearly express what the method is doing. If you can make the name nice and short while explaining what the method does, that’s great. But remember that in software development expressivity trumps aesthetics.

Some languages like Java support method overloading, which lets you define multiple methods with the same name but different parameter signatures. If your language supports it, I suggest you make use of it: retain the same method name and change only the parameter signature. It will simplify your code because you can keep the name short.

```java
getBookPrice(String title)
getBookPrice(String title, String edition, Enum bookType)
```

The compiler will tell the client which overloads of the method are available, so you don’t have to encode that information in the method name.

The danger that your class spirals out of control exists regardless of whether you apply the open/closed principle. Applying the principle does not mean that you should pile method upon method in your class without cleaning up what is already there. On the contrary, noticing that several methods do similar things should prompt you to update and clean up your code.
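One hypothetical cleanup, for instance, is to let the narrow overload delegate to the richer one so the lookup logic exists in only one place. The default values `"FIRST"` and `"PAPERBACK"` below are assumptions for illustration, and the method bodies are placeholders standing in for real database access:

```java
// Sketch: the single-argument overload delegates to the three-argument
// overload using assumed defaults, so the database lookup logic is not duplicated.
public class BookQuery {
    public String getBookPrice(String title) {
        // Delegate instead of duplicating the lookup logic.
        return getBookPrice(title, "FIRST", "PAPERBACK");
    }

    public String getBookPrice(String title, String edition, String bookType) {
        // database access magic happening here (placeholder)
        return "price";
    }
}
```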

You may realize that you can refactor the code in your system to rely only on the new method. As a consequence, you can eliminate the “getBookPricebyTitle” method altogether. In my interpretation, this would not constitute a violation of the open/closed principle since you are not modifying existing code but throwing out old, unused code (something you should absolutely do on a regular basis).

Of course, this is possible only if the BookQuery class is used purely internally by components that you and your team control. In a clean software system, the majority of your code should be internally facing. The external client should only interact with a few carefully selected methods.

When it comes to the external facing interface, sticking to the open/closed principle becomes even more important. You don’t want the connections of your customers to break. Especially not if they are paying customers.

The implementation above might become problematic because you still have to change your interface by adding the additional method. This might break external clients relying on the interface. To address this problem, you want to close your interface and your classes for modification.

To add new functionality without adding new code to existing classes, we need to add new classes with the desired functionality and make them conform to the existing interface. This way, we don’t even have to recompile and redeploy existing code.

In the case of our BookQuery functionality, we could address this problem by adding an interface that requires the implementation of a method to get a book’s price without any arguments.

```java
public interface BookQueryInterface {
    public String getBookPrice();
}
```

We then create two implementations of the BookQueryInterface and supply the required arguments through the constructors when we instantiate the classes.

```java
public class BookQuery implements BookQueryInterface {
    public BookQuery(String title) {}

    public String getBookPrice() {
        return "price";
    }
}
```

```java
public class BookQueryByEditionAndType implements BookQueryInterface {
    public BookQueryByEditionAndType(String title, String edition, String bookType) {}

    public String getBookPrice() {
        return "price";
    }
}
```

To add more flexibility, I would further split the last class into two separate classes. One of them is responsible for querying by edition, and the other one for querying by type. So in total, we now have three classes.

```java
public class BookQueryByEdition implements BookQueryInterface {
    public BookQueryByEdition(String title, String edition) {}

    public String getBookPrice() {
        return "price";
    }
}
```

```java
public class BookQueryByType implements BookQueryInterface {
    public BookQueryByType(String title, String bookType) {}

    public String getBookPrice() {
        return "price";
    }
}
```

Let’s now walk through the implementation in Java step by step. The developer works on a preexisting class that already has a method to obtain the book’s price by its title.

public class BookQuery implements BookQueryInterface { public String getBookPricebyTitle(String title) { //database access magic happening here return "price"; }; }

Instead of changing this method, we extend the class with another method that takes the edition and the book type as parameters.

public class BookQuery implements BookQueryInterface { public String getBookPricebyTitle(String title) { //database access magic happening here return "price"; }; public String getBookPricebyTitleEditionType(String title, String edition, String bookType) { //database access magic using title, edition, and book type, happening here return "price"; }; }

In Java, we can use method overloading to define methods with the same name, but different parameter signatures. The parameter signature already tells us what arguments are required to find the book and the compiler will enforce the correct usage of arguments. This means we can simplify the naming of our methods by removing the parameters from the method name.

```java
public class BookQuery {
    public String getBookPrice(String title) {
        // database access magic happening here
        return "price";
    }

    public String getBookPrice(String title, String edition, String bookType) {
        // database access magic using title, edition, and book type happening here
        return "price";
    }

    // private methods for database access..
}
```

To keep our software more maintainable, I highly recommend defining an interface for the book query class. This way, we can add different implementations of the book query functionality without disturbing the classes that depend on BookQuery.

```java
public interface BookQueryInterface {
    public String getBookPrice(String title);
    public String getBookPrice(String title, String edition, String bookType);
}
```

Lastly, we have the BookQuery class conform to the interface.

```java
public class BookQuery implements BookQueryInterface {
    @Override
    public String getBookPrice(String title) {
        // database access magic using title happening here
        return "price";
    }

    @Override
    public String getBookPrice(String title, String edition, String bookType) {
        // database access magic using title, edition, and book type happening here
        return "price";
    }

    // private methods for database access..
}
```

So far so good. We have added a method rather than modifying an existing method. So at a method level, we could say that we conform to the open/closed principle. But the basic building blocks of object-oriented software are classes. To make the code more modular, we want an unmodified interface and add new functionality by creating new classes conforming to that interface.

We start the refactoring by adding a constructor to the book query class that supplies the title. Then we change the first implementation of getBookPrice to rely on the class variable title and remove the second implementation of getBookPrice altogether.

```java
public class BookQuery implements BookQueryInterface {
    String title;

    public BookQuery(String title) {
        this.title = title;
    }

    @Override
    public String getBookPrice() {
        // database access magic using this.title happening here
        return "price";
    }

    // private methods for database access..
}
```

This code will not compile because we first need to adjust our interface:

```java
public interface BookQueryInterface {
    public String getBookPrice();
}
```

Now the interface is very simple, and we can implement the additional functionality to retrieve books by edition and type by creating new implementations of the BookQueryInterface. To make the code even more modular and flexible, I suggest creating separate classes for retrieving books by edition and by type.

Here is the implementation of BookQueryByEdition:

```java
public class BookQueryByEdition implements BookQueryInterface {
    String title;
    String edition;

    public BookQueryByEdition(String title, String edition) {
        this.title = title;
        this.edition = edition;
    }

    @Override
    public String getBookPrice() {
        // database access magic using this.title and this.edition happening here.
        return "price";
    }
}
```

And here is the implementation of BookQueryByType:

```java
public class BookQueryByType implements BookQueryInterface {
    String title;
    String bookType;

    public BookQueryByType(String title, String bookType) {
        this.title = title;
        this.bookType = bookType;
    }

    @Override
    public String getBookPrice() {
        // database access magic using this.title and this.bookType happening here.
        return "price";
    }
}
```
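To illustrate the payoff, here is a self-contained sketch of the final design. The class and interface names come from the walkthrough above, but the method bodies (and the `"price for …"` strings) are hypothetical placeholders standing in for real database access:

```java
import java.util.List;

// All three classes conform to the same interface, so client code can treat
// them uniformly; a new query variant can be added as a new class without
// touching the interface or any existing implementation.
interface BookQueryInterface {
    String getBookPrice();
}

class BookQuery implements BookQueryInterface {
    private final String title;
    BookQuery(String title) { this.title = title; }
    public String getBookPrice() { return "price for " + title; }
}

class BookQueryByEdition implements BookQueryInterface {
    private final String title, edition;
    BookQueryByEdition(String title, String edition) {
        this.title = title;
        this.edition = edition;
    }
    public String getBookPrice() { return "price for " + title + ", edition " + edition; }
}

class BookQueryByType implements BookQueryInterface {
    private final String title, bookType;
    BookQueryByType(String title, String bookType) {
        this.title = title;
        this.bookType = bookType;
    }
    public String getBookPrice() { return "price for " + title + ", type " + bookType; }
}

public class BookQueryClient {
    public static void main(String[] args) {
        List<BookQueryInterface> queries = List.of(
                new BookQuery("Dune"),
                new BookQueryByEdition("Dune", "First"),
                new BookQueryByType("Dune", "Paperback"));
        // The client neither knows nor cares which concrete class it talks to.
        for (BookQueryInterface q : queries) {
            System.out.println(q.getBookPrice());
        }
    }
}
```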

The post What is the Open/Closed Principle: An Explanation with Examples in Java first appeared on Programmathically.

The single responsibility principle (SRP) is often misinterpreted to stipulate that a module should only do one thing. While designing functions to have only one purpose is good software engineering practice, it is not what the single responsibility principle states.

**In a nutshell, the single responsibility principle is one of the SOLID principles for developing clean and modular software architectures. The single responsibility principle states that a module or class should only be responsible to one actor.**

The SOLID principles exist to help engineers and architects compose software systems out of cleanly separated components. The main goal is to avoid the dreaded spaghetti monolith in which everything depends on everything else, and changes have ripple effects across the entire system.

Imagine a module A is responsible to more than one other module or external stakeholder, meaning that more than one actor directly relies on its functionality. If the requirements of one of these actors change, module A has to change in response. But the other actors will still require the functionality of the original module A. So you pile on new functionality and perhaps expand the interface of module A in response to the new requirements while still maintaining the old functionality and interface for the other actors.

Now imagine another one of the dependent actors has a change in requirements. It’s not hard to see that module A will soon grow into a huge code monster in order to satisfy the changing requirements of many actors. This process of continuously expanding module A may work for a while, but it increasingly becomes a nightmare to maintain and expand.

The single responsibility principle helps you avoid such a scenario. It is also often described as stating that **“a module should only have one reason to change**.” But this leaves room for an interpretation that may be too restrictive. Since a module changes in response to the requirements of other actors, it is sufficient to respond to one or more changing requirements, provided all changes come from one and only one actor.

Suppose we are tasked with building an application to manage a book store. In the book store application, we have a **BookQuery** class that has the following method:

```java
getBookPricebyTitle(title)
```

The clerk at the bookstore uses this method to retrieve the prices of books at the customers’ requests. Now an inexperienced developer who has never heard of the single responsibility principle is tasked with adding functionality to the software that will allow the accounting team to retrieve the books sold in a specific timeframe and the money made through those sales.

Since the **BookQuery** class is already connected to the database, the developer thinks: “*That’s easy. I just need to add two more methods to the class*.”

```java
getListOfBooksSold(date)
getTotalPriceofBookList(booklist)
```

The first function simply takes a date and then queries the company database for all books sold after that date, returning a list of titles. The second function accepts that list as an argument. It then iterates through the list using a simple for loop, calling the **getBookPricebyTitle(title)** method for every title in the list and adding up the total price.
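A minimal, self-contained sketch of that loop makes the problem concrete. The price table, titles, and numeric prices below are hypothetical stand-ins for the database — the point is that the sum is always computed from *current* prices:

```java
import java.util.List;
import java.util.Map;

public class FlawedTotal {
    // Hypothetical stand-in for the database table of *current* prices.
    static final Map<String, Double> CURRENT_PRICES =
            Map.of("Dune", 12.99, "Emma", 9.99);

    static double getBookPricebyTitle(String title) {
        return CURRENT_PRICES.get(title);
    }

    // The developer's loop: it sums today's prices, not the prices the books
    // actually sold for - the root of the accounting error described below.
    static double getTotalPriceofBookList(List<String> booklist) {
        double total = 0;
        for (String title : booklist) {
            total += getBookPricebyTitle(title);
        }
        return total;
    }
}
```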

The developer reports that he has solved the problem. The class and its functions are being tested on the current database of books and deployed into production.

The accounting department uses the function to calculate the revenue of the book store. At the end of the year, the accounting department realizes that they have consistently overstated their profits and paid much more in taxes than they would have had to. The owners are extremely angry. What happened?

Due to inflation, the prices of books have increased continuously over the year. Price changes are immediately entered into the database so that the clerks report the up-to-date price to the customers.

The accounting department has relied on the same method as the clerks for retrieving the most recent prices. But the book may have been sold several weeks ago when it still had the old, cheaper price.

This issue is a direct consequence of ignoring the single responsibility principle. By adding the two methods, the class has become responsible to the clerk as well as the accounting department. Once the requirements between the two actors diverged, a problem ensued.

Software architects often emphasize the importance of separating concerns. As the name implies, you want to ensure that the responsibilities of different modules are cleanly separated. In the example of the bookstore above, finding books in the store and accounting for the books sold are two different concerns. Accordingly, they should be handled by different classes and potentially by different modules. For example, you could create an **accounting** module and a **book search** module. The **accounting** module queries the database separately from the **book search** module and retrieves the historical data it needs to calculate the total value of books sold.

The developer above clearly violated the separation of concerns in his initial implementation. But by applying the single responsibility principle, he was forced to think about how to separate the code responsible to the two actors that were using it. By ultimately putting the two functions used by the accounting department into a new class, he automatically moved towards a better separation of concerns.

**In other words, the single responsibility principle provides guidance during the development process that enables you to achieve a separation of concerns at a system level.**

To see how the single responsibility principle works in practice, we use our previous example of the book store and write an implementation in Java.

First, we need to declare the BookQuery class that our developer is going to extend in violation of the SRP.

```java
public class BookQuery {
    public String getBookPriceByTitle(String title) {
        return "price";
    }

    // private methods for database access..
}
```

Now our developer gets to work adding the functionality required by the accounting department. The accountants want to be able to retrieve a list of all the books sold, along with the total price of all the books. The developer adds two corresponding functions to the BookQuery class. The getListOfBooksSoldBetween function takes two date strings as an input, goes to the database, and retrieves all books that were sold between the two specified dates.

Then the list of books gets passed to the getTotalPriceofBookList function, which determines the current price of every book and adds those prices all up.

```java
import java.util.ArrayList;

public class BookQuery {
    public String getBookPriceByTitle(String title) {
        // database access magic happening here
        return "price";
    }

    public ArrayList<String> getListOfBooksSoldBetween(String beginningDate, String endDate) {
        // database access magic happening here
        return new ArrayList<String>();
    }

    public String getTotalPriceofBookList(ArrayList<String> booklist) {
        // database access and price calculation magic happening here
        return "totalprice";
    }
}
```

As we’ve seen above, this structure would lead to potentially unforeseen consequences and is a violation of the single responsibility principle.

We realized in the section above that the prices that the clerk retrieves may be different from those that the accounting department needs. The database should have a table with the books available in the store reflecting current prices and a different table of books sold that reflects the historical prices.

In our code, we should use different classes to access the different tables. First of all, we remove the offending methods from the BookQuery class. I’ve also moved the BookQuery class into a new Java package, “BookSearch” to indicate that it is in a different module.

```java
package BookSearch;

public class BookQuery {
    public String getBookPriceByTitle(String title) {
        // database access magic happening here
        return "price";
    }

    // private methods for database access..
}
```

Then we create a new class which I’ve called “RevenueCalculator.” This class will host the methods removed from the previous class. It’s in a module called “Accounting.”

```java
package Accounting;

import java.util.ArrayList;

public class RevenueCalculator {
    public ArrayList<String> getListOfBooksSoldBetween(String beginningDate, String endDate) {
        // database access magic happening here
        ArrayList<String> books = new ArrayList<String>();
        return books;
    }

    public String calculateTotalBookRevenue(ArrayList<String> booklist) {
        // database access and price calculation magic happening here
        return "totalprice";
    }

    // private methods for accessing the database. These methods should access
    // a different database table compared to the BookQuery table.
}
```

You may have realized that I’ve renamed the second method to something I find more appropriate. After all, the accounting department is not interested in the **price** of the books but in the **revenue** the store made from selling books in the past. This class will also host several private methods for accessing the database. But these will access a different table or set of database tables compared to the first class.

By separating the code so that every module is responsible to only one actor, we achieve a cleaner separation of responsibilities. This separation helps us prevent unforeseen interdependencies that can lead to errors down the line once a system is already running in production.

The post The Single Responsibility Principle first appeared on Programmathically.

Learning the required mathematics is often perceived as one of the biggest obstacles by people trying to get started in machine learning. Mathematical concepts from linear algebra, statistics, and calculus are foundational to many machine learning algorithms.

Luckily, the past several years have seen a proliferation of online courses and other learning resources. Thus, mastering the mathematical prerequisites has become a lot easier, especially for self-learners. Now the question is not so much how to learn the requisite math, but which course to choose to do so.

This post is designed to help you understand what to look out for when picking a good math course and how to choose the best course for your requirements. In the course of this article, I will recommend and briefly review three online courses that I found most valuable for getting the mathematical foundation in place. With two of these three courses (Coursera specializations), you should learn all the math you need for a successful career in machine learning.

If you don’t want to read through the whole article and just want a recommendation for the best math courses, here’s the takeaway:

The best math course in my opinion is the “Mathematics for Machine Learning” specialization on Coursera. It has the right combination of breadth and depth while still being accessible to an audience who hasn’t studied math beyond high school. It also isn’t too long: within 3-6 months, you should be able to get the mathematical foundations you need to successfully understand machine learning algorithms. It doesn’t cover statistics in sufficient depth, though.

To get a good grounding in statistics with a sufficient amount of programming practice, I recommend taking the “Statistics With Python” specialization. For the best learning experience, take the first two courses of the “Mathematics for Machine Learning” specialization followed by the “Statistics with Python” specialization. Optionally, you can then still take the third course of the “Mathematics for Machine Learning” specialization on principal components analysis.

Structurally, a good math for machine learning course should offer a good balance between theory and practice. When it comes to understanding abstract mathematical concepts, most people will not benefit from hours of passively listening to lectures. A teacher can explain something in 10 minutes, but it may take significantly longer for the listeners to process what has been said. And you probably won’t internalize and really understand it unless you’ve had to actively solve a problem involving those concepts.

Most people who learn math for machine learning want to become productive machine learning engineers and data scientists as quickly as possible. You probably don’t want to spend years building mathematical foundations before you even get started with machine learning. Accordingly, a course should focus on the essentials that are necessary to hit the ground running. Naturally, everyone learns differently. But as a rule of thumb, a good course should be able to convey the mathematical essentials necessary to become a productive machine learning engineer within 3-6 months of focused part-time study.

Therefore, lectures should be kept short and to the point. As a rule of thumb, I’d say a lecture video shouldn’t be longer than 15 minutes. If the concept is too complicated to fully explain in 15 minutes, a good instructor would break it down into smaller chunks across multiple videos.

After 2-3 videos, I’d expect some hands-on practice in the form of problems to solve with pen and paper. Every large conceptual block should be followed by a hands-on programming assignment that auto grades your results. Stay away from courses that only offer peer grading as this is ineffective for math courses. You want your mistakes to be detected by an algorithm or by a professional in the field, not by a peer who is as much a beginner as you are.

In terms of content, you want to look for courses that have linear algebra as a major building block. Machine learning leans heavily on matrix and vector operations. Accordingly, matrix multiplication and dot products should definitely be part of the course. The course should also discuss special matrix forms such as inverse and identity matrices, transposes, and basis changes. Lastly, I’d also like to see a discussion of eigenvectors, eigenvalues, and dimensionality reduction techniques such as the singular value decomposition.

I’d also like to see an introduction to basic probability concepts, including random variables, probability mass and probability density functions, conditional probability, and Bayes’ theorem.
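Bayes’ theorem, for instance, relates the two conditional probabilities of events A and B:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```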

Furthermore, the course should build an understanding of expected values, variance, covariance, and correlation. It should also familiarize you with common statistical distributions. The most important distributions to understand are the normal distribution and the Binomial distribution.

Lastly, the concept of likelihood and its difference from probability is also vital to understand.

Calculus is important once you get into more advanced machine learning concepts. Differential calculus is necessary to understand how neural networks backpropagate information. Machine learning relies on probability distributions to model the distribution of data. Integral calculus is necessary to calculate the expected value of continuous data distributions.
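As a concrete instance, the expected value of a continuous random variable $X$ with probability density $f(x)$ is computed via an integral over the whole real line:

```latex
\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx
```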

A good course should therefore cover first and second-order derivatives in matrix form as well as basic integration. I’d also like to see practical applications to backpropagation and gradient descent.

While calculus is very helpful, you can become a successful applied machine learning engineer without knowing more about it than what you learned in high school. Linear algebra and statistics are the two main mathematical pillars that carry machine learning. Calculus is like a third pillar in the middle that reinforces the foundations but is not strictly necessary.

The math for ML specialization on Coursera consists of 3 courses, each of which is 4-6 weeks long. Each course contains several video lectures that are delivered in 5-15 minute segments. This way, you don’t lose focus and motivation as easily as if you were sitting through hour-long lectures. The lectures are supplemented with quizzes that usually require you to solve a math problem with pen and paper. At the end of each week, you’ll find programming assignments that reinforce your understanding.

The first course focuses on Linear Algebra, while the second course is all about multivariable calculus. The last course focuses on principal components analysis, a data reduction technique that heavily relies on linear algebra and thus builds on the topics of the previous courses.

One drawback about the mathematics for machine learning specialization is that it doesn’t have a dedicated statistics course. While there are several great online statistics courses out there, having a dedicated statistics course would have made the mathematics for machine learning specialization the best and most complete resource for learning machine learning and data science math.

**Pros:**

- Short, on-point lectures
- Hands-on exercises
- Programming assignments
- Can be finished in 3-6 months

**Cons:**

- No dedicated course on probability and statistics

In summary, the mathematics for machine learning specialization comes with all the attributes that you would expect of a good math course. Lectures and the courses themselves are of reasonable length. The video lectures are supplemented with hands-on practice in the form of programming assignments and exercises.

Statistics with Python is a specialization that teaches you statistics through applied Python programming. It features short lectures followed by quizzes and programming labs. The focus is always on getting you to solve problems with Python. If you do not have a Python or even a programming background, the course has a double benefit: it will help you develop coding skills while teaching you statistics.

The specialization consists of 3 courses that build on each other. The first course covers a lot of foundations such as different types of data and how to sample from that data, and how to manipulate, explore, and visualize it in Python. The second course goes more into statistical techniques such as significance testing, and confidence intervals. In the third course, you find an introduction to statistical models such as linear regression and logistic regression. This course is a great stepping stone to machine learning because you learn to build basic models in Python, which is also the main language used for building machine learning models.

**Pros:**

- Short, on-point lectures
- Programming exercises
- Can be finished in 1-3 months

**Cons:**

- I can’t think of any. Overall, it is a great statistics course.

“Data Science Math Skills” is a very basic math course. It is only 4 weeks long and is really designed for people who have forgotten most of the math they learned in high school. The first weeks focus on mathematical notation and functions, while the later weeks familiarize you with logarithms and basic statistics. The course is split into several short video lectures so that you won’t feel overwhelmed. Every couple of videos is followed by a quiz that helps reinforce the concepts you’ve learned. Unfortunately, there are no programming assignments.

**Pros:**

- Short, on-point lectures
- Hands-on exercises
- Can be finished in 1 month

**Cons:**

- No programming assignments
- Very basic; not enough depth

I would call the course a stepping stone to a more in-depth machine learning math course.

The course is especially useful for building mathematical intuition. For example, an intuitive understanding of logarithms, exponentials, the Cartesian coordinate system, and rates of change helps tremendously in grasping non-linear functions, vector algebra, and calculus.

If you don’t have an intuitive understanding of these concepts, you would probably benefit from taking “Data Science Math Skills”.

In summary, the “Mathematics for Machine Learning” specialization in combination with the “Statistics with Python” specialization should give you all the foundations you need to launch a successful career in applied machine learning. Optionally, take “Data Science Math Skills” before the two other specializations if you feel that your math is really rusty or that you generally lack mathematical intuition.

The post What is the Best Math Course for Machine Learning first appeared on Programmathically.

The sliding window algorithm is a method for performing operations on sequences such as arrays and strings. By using this method, time complexity can be reduced, for example from O(n³) to O(n²), or from O(n²) to O(n). As the subarray moves from one end of the array to the other, it resembles a sliding window.

First, we need to define a window that meets the requirements of the challenge. On an array or string, we can delimit a subsection called a window using two pointers – say, left and right.

Using the sliding window approach, we can perform actions on specific parts of linear sequences. An array’s “windows” are contiguous subsequences of its elements. As the name implies, the window is slid over the array: once some action has been taken on the elements within it, the window slides further.

You can grow or shrink the window by incrementing or decrementing the pointers that define it.

A sliding window in computer vision refers to a rectangular region with a defined width and height that moves over an image.

Sliding windows play a crucial role in object classification, as they allow us to localize exactly “where” in an image an object sits. Utilizing both a sliding window and an image pyramid, we can recognize objects in photographs at various scales and places.

Normally, we’d use an image classifier on the window region to see if it contains an object of interest. By combining sliding windows with object classification, we can build classification systems for images that can identify objects of various sizes and positions. Despite their simplicity, these techniques are the foundation of modern neural network architectures for identifying objects in images.

**The following are some of the most important indications that a sliding window approach might be appropriate:**

- Your problem involves data structures such as arrays, lists, and strings. An image is basically a multidimensional array.
- You want to find a subrange involving the longest, shortest, or goal values in that array or string.
- Conceptually, it revolves around ideas like the longest or shortest sequence of something that meets a specific requirement.

Consider the following array as an example:

[a, b, c, d, e, f, g, h]

A sliding window would pass over the array as follows.

**The sliding window technique operates according to the following steps:**

- Determine the required window size.
- Begin with the data structure’s first window.
- In a loop, slide the window by 1 and continue calculating the result window by window.
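The steps above can be sketched in a few lines of Python; the `windows` helper below is a hypothetical illustration, not a standard library function:

```python
def windows(seq, k):
    """Yield each length-k window as it slides over the sequence one step at a time."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Slide a window of size 3 over the example array from the text.
for w in windows(["a", "b", "c", "d", "e", "f", "g", "h"], 3):
    print(w)
```

The first window printed is `['a', 'b', 'c']`, and the window then advances one element at a time until it reaches `['f', 'g', 'h']`.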

Here is an example of how the sliding window approach can be used to find the largest sum of all the successive integers in an n-item array of integers:

**Naive Approach:** Let’s look at an algorithmic implementation for solving this problem using the brute force approach. For every successive block of k elements, we begin at the block’s first index and sum the elements up to its k^{th} element. The outer for loop iterates over the starting element of each k-element block, and the inner, nested loop sums up to the k^{th} element of that block.

```python
# O(n * k) solution for finding the
# maximum sum of a subarray of size k
import sys

INT_MIN = -sys.maxsize - 1

# Returns the maximum sum in a subarray of size k.
def maxSum(arr, n, k):
    # Initialize result
    max_sum = INT_MIN

    # Consider all blocks starting with i.
    for i in range(n - k + 1):
        current_sum = 0
        for j in range(k):
            current_sum = current_sum + arr[i + j]

        # Update result if required.
        max_sum = max(current_sum, max_sum)

    return max_sum

# Driver code
arr = [1, 4, 2, 10, 2, 3, 1, 0, 20]
k = 4
n = len(arr)
print(maxSum(arr, n, k))  # Output = 24
```

The time complexity of the preceding code is O(k*n) because it has two nested loops.

Using a sliding window of length k, we can solve the same problem in linear time. Initially, the window sits at the very left of the array, 0 units from the start. We slide the window over the array arr[] of size n and calculate the current sum of the k elements under it. Then we push the window forward by one unit of distance. After each step, the window covers the next k items.

**Applying sliding window technique:**

- We use a simple loop to compute the sum of the k components under the window and store it in the variable window_sum.
- While moving through the array, we keep tabs on the maximum total found inside a window.
- To calculate the current sum of a block of k items, either sum up all the items under the window or subtract the first element from the previous block and add the last element of the current block.

Let’s look at a step-by-step example of how the window slides over the data set.

Consider an array arr[] = {5, 2, -1, 0, 3} with k = 3 and n = 5.

*Here we have calculated the initial window total from index 0 onwards. The window sum is 6 at this point. Since this is the first window, we set the maximum sum to 6.*

*After that, we move our window by one unit. As a consequence, it now removes 5 from the window and adds 0 at the end. We deduct 5 from our new window sum before adding 0. Now, the new value of our window sum is 1. This window sum will be compared to the current maximum sum. We won’t adjust the maximum sum because it’s smaller than the current maximum.*

*This time we move our window by a unit index to get the new window sum of 2. Once more, we check if the current window sum exceeds the maximum sum reached thus far. The maximum sum is not changed because the current sum is smaller than the maximum.*

*This array’s maximum sum, thus, is 6.*

A code implementation of this approach would look like this:

```python
# O(n) solution for finding the
# maximum sum of a subarray of size k
def maxSum(arr, k):
    # length of the array
    n = len(arr)

    # n must be greater than or equal to k
    if n < k:
        print("Invalid")
        return -1

    # Compute sum of first window of size k
    window_sum = sum(arr[:k])

    # first sum available
    max_sum = window_sum

    # Compute the sums of the remaining windows by
    # removing the first element of the previous
    # window and adding the last element of
    # the current window.
    for i in range(n - k):
        window_sum = window_sum - arr[i] + arr[i + k]
        max_sum = max(window_sum, max_sum)

    return max_sum

# Driver code
arr = [1, 4, 2, 10, 2, 3, 1, 0, 20]
k = 4
print(maxSum(arr, k))  # Output = 24
```

As you may have recognized, there is just one loop in our code, so the time complexity is linear, O(n).

The sliding window is more of a method than a formula. Algorithms of all kinds can be enhanced using sliding windows. The sliding window method is useful for reducing the time complexity of certain problems. The method can be used to solve almost any problem that fits the requirement of being able to add items sequentially or in parallel into one variable. For example, it allows us to tackle problems like calculating the maximum sum of n numbers. Compared to the brute-force approach, the sliding window algorithm computes the current sum in fewer steps. We can use this pattern for other problems as well.

The post What is the Sliding Window Algorithm? first appeared on Programmathically.

In this post, we will look at the major deep learning architectures that are used in object detection. We first develop an understanding of the region proposal algorithms that were central to the initial object detection architectures. Then we dive into the architectures of various forms of RCNN, YOLO, and SSD and understand what differentiates them.

Convolutional neural networks are essentially image classification algorithms. You can turn a CNN into a model for object detection by finding regions in an image that potentially contain objects and use the neural network to classify whether the desired object is present in the image or not. The brute force approach would be a naive sliding window algorithm where you take windows of different sizes, slide them over the image, and have the neural network classify whether there is an object or not. Unfortunately, this approach isn’t very efficient since you have the network produce thousands of classifications on every single image.

To speed up the process of finding areas of the image that potentially contain objects of interest, the computer vision research community has come up with a set of region proposal algorithms. As the name implies, these algorithms identify regions in the image that potentially contain objects of interest. The network then only needs to adjust those regions and classify the objects contained inside of them.

The earlier object detection networks such as RCNN used an algorithm called selective search to identify areas of interest. In a nutshell, selective search is a graph-based clustering algorithm that groups regions of interest on the basis of features such as color, shape, and texture. A selective search algorithm will propose a couple of thousand regions.

A region proposal network is a faster way to find areas of interest in an image and has been used in newer object detection networks such as Faster-RCNN. The basic idea of region proposal networks is to run your image through the first few layers of a convolutional neural network for object classification. A CNN for classification produces a series of shrinking feature maps through convolutional layers and pooling layers. It usually ends with fully connected layers that produce a classification. In a region proposal network, you exclude the fully connected layers and use the feature map directly. Since the feature map will be much smaller than the original image, you can perform bounding box regression and object classification on much fewer windows.

The region proposal network is essentially a convolutional implementation of the sliding windows approach. For a more detailed explanation of the convolutional sliding windows approach, check out my post on the foundations of deep learning for object detection.

The region-based convolutional neural network was one of the first networks that successfully applied deep learning to the problem of object detection.

It works by initially applying the selective search algorithm to find region proposals. The proposals are then extracted and warped to a standard size so they can be processed by a neural network. Next, you apply a standard convolutional neural network to classify the images into one of several classes.

If we stopped there, we would potentially face the issue that the bounding boxes produced by selective search don’t match up with the objects we want to detect. The CNN is trained to classify specific objects, but the selective search algorithm doesn’t know which objects these are. To correct the region proposals from selective search, we also train the convolutional neural network to perform bounding box regression.

This means that the network performs a classification to detect whether an object of interest is present or not. In addition, it produces a transform to adjust the bounding boxes produced by the region proposal algorithm.

The region proposal algorithm produces the x and y coordinates of the bounding box midpoint as well as the height h and width w of the box. The bounding box transform produced by the neural network is a set of 4 numbers {tx, ty, tw, th} that are multiplied with the original height and width values and added to the original coordinates to produce a translation of the original bounding box in the image coordinate system.

x_{new} = x + wt_x \;\;\;\; y_{new} = y +ht_y

The original height and width produced by the region proposal algorithm are scaled by multiplying them with the exponentiated value of the transform produced by the network.

w_{new} = w \times exp(t_w) \;\;\;\; h_{new} = h \times exp(t_h)
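To illustrate the two formulas, here is a minimal Python sketch (the function name is ours) that applies a transform {tx, ty, tw, th} to a box given as its center coordinates plus width and height:

```python
import math

def apply_box_transform(box, t):
    """Apply an RCNN-style transform (tx, ty, tw, th) to a box (x, y, w, h).

    x, y are the box center; the offsets are scaled by the box size, and the
    width/height deltas are exponentiated so the scale factor stays positive.
    """
    x, y, w, h = box
    tx, ty, tw, th = t
    return (x + w * tx,          # x_new = x + w * tx
            y + h * ty,          # y_new = y + h * ty
            w * math.exp(tw),    # w_new = w * exp(tw)
            h * math.exp(th))    # h_new = h * exp(th)

print(apply_box_transform((50.0, 40.0, 20.0, 10.0), (0.1, -0.2, 0.0, 0.0)))
```

With tw = th = 0, the exponentials are 1, so the box keeps its size and is only translated.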

Since these are continuous numeric outputs, the network performs a regression. The regression part and the classification part require different loss functions that are subsequently combined in a weighted multitask loss.
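To make the idea of a weighted multitask loss concrete, here is a toy Python sketch. The exact losses and weighting used by the RCNN variants differ, so treat the function below as an illustrative assumption (binary cross-entropy for classification plus an L2 box loss), not the published formulation:

```python
import math

def multitask_loss(p_object, is_object, box_pred, box_true, reg_weight=1.0):
    """Weighted sum of a classification loss and a box-regression loss.

    Binary cross-entropy for the object/no-object score (p_object must be
    strictly between 0 and 1), plus an L2 loss on the four box parameters,
    counted only when an object is actually present. The relative weighting
    of the two terms (reg_weight) is a hyperparameter.
    """
    cls_loss = -(is_object * math.log(p_object)
                 + (1 - is_object) * math.log(1 - p_object))
    reg_loss = sum((p - t) ** 2 for p, t in zip(box_pred, box_true)) if is_object else 0.0
    return cls_loss + reg_weight * reg_loss

print(multitask_loss(0.9, 1, (50.0, 40.0, 20.0, 10.0), (52.0, 38.0, 20.0, 10.0)))
```

Gradients from both tasks then flow back through the shared network, which is what lets one model learn classification and localization at the same time.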

Based on the multitask loss, you can train your neural network to produce the classifications and bounding-box transforms. As the last step, you need to filter through the boxes. Remember, the original region proposal algorithm produced thousands of region proposals. There are several methods available for selecting the best bounding boxes. For a discussion of those methods, refer to my post on the foundations of deep learning for object detection.

A major drawback of RCNN is that it is extremely slow. After all, you have thousands of sub-images corresponding to region proposals, and you have to send them through the neural network individually. Extracting all the regions and sending them through the network for classification took tens of seconds per image. For real-time applications such as self-driving cars, this delay is unacceptable.

Furthermore, no learning is happening on the part of the region proposal algorithm since RCNN was using selective search, not a neural network.

Fast RCNN addressed these issues by running the image through a convolutional neural network before performing classification on the subregions. As discussed above, this network, known as the backbone network, extracts a feature map using a convolutional approach. Then you use a region proposal method such as selective search to find regions of interest in the feature map and use a convolutional neural network to classify those subregions. Instead of ending up with thousands of sub-images that have to be classified individually, you have very small features in the feature map. Classifying those features individually is much faster than classifying the sub-images produced by selective search. Due to the convolutional operation performed before, we have a large degree of feature overlap/sharing between the features on the map. Bounding boxes are proposed on the basis of the spatial location of the feature inside the feature map.

While performing region proposals on a single feature map helped speed up Fast RCNN significantly, it still relied on selective search to find regions of interest. Faster RCNN managed to improve speed even further by using a region proposal network instead of applying selective search.

You only look once (YOLO) marks a break with the previous approach of repurposing object classification networks for object detection. YOLO breaks new ground by using a single fully connected layer to predict the locations of objects in an image, essentially requiring only a single pass to find the objects of interest. This results in a massive acceleration at inference time compared to previous architectures like RCNN.

The fully connected layer that produces the location of the objects of interest has the following dimensions:

number\;of\;grid\;cells \;\times\; anchor\;boxes \;\times\; (\;confidence \;+\; bounding\;box\;coordinates \;+\; num\;classes\;)

In the example above, we split the image into a 2 x 3 grid. We use two types of anchor boxes (one that is very tall but not very wide, and another one that is wide but not very tall). Then we have another neuron in the fully connected layer that expresses the confidence that an object is present or not. The bounding box has 4 values (the width and height, as well as the x and y coordinates of its center). Lastly, we also have one neuron for every class that we are trying to detect. Since we only detect one class, we could theoretically do away with it, but let’s keep it for completeness’ sake.

Accordingly, you end up with the following dimensions.

2 \times 3 \times 2 \times ( 1 + 4 + 1)
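The arithmetic can be checked with a few lines of Python (the variable names are just for illustration):

```python
# Size of YOLO's prediction output for the toy example in the text:
# a 2 x 3 grid, 2 anchor boxes per cell, and per box one confidence
# score, 4 bounding-box values, and 1 class probability.
grid_rows, grid_cols = 2, 3
num_anchors = 2
values_per_box = 1 + 4 + 1  # confidence + (x, y, w, h) + one class

output_size = grid_rows * grid_cols * num_anchors * values_per_box
print(output_size)  # 72
```

So the final fully connected layer for this toy setup would need 72 output neurons.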

The YOLO architecture looks very much like an image classification architecture. It has several convolutional and pooling layers that extract features from the image, followed by two fully connected layers at the end that generate the final predictions.

The convolutional layers extract the grid cells and the fully connected layers then output the confidence scores that the object of interest is present. In practice, you end up with a probability map where each section of one or several grid cells of the image is associated with a probability that a certain object is present.

Of course, your network will produce a lot of bounding boxes initially. You get rid of them at training time by calculating the intersection over union with the ground-truth bounding box. At test time, you narrow down your choice of boxes by eliminating those associated with a low probability. Ultimately, you use non-max suppression to generate the final bounding-box predictions.

The speed advantage of YOLO comes from its ability to run the whole image through the network once, split the image into grid cells, and predict whether an object is present or not. A disadvantage of this approach is that each grid cell can only predict one class. Accordingly, it loses accuracy when you have to deal with multiple tiny objects in the same location.

To correct the shortcomings of YOLO, computer vision researchers presented SSD in 2015. Like YOLO, SSD relies on a convolutional backbone network to extract feature maps. But it doesn’t rely on 2 fully connected layers to produce the bounding boxes. Instead, it stacks several convolutional layers that produce progressively smaller feature maps. At each size, the network produces a score for each grid cell to determine how well the cell matches the desired object. This allows SSD networks to scale the bounding box to the size of the desired object and thus detect objects of different sizes. Compared to YOLO, SSD is more accurate because of its ability to produce bounding boxes at different scales. Of course, it also produces a much larger number of bounding boxes resulting in slight losses in speed compared to YOLO. Nevertheless, SSD is still orders of magnitude faster than the original RCNN architectures.

The RCNN family constituted the first neural network architectures in the deep learning era for object detection. RCNNs combined traditional, graph-based algorithms for region proposal with neural networks for object classification. While they delivered good results, the first generations were extremely slow.

Further RCNN iterations significantly improved detection speed by introducing neural networks and convolutional operations to handle region proposals.

YOLO significantly improved detection speed by passing input through a neural network only once. It relied on fully connected layers to produce both the object classifications and the locations of the bounding boxes.

SSD corrected some of the shortcomings of YOLO by producing bounding boxes at different scales.

The post Deep Learning Architectures for Object Detection: Yolo vs. SSD vs. RCNN first appeared on Programmathically.

In this post, we will cover the foundations of how neural networks learn to localize and detect objects.

Object localization refers to the practice of detecting a single prominent object in an image or a scene, while object detection is about detecting several objects.

A neural network can localize objects in an image by learning to predict the coordinates of a bounding box around the object. A bounding box is simply a rectangular box around the object of interest. The network first performs a classification to predict whether the object of interest is present in the image. In addition to the classification, the network outputs the x and y coordinates of the center of the bounding box, as well as the width and height.

A neural network learns to detect multiple objects in an image using sliding windows. The sliding window algorithm slides multiple overlapping windows over the image of interest and detects whether an object of interest is present in the current area under the window. The network then relies on a metric called intersection-over-union to pick the best box and non-max suppression to discard boxes that are less accurate.

In a neural network trained for image classification, your last layer will use a sigmoid or softmax loss function to distinguish between two or more classes. The result will be a probability that the object belongs to a specific class. When performing object localization, we can extend a classification model by adding neurons to the network that also predict the location of the object of interest in the image.

The standard way of representing the location of an object in the image is to draw a bounding box and have the neural network predict the x and y coordinates of the center of the box as well as its width and height.

To teach the neural network to classify and localize the object, you have to provide it with the original images and the coordinates, height, and width of the bounding box.

The network now has to perform two very different tasks at once. Determining the type of object is a classification task. Predicting the bounding box is a regression task because you need to quantify the difference between the predicted and the actual coordinates and box measurements.

As a consequence, you also need two different loss functions that you subsequently unify in a combined loss known as multitask loss. For the classification task, you use a loss function such as cross-entropy, while for the regression task, you use L2 loss or the intersection over union (a special kind of loss function for bounding boxes which we will cover in a later section).

As the name implies, the sliding window algorithm slides a window over some input array and applies an operation to the content under the window.

A classic example where the sliding window technique is useful would be finding the greatest sum of 2 consecutive elements in an array. To find that sum, we slide a window that encompasses exactly two numbers over the array and add the numbers together.

[1, 2, 3, 4]
Window 1: [1, 2] -> 3
Window 2: [2, 3] -> 5
Window 3: [3, 4] -> 7

The result equals 7.
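A minimal Python version of this size-2 window (the function name is ours) looks like this:

```python
def max_pair_sum(nums):
    """Greatest sum of two consecutive elements, computed with a size-2 sliding window."""
    best = nums[0] + nums[1]
    for i in range(1, len(nums) - 1):
        # Slide the window one step and keep the best pair sum seen so far.
        best = max(best, nums[i] + nums[i + 1])
    return best

print(max_pair_sum([1, 2, 3, 4]))  # 7
```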

In the case of images, we can use the sliding window to detect whether a feature or an object is present under the window.

Since an image is a 2-dimensional array, we have to slide the window horizontally and vertically. Each sub-image contained in a window is fed to a convolutional neural network which is used to classify whether an object of interest is contained in the image or not.

To increase the probability of finding an object of interest, you want to keep the stride (the size of the step between boxes) as small as possible. If you only advance your box by one pixel in each step, your stride would equal 1. The obvious disadvantage of this approach is that you will end up with almost as many boxes as there are pixels in the image. The computational cost would be prohibitively high, making this approach infeasible.

In the illustration above, we use a stride of 1 and a window size equivalent to the image size minus 1. As a consequence, we end up with 4 sliding windows. This is just for illustrative purposes. In a real implementation, the sliding window will likely be much smaller than the image, and you’ll end up with a lot more windows.

You can make this implementation a lot more efficient by making the sliding window convolutional. Instead of just applying a convolutional neural network to the single windows, you make the window extraction convolutional as well. The basic idea is that you apply a convolutional filter operation to the content under the sliding window and generate a feature map. If you are not sure how this works, check out my post on convolutional filters.

The convolutional operation will generate a feature map that is smaller than the original image by (filter size – 1) pixels in each dimension. If we have a 10×10 input image and apply a 5×5 convolutional filter to it, the result should be a 6×6 feature map.

By applying pooling with a 2×2 filter and a stride of 2, we can cut the size of the feature map in half to 3×3. After another convolution operation with a 2×2 filter, we end up with a 2×2 feature map.

Each of the 4 windows in the feature map should now correspond to the 4 sliding windows in the original 10×10 image. Just like in a convolutional neural network, we can feed the outputs of the feature map to a fully connected layer and perform classification on it.
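These sizes follow from the standard output-size formula for valid convolutions and pooling, which we can sketch in Python (assuming no padding; the function name is ours):

```python
def conv_output_size(size, filter_size, stride=1):
    """Spatial size after a valid (unpadded) convolution or pooling operation."""
    return (size - filter_size) // stride + 1

size = 10                                   # 10x10 input image
size = conv_output_size(size, 5)            # 5x5 convolution -> 6x6
size = conv_output_size(size, 2, stride=2)  # 2x2 pooling, stride 2 -> 3x3
size = conv_output_size(size, 2)            # 2x2 convolution -> 2x2
print(size)  # 2
```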

Instead of running the neural network on each of the 4 subwindows in succession, the convolutional implementation allows us to classify all windows in parallel in a single pass through one neural network.

Once the neural network has detected a bounding box, we need a way to evaluate how well the predicted bounding box corresponds to the ground truth bounding box. The method for quantifying the difference is known as intersection over union. **The intersection over union measures the overlap between the areas of the predicted box and the ground-truth box. The formula, as the name implies, is a simple division: the area of the intersection of the two boxes, divided by the area of their union.**

IoU = \frac{groundtruth\; \cap \;predicted }{groundtruth\; \cup \;predicted}

This formula gives you a ratio between 0 and 1. The larger that ratio, the more your predicted box overlaps with your ground truth box, and thus the more accurate your prediction. If you have several bounding boxes, the IoU enables you to choose the best box by picking the one with the highest ratio.

Conventionally, you want to aim for an IoU of at least 0.5.
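A straightforward Python implementation of IoU for axis-aligned boxes might look like this (a sketch, assuming (x1, y1, x2, y2) corner coordinates rather than the center/width/height format used elsewhere in the post):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes yield an empty intersection
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

Identical boxes give an IoU of 1, and disjoint boxes give 0, matching the 0-to-1 ratio described above.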

When your sliding window moves over the image in small steps, it is likely that multiple windows will end up detecting the same object. It would be wasteful and confusing to produce multiple boxes, so we rely on a technique known as non-max suppression to suppress all boxes except for the best one.

When generating bounding boxes, the neural network will produce the coordinates of the box, as well as a confidence score that the box actually contains the desired object.

After the neural network has generated the bounding boxes, you use non-max suppression in the following way:

- Select the bounding box ‘a’ with the highest confidence score.
- Compare the proposal ‘a’ with the highest confidence with every other proposal by calculating the intersection over union.
- If the overlap between a and the other proposal is higher than a chosen threshold, remove the other proposal.
- Next, you choose the box with the highest confidence score out of the remaining boxes and repeat the process until no more boxes are left.

This process ensures that, for one area containing an object, there is only one significant bounding box.
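The steps above can be sketched as a short Python function; this is a minimal illustration (the box format and threshold are assumptions), not a production implementation:

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes.

    boxes are (x1, y1, x2, y2) corners; returns the indices of the kept boxes.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    # Candidate indices sorted by confidence, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # Drop every remaining box that overlaps the winner too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The two heavily overlapping boxes collapse to the higher-confidence one, while the distant box survives untouched.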

A major problem with the sliding-window based object detection mechanism is that each box can only contain one object. Anchor boxes allow us to detect multiple objects per window. To achieve the goal of multi-object detection, we need to define multiple anchor boxes depending on the number of objects we expect to detect. We also define the boxes with different dimensions to distinguish between objects of different shapes.

In the example above, we have two anchor boxes. This means you need to get your network to produce the classification for both objects, in addition to each box’s height and width and the x and y coordinates of its midpoint.

Now of course you may run into the problem that you have several anchor boxes assigned to a single object. To address this issue, you rely on the intersection over union again to calculate the overlap between the predicted bounding boxes and the ground truth.

Furthermore, the network detects the midpoint of the bounding box by finding the window that it deems to be closest to the midpoint of the object, and using the window’s midpoint. In rare cases, you may run into the problem that two bounding boxes have the same midpoint. But this shouldn’t happen if you keep the size of the windows relatively small.

The post Foundations of Deep Learning for Object Detection: From Sliding Windows to Anchor Boxes first appeared on Programmathically.

Machine learning has emerged as one of the hottest technology trends. Salaries for skilled machine learning engineers are through the roof, and many companies are unable to fill open positions in the field. Moving into machine learning, therefore, can be a very promising career move.

But how difficult is it to get a job in machine learning? What skills do you need, and how do you acquire them?

In this post, we develop a learning roadmap for anyone looking to become proficient in machine learning, attempt to answer all the important questions, and discuss the learning resources for self-study that should get you to your first machine learning job.

Here is the process for learning machine learning in a nutshell:

**Take an introductory machine learning course with hands-on practice labs**.

Coursera has a range of good machine learning courses, including Andrew Ng’s legendary ML course.

**Build 2-3 pet projects.**

Find an open-source dataset online (Kaggle is a great source). Import and preprocess the data, and build a model using the tools and skills you’ve learned in the course. Try to tune the model and improve its performance by manipulating the data and the model hyperparameters.

**Pick an area of specialization.**

At this point, you should have a decent basic understanding of machine learning. You probably also have an idea of what area you want to apply machine learning to. Based on your personal interest and goals, I recommend picking an area of specialization and learning the tools and techniques specific to that area.

For example, if you are interested in self-driving cars and robotics, you probably want to focus on deep learning for computer vision.

If you want to focus on natural language processing and language AI, you would also continue your learning journey into deep learning with a focus on sequence models and transformers. In both cases, you need to learn more about neural networks and at least one deep learning framework like TensorFlow or PyTorch.

If, on the other hand, you are more interested in business analytics, you are likely going to focus more on classical methods and statistical techniques such as linear regression.

**Take another online course in your area of specialization**.

Nowadays, there are tons of high-quality online courses in various areas of specialization. Coursera, Udemy, and Pluralsight are great places to find courses.

**Build more pet projects of increasing difficulty.**

For example, if you intend to focus on computer vision, you could start building a vanilla convolutional neural network for classifying images into two classes. The cats and dogs dataset on Kaggle is a great starting point. Next, you can try to build a multiclass classification model and do transfer learning with a more advanced architecture such as ResNet. Lastly, try to build an object detection or image segmentation model or even a generative adversarial network. Generative networks are probably the most difficult to train.

**Read books and papers in your area of specialization and create implementations in code.**

Courses give you a general introduction to a subject, but they won’t give you the detailed knowledge required to build true expertise. That expertise is mainly built by solving problems you encounter through hands-on practice. To help you find good solutions to those problems and develop best practices, you will have to revert to papers, books, and online discussion forums.

Understanding technical research papers usually requires familiarity with mathematical notation and good foundations in machine learning. If you are able to take a moderately difficult paper and create a working implementation of the algorithm in code, it is a good indication that you have moved beyond the rookie stage. It is also a strong signal to a potential employer that you are able to understand and synthesize information at the cutting edge and derive practical use from it for your employer.

While papers present specific solutions to concrete challenges, I found that books often help develop conceptual knowledge and best practices to address those challenges.

While machine learning is a difficult subject to learn, most people can succeed with the right learning strategy and a good amount of discipline and patience. How hard your learning journey will be also depends a great deal on the knowledge and skills you already bring to the table. A strong math background in linear algebra and statistics will go a long way toward a conceptual understanding of machine learning algorithms. Programming experience, especially in Python, will help with practical implementation and understanding of machine learning frameworks.

Even complete beginners can succeed in machine learning. Non-technical people can learn to apply machine learning algorithms without a working knowledge of advanced math. The only real prerequisite to getting started with applied machine learning is programming. If you don’t know how to code, I strongly encourage you to take at least an introductory Python course before starting your machine learning journey.

If you want to read research papers and understand machine learning research, I encourage you to build a foundation in linear algebra, statistics, and possibly calculus.

An ambitious beginner with no coding background can learn the foundations of programming and applied machine learning in 3-6 months.

If you are an experienced developer, you can probably learn to build simple machine learning models in a few weeks.

Mastering the subject and building the knowledge required of a senior machine learning engineer will take years of dedicated practice and experience.

Programming skills are a necessity if you want to build any kind of customized machine learning model. That situation might change in the future as no-code ML solutions are cropping up. I still doubt that these tools will ever allow the same kind of customization that programming languages enable.

The go-to language for most machine learning engineers is Python. It is also one of the easiest languages to learn. Whether you don’t have any programming skills to start with or whether you are an experienced developer in another language, picking up Python is definitely a good idea if you are interested in machine learning.

Math is often a major obstacle for people who want to break into AI and machine learning. The good news is that applied machine learning requires significantly less math than research. You should be familiar with vectors and matrices and how to add, subtract, and multiply them. Some basic statistics, including random variables, variance, expected values, and normal distributions, should also be part of your toolkit.
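If you want a quick self-check of that toolkit, here is a small sketch in NumPy (the matrices and sample size are arbitrary illustrations):

```python
import numpy as np

# Matrix addition, subtraction, and multiplication.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A + B)  # elementwise sum
print(A - B)  # elementwise difference
print(A @ B)  # matrix product

# A random variable drawn from a normal distribution:
# its expected value is ~0 and its variance ~1.
x = np.random.normal(loc=0.0, scale=1.0, size=100_000)
print(x.mean(), x.var())
```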

Before you learn about specific algorithms, you should build a foundational knowledge of machine learning. This includes an understanding of the different types of machine learning, the bias-variance tradeoff, as well as regularization methods.

The first learning algorithms that most people encounter are linear regression and logistic regression. Understanding how they work really sets the stage for a better understanding of more advanced machine learning algorithms such as neural networks.

scikit-learn is a Python package mainly used for classical machine learning algorithms and data analysis.

Learn to apply linear and logistic regression in scikit-learn. Import and analyze some spreadsheet data.
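As a minimal sketch of what that looks like (using scikit-learn’s bundled toy datasets in place of a hypothetical spreadsheet):

```python
from sklearn.datasets import load_breast_cancer, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Logistic regression for classification on a bundled toy dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"classification accuracy: {clf.score(X_test, y_test):.2f}")

# Linear regression on synthetic regression data.
Xr, yr = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(f"R^2: {reg.score(Xr, yr):.2f}")
```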

Before you can build a model, you need to retrieve the data. In many cases, it won’t be in a folder on a local hard drive.

Often that data will be stuck in a database. Querying relational databases, getting the data you need, combining, and manipulating it with SQL is a skill every machine learning engineer should have.
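Here is a sketch of that skill using Python’s built-in sqlite3 module as a stand-in for a real relational database (the tables and column names are invented for illustration):

```python
import sqlite3

# An in-memory database standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 5.0);
""")

# Combine and aggregate data across tables with a JOIN and GROUP BY.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Ada', 35.0), ('Grace', 5.0)]
```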

You may also have to retrieve the data from an online location via API calls. Querying APIs and dealing with asynchronous data flows is another vital skill.

Most of the data you will encounter in the real world won’t be neatly organized in spreadsheets. One of the most important tasks of a data scientist or machine learning engineer is to bring that data into a form that is suitable for model building. This process is known as data preprocessing.
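A tiny preprocessing sketch with pandas (the records and column names are made up for illustration): fill in missing values and one-hot encode a categorical column.

```python
import pandas as pd

# Hypothetical raw records with missing values and a categorical column.
raw = pd.DataFrame({
    "age": [25, None, 31],
    "city": ["Berlin", "Paris", "Berlin"],
    "income": [40_000, 52_000, None],
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].mean())
clean = pd.get_dummies(clean, columns=["city"])  # one-hot encode the category

print(clean.isna().sum().sum())  # 0: no missing values remain
```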

As a machine learning engineer, you may also be in charge of automating the data preprocessing by building data pipelines. In fact, data retrieval, data processing, and building software to automate the processing of data have become such an important and large component of the data science and AI sector that it has spawned a new role: the data engineer. In some organizations, you may have the luxury of having a data engineer who takes care of data processing for you. But it is still an invaluable skill for any machine learning engineer.

Now you finally get to actually building and applying machine learning models. At this stage, you may pick an area of specialization.

For business analytics and data mining (the largest sector), you want to develop broad experience with many statistical and classical machine learning techniques. In recent years, a few methods, such as extreme gradient boosting, have come to dominate classical machine learning methods, but other algorithms such as decision trees or Naive Bayes classifiers are still very valuable.

In this field, data visualization and communication are also especially important. Your job is to gain insights from data and communicate those insights to other people, such as executives and clients. Those people often don’t have a lot of time and want the information to be as easily and intuitively digestible as possible.

If you are interested in using machine learning to build computer vision or language processing applications, you want to focus on deep learning and neural networks.

You start by learning about neural network architectures and how to build them using frameworks like TensorFlow and PyTorch. At this point, it also makes sense to review some calculus and linear algebra because neural networks build on matrices and differential calculus.

For a focus on computer vision, you need to become familiar with convolutional neural networks. Once you understand the basic architecture of convolutional nets, familiarize yourself with common architectures and transfer learning. Depending on what you want to do in computer vision, you can further explore topics like image and video classification, object detection, segmentation, or even image generation. All of these areas have their own dedicated neural network architectures that usually build on convolutional neural networks.

If you are interested in applying machine learning techniques to areas of language processing, such as machine translation and natural language understanding, you will focus on sequence models and transformers. You also have to become familiar with concepts such as word embeddings, LSTMs, and the attention mechanism. These models and mechanisms allow you to understand language in context and create pretty accurate translations and interpretations.

Reinforcement learning is one of the most advanced areas of machine learning that is at the heart of the robotics revolution. I wouldn’t recommend focusing on reinforcement learning from the start. Firstly, as of this writing, RL isn’t as mature as deep learning and classical machine learning. Accordingly, jobs aren’t yet as widespread and are mostly confined to academic research labs.

If you do want to go down the Reinforcement Learning route, you’ll need a solid foundation in math. Topics from statistics and probability theory such as Markov chain Monte Carlo methods are especially important. Prior background in deep learning gives you a good foundation for deep reinforcement learning.

The rise of machine learning has been accompanied by the emergence of a plethora of study resources such as online courses, books, and tutorials. Every learner is different, and resources that work for one may not work for another. In this section, I’d like to briefly discuss how to choose the right learning resource for your personal needs.

As a beginner who doesn’t know anything about machine learning, it is hard to decide what to learn and which topics to prioritize. I, therefore, believe that beginners are best served with a course (online or at a university) that takes you by the hand and teaches you the foundation through lectures and hands-on exercises. There are also several hands-on books that offer an alternative to courses for those who prefer reading over watching videos.

Regardless of whether you choose a book or a video course, it is crucial that your learning resource is accompanied by labs and coding challenges.

The same principle also applies when you are diving into a subfield of machine learning for the first time, such as deep learning or reinforcement learning. Take a course or read a hands-on book to get an overview of the field as quickly and painlessly as possible.

Once you have gone beyond the beginner stage in machine learning or a subfield of it, hands-on practice should become your main learning source. But while hands-on practice is necessary to build true expertise, it is not sufficient. The very best engineers deliberately use every project as an opportunity for self-improvement. How can you do the same thing?

While you are building machine learning systems, strive to continuously improve your process and deepen your understanding of the problems you encounter. Each time you’ve overcome a challenge, don’t just move on but recall what you’ve learned and actively try to come up with a process that would have made the problem easier to solve.

Most importantly, be humble and always assume that your own implementation is suboptimal and that there are more experienced people out there who would find a better solution.

Even if you have found a solution that technically works, scout the literature for best practices. Perhaps you find a paper, a section in a book, or even a forum entry that describes how to best handle a similar situation. Based on the information you’ve found, improve your own solution.

**A Note on Reading Books and Papers**

Reading and understanding books on machine learning theory can seem overwhelming. After all, many pages are densely plastered with advanced math or code. But books often contain those hidden insights and deeper conceptual understanding that you may not find in a tutorial or an online course.

So I encourage you to read books and papers on machine learning as soon as you have established a foundational knowledge. Over time and with practice, you’ll build a deep repository of knowledge and intuition that will put you on the road to becoming a true expert.

The post How to Learn Machine Learning: A Guide for Self-Starters first appeared on Programmathically.

TensorFlow is one of the two dominant deep learning frameworks. It is heavily used in industry to build cutting-edge AI applications. While its rival PyTorch has seen an increase in popularity over recent years, TensorFlow is still the dominant framework in industry applications. Most machine learning engineers, especially deep learning engineers, are well-advised to learn how to use TensorFlow.

There are several ways to learn TensorFlow. In this post, we discuss the best ways to learn TensorFlow and point you to the corresponding learning resources. For those of you who don’t want to read a long guide, here is the most efficient way right away:

**In my experience, the most efficient way to learn TensorFlow in terms of time and money is a collection of good online courses and the official TensorFlow tutorials. Online courses guide you through the most important concepts. This saves you a lot of time that you would otherwise spend researching concepts and moving through disparate tutorials on the web. Most of the courses have programming tutorials that give you lots of hands-on practice. Online courses also cost only a fraction of the money a course at a university or coding bootcamp would cost.** **Ideally, you supplement the online course with your own hands-on projects.**

Up to an intermediate level, TensorFlow is not difficult to learn if you have a foundation in Python programming and you rely on high-level APIs such as Keras. A foundational knowledge of machine learning and neural networks in particular also helps tremendously.

An intermediate-level knowledge should be sufficient for entry-level machine learning jobs.

While there are study materials for more advanced applications of TensorFlow, you will only master advanced concepts through years of hands-on practice. These years of practice are usually required for senior machine learning jobs.

In general, a beginner can learn TensorFlow even if she is completely new to machine learning. The only real prerequisite is some programming knowledge, preferably in Python. If you are a complete beginner with no knowledge whatsoever of machine learning, the learning curve is going to be significantly steeper. Getting a solid foundation in machine learning and Python programming before you attempt to start learning TensorFlow will certainly make your learning journey smoother.

In the next sections, we will discuss in detail what that foundation consists of.

If you already know Python programming and the theoretical foundations of neural networks, you can become a productive TensorFlow developer in 1 to 2 months. If you are a complete beginner in machine learning and programming, 3-6 months is a more realistic timeline.

Building deep expertise in TensorFlow takes several years of applied practice.

To learn TensorFlow as efficiently and quickly as possible, you need to strategically structure your learning roadmap so that concepts build on one another. In this section, we look at the main TensorFlow concepts in the order I recommend learning them for maximum benefit.

I assume that you don’t just want to use pretrained models, but you want to learn to build and train machine learning models, build ML pipelines, and put models into production using TensorFlow.

Before you even start studying TensorFlow, you should be familiar with the Python programming language. Strictly speaking, you could also use the framework in C++ or JavaScript, but Python is the dominant language for machine learning. In addition, most courses and resources you’ll find online will be based on Python.

Understanding basic Python syntax, variables, data structures, loops, and functions should be enough for the beginning. But once you get into the advanced TensorFlow APIs, an understanding of object-oriented concepts such as classes, abstraction, encapsulation, inheritance, and polymorphism in Python will be necessary.

Since TensorFlow is a deep learning framework, it helps a lot to have a conceptual understanding of neural networks. While you do not have to know the calculus behind backpropagation, you should know the basic building blocks of neural networks.

**Data Dimensionality and Matrix Algebra:** You will feed data as matrices or multidimensional tensors to your TensorFlow models. Understanding the dimensions of your data and how they change is crucial for building models in TensorFlow. You should also understand basic mathematical operations from linear algebra, such as matrix addition, subtraction, and dot products.

**Types of Layers:** Understand the types of layers that are used in a network and how they affect the data dimensionality.

**Activation functions:** Know which activation function is appropriate depending on the layer and the type of problem.

**Cost Functions:** Know which cost function is appropriate depending on your data.

Other important concepts, including optimizers and advanced layer types such as convolutional layers, are not necessary prerequisites. In the beginning, default implementations for optimizers are usually enough, and you can learn about advanced deep learning concepts after you have mastered the foundations of TensorFlow.
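To make the dimensionality point concrete, here is a sketch in NumPy (shapes chosen arbitrarily) of how a single dense layer with a ReLU activation transforms a batch of data:

```python
import numpy as np

# A batch of 32 samples with 10 features each.
batch = np.random.rand(32, 10)

# A dense layer with 4 units: the weight matrix maps 10 features to 4 outputs.
W = np.random.rand(10, 4)
b = np.zeros(4)

def relu(x):
    return np.maximum(0, x)  # a common activation function

out = relu(batch @ W + b)
print(out.shape)  # (32, 4): the layer changed the feature dimension from 10 to 4
```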

Keras is an API on top of TensorFlow that significantly simplifies TensorFlow syntax. Using Keras does not preclude you from using raw TensorFlow in the same code. Accordingly, I recommend using Keras wherever you can and only using TensorFlow for functionality that Keras doesn’t handle. That way, you will be productive much more quickly.

The sequential API allows you to quickly throw together and train a model. If you rely on readily prepared datasets provided by TensorFlow, you don’t need to learn the mechanics of data preprocessing first. Data preprocessing is usually more frustrating, which is why I recommend learning to train models with the sequential API first.
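As a minimal sketch of the sequential API (the layer sizes and toy data here are arbitrary; this assumes TensorFlow with Keras is installed):

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for a readily prepared dataset:
# 100 samples with 4 features and binary labels.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")

# Stack layers with the sequential API and train in a few lines.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X, y, epochs=2, verbose=0)
print(len(history.history["loss"]))  # one loss value per epoch
```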

Sooner or later, you want to train a model on your own data. In almost all cases, your raw data won’t be in a state where you can simply import it and go on to training a model. TensorFlow gives you a rich set of tools for manipulating and preprocessing data that can be leveraged to build automated data processing pipelines. Learn how to import external data. Kaggle is a good source for finding datasets that you can manipulate with TensorFlow.

The functional API is much more powerful and allows you to customize models to a much greater extent than the sequential API.

At this point, it is also a good idea to become familiar with object-oriented Python as well as TensorFlow’s datatypes and arithmetic operators.

By that point, you have a very good foundation in TensorFlow. From here on, you can dive into more advanced areas such as distributed training, constructing data pipelines, generative deep learning, or reinforcement learning, depending on your interests and needs.

To learn as quickly as possible, I recommend that beginners find a good online course with hands-on programming tutorials. You could also attend an in-person bootcamp, but for most people this is not an option due to location, time, and money constraints, which is why I won’t discuss it here.

As you become a more advanced TensorFlow user, practice in the form of real projects is key. Ideally, you find a job that lets you apply your TensorFlow skills. But if that is not possible, unique private projects are very valuable as well.

All other tools such as books, tutorials, videos, and the official documentation are helpful supplements to a solid online course and real-life projects.

When it comes to online courses, there are several options. The main players are Coursera, Udacity, and Udemy.

**Coursera** offers several TensorFlow courses from beginner to advanced. In my opinion, Coursera offers the best overall course experience in terms of quality and price. The courses are created by DeepLearning.ai, Andrew Ng’s AI education company. Most of them feature Andrew Ng and are taught by a Google TensorFlow developer advocate. The courses come with video lectures, handouts, and hands-on programming labs that are auto-graded.

I recommend starting with the “**DeepLearning.AI TensorFlow Developer Professional Certificate**“. It is a series of 4 courses that will take you from complete beginner to an intermediate understanding of TensorFlow. The first course is a general introduction, while the later courses focus on intermediate concepts such as building convolutional neural networks and sequence models.

After finishing the professional certificate, you should get some hands-on experience building 2-4 end-to-end projects in TensorFlow. Just get some data from Kaggle and try to build a model around it. Next, take the “**TensorFlow: Advanced Techniques Specialization**” also by DeepLearning.AI. Upon finishing this one, you should have enough knowledge to hit the ground running in a developer role that involves TensorFlow.

**Udemy** offers a lot of courses related to TensorFlow, which makes it difficult to choose a particular one. The main problem with Udemy is that anyone can create a course there, so you have no guarantee for quality. Furthermore, most courses are purely video-based and lack interactive labs that are absolutely crucial to learning. For these reasons, I will not recommend any specific courses on Udemy here. This doesn’t mean there aren’t good TensorFlow courses on Udemy.

**Udacity** offers Nanodegree programs that aim to make you proficient in a technology over the course of several months of intensive study. They also offer a Nanodegree called “**Intro to Machine Learning with TensorFlow**“. You will do lots of hands-on projects that are reviewed by experienced reviewers who provide personalized feedback. I like the concept, as it is probably the closest thing you get to an in-person bootcamp. Unfortunately, it comes with a steep price tag that is many times higher than what you pay on Coursera or Udemy. The last time I looked, the TensorFlow Nanodegree cost more than 1000 USD for 3-month access.

There are some free TensorFlow programs on Udacity as well, but, somewhat understandably, they have nowhere near the depth of a paid Nanodegree.

Books are great additions to online courses and to building TensorFlow-based projects. Books often go more in-depth than courses and offer insights that you can’t even find in the official documentation.

**Deep Learning with Python** is a comprehensive introduction to deep learning written by Francois Chollet, the creator of the Keras API. Accordingly, you will find lots of great information about using Keras and TensorFlow in practice in his book. I highly recommend getting this book, especially if you are new to deep learning. It is a great companion for any TensorFlow based deep learning project.

**Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow** is another comprehensive introduction to machine learning that focuses on TensorFlow. The first part focuses on classical machine learning, so if you are completely new to machine learning, this is a great book to get. The second part offers an introduction to deep learning with practical implementations in Keras and TensorFlow.

The best resource for most questions will be the official TensorFlow documentation and the guides provided by the TensorFlow team.

**Machine Learning Mastery** has lots of good hands-on tutorials related to Keras and TensorFlow.

The post How to Learn TensorFlow Fast: A Learning Roadmap with Resources first appeared on Programmathically.

In this post, we will build the mathematical knowledge for understanding the Fourier transform from the very foundations. In the first section, we briefly discuss sinusoidal functions and complex numbers as they relate to Fourier transforms. Next, we develop an understanding of the Fourier series and how we can approximate periodic functions using sines and cosines. Lastly, we investigate how we can use all the previously introduced concepts to approximate non-periodic functions and to transform them to the frequency domain using the Fourier transform.

**The Fourier transform describes a process of transforming a signal from a representation in the time domain to a representation in the frequency domain. The Fourier transform thus allows us to decompose a signal into its component frequencies. It is applied to a wide variety of fields such as image and sound processing (light and sound signals), signal analysis, and the design of electrical circuits.**

In many areas, such as signal processing and the modeling of dynamic systems, problems are often easier to solve in the frequency domain than in the time domain. The behavior of technical systems and processes is often described by linear differential equations with fixed coefficients. Solving these equations may be very complicated. One of the most compelling reasons for using Fourier transforms is that differentiation in the time domain corresponds to multiplication by the frequency in the frequency domain, which turns differential equations into algebraic equations that are much easier to solve.

The fundamental idea behind Fourier transformations is that a function can be expressed as a weighted sum of periodic basis functions such as sines and cosines. In fact, the Fourier transform expresses a signal as a sum of sinusoids which we call “a representation in Fourier space.” Once you’ve transformed a signal into Fourier space using a Fourier transform, you can filter out specific parts of the signal based on the frequency. Then you can transform the signal back into its original space by applying an inverse Fourier transform. When processing sound or images, this is especially useful for filtering noise that would otherwise be hard to detect.

A sinusoid is a common periodic mathematical function used to express frequencies. A sinusoid is completely described by its amplitude, its frequency (or, equivalently, its period or wavelength), and its phase.

A standard sine wave completes one period (one full cycle) in 2π radians. The basic function describing a sine wave therefore is:

y = sin(2\pi)

In this case, y will equal 0 because sin(2π) evaluates the wave exactly at the end of one full period, where it returns to its starting value of zero. If we want to calculate a certain point along the sine function, we need to add time into the equation. If a full period T lasts one second and we want to find the point along the sinusoid at time t = 0.2 seconds, we can plug these values into the sinusoid.

y = sin(\frac{2 \pi t}{T})

y = sin(\frac{2 \pi \times0.2}{1})
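Evaluating this in Python (standard library only):

```python
import math

T = 1.0  # period: one full cycle per second
t = 0.2  # point in time at which we evaluate the wave

y = math.sin(2 * math.pi * t / T)
print(round(y, 3))  # sin(0.4*pi) ≈ 0.951
```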

Instead of the sine, we could equally use the cosine, which would result in a phase shift by π/2 radians.

**What is a Frequency?**

A frequency expresses the number of occurrences of a repeating event within a certain timeframe. It can be expressed and graphed as a mathematical function.

The frequency of a sinusoid can be calculated by simply taking 1 over the period T (the time it takes for the wave to complete one full cycle).

f = \frac{1}{T}

If the wave needs 1/10 of a second to complete a full cycle, it has a frequency of 10 cycles per second or 10 hertz.

Now we can easily add the frequency into the equation:

y = sin(2 \pi f t)

In practice, we encounter frequencies everywhere where we have signals. For example, a sound frequency is simply a periodic vibration over time that produces a certain pitch.

While Fourier transforms can be expressed without the use of complex numbers, the expression becomes much more succinct with them. I’m not going to offer a comprehensive introduction to complex numbers here, but only do a quick recap of the concepts that are necessary for Fourier transforms. For a comprehensive introduction, I highly recommend the free Coursera course Introduction to Complex Analysis.

A complex number is a number that consists of a real and an imaginary part. The complex number z can be expressed as a sum of a real part (x) and an imaginary part (iy).

z = x + iy

Complex numbers are represented in a coordinate system known as the complex plane, where the real numbers are plotted on the x-axis, and the imaginary numbers are plotted on the y-axis.

So the complex number z can be expressed as a vector in a 2-dimensional coordinate system.

When performing basic operations such as addition and subtraction of complex numbers, you sum or subtract the real and the imaginary parts, respectively.

z_1 + z_2 = (x_1 + x_2) + i(y_1 + y_2)

z_1 - z_2 = (x_1 - x_2) + i(y_1 - y_2)

The magnitude m of a complex number is the square root of the sum of the squares of the real part and the imaginary part.

m = |z| = \sqrt{x^2 +y^2}
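These rules map directly onto Python’s built-in complex numbers; a quick check (the values are chosen arbitrarily):

```python
# Python's complex literals use j for the imaginary unit.
z1 = 3 + 4j
z2 = 1 - 2j

print(z1 + z2)  # (4+2j): real and imaginary parts add separately
print(z1 - z2)  # (2+6j)
print(abs(z1))  # 5.0: magnitude sqrt(3**2 + 4**2)
```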

From trigonometry, we know that we can represent the sides of a right-angled triangle as products of an adjacent side and the sine or cosine of an angle θ. This knowledge forms the basis of the polar representation in a complex number plane.

Concretely, we can represent the real component x and the imaginary component y as follows:

x = m \;cos (\theta)

y = m \;sin (\theta)

With this knowledge, we can represent the complex number z in the polar system.

z=x+iy = m \;cos (\theta) + im \;sin (\theta) = m[\;cos (\theta) + i \;sin (\theta)]

Euler’s formula states that for any real number θ, the following identity holds:

e^{i\theta} = cos(\theta) + i\;sin(\theta)

Since the angle θ is a real number, we can express the complex number z as a product of the magnitude m and Euler’s number e.

z = m e^{i\theta}
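We can verify the polar representation numerically with Python’s cmath module (the sample number is arbitrary):

```python
import cmath

z = 3 + 4j
m, theta = cmath.polar(z)  # magnitude and angle of z

# Reconstruct z from its polar form z = m * e^{i*theta}.
z_back = m * cmath.exp(1j * theta)
print(abs(z_back - z) < 1e-12)  # True
```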

Before we move on to Fourier transformations, we need to understand the Fourier series.

**The fundamental idea behind the Fourier series is that a periodic function can be described as a sum of sine and cosine waves of different frequencies.**

To illustrate the idea, let’s take a periodic function y of time t. The function y is a simple step function with period 2π.

If we naively try to approximate the function through a sine wave and a cosine wave with period 2π, we get the following result (orange is the sine wave, and green is the cosine wave).

y(t) = sin(t) \\ y(t) = cos(t)

Evidently, the sine wave does a better job at approximating the real y(t) than the cosine wave. But it still isn’t close to the real form of the function. Joseph Fourier had the brilliant insight that we can get arbitrarily close to the real form of the function by adding multiple sines and cosine waves of progressively larger frequencies multiplied by a weighting factor. He expressed it in the Fourier series where a and b are weighting factors, and L is half the period of the function.

y(t) = a_0 + \sum_{n=1} ^N [a_n cos(\frac{n \pi}{L} t) + b_n sin( \frac{n \pi}{L} t)]

Theoretically, you can expand this series to an infinite number of terms N. In practice, the more terms you add, the closer you get to the original function.

Fourier further derived that given the period 2L, the weighting factors are equivalent to:

a_0 = \frac{1}{2L}\int^L_{-L} f(t)dt

a_n = \frac{1}{L} \int^L_{-L} f(t) cos(\frac{n \pi}{L} t)dt \;\; for\; n={1,2,3,...}

b_n = \frac{1}{L} \int^L_{-L} f(t) sin(\frac{n \pi}{L} t)dt \;\; for\; n={1,2,3,...}

Since, in our case, the period is equivalent to 2π, we can replace L with π:

L = \pi

Accordingly, our representation of the Fourier series can be simplified:

y(t) = a_0 + \sum_{n=1} ^N [a_n cos(n t) + b_n sin( n t)]

Not only will this approach find the correct weighting factors; it will also shrink or eliminate sinusoids that are irrelevant or detrimental and magnify those that are relevant to approximating the function. Sine is an odd function, while cosine is an even function. If you try to approximate an even function, the Fourier series should eliminate the sine terms and retain the cosine terms, and vice versa.

To see why this works, let’s calculate some of the weighting factors for our square wave. It is an odd function, so the cosine terms should shrink to 0. The square wave forms alternating blocks that I’ve labeled Q above and below the x-axis. The factor a0 gives us the initial offset from the x-axis. Since the square wave has equal proportions below the x-axis and above the x-axis, we expect a0 to be equal to 0.

To see whether this is true, let’s take the formula for a0 and calculate the integral over one full period from 0 to 2π. Since the function is periodic, the a0 calculated over one period should be representative of the full function. We split the calculation into two intervals. One ranges from 0 to π and another one from π to 2π. This allows us to precisely capture the areas Q and -Q formed by the step function. Accordingly, we will use the same split when calculating the approximations.

a_0 = \frac{1}{2\pi}\int^{2 \pi}_{0} f(t)dt = \frac{1}{2\pi} \left[ \int^{\pi}_{0} f(t)dt +\int^{2\pi}_{\pi} f(t)dt \right]

a_0 = \frac{1}{2\pi} \left[ Q -Q\right] = 0

Note that the integral calculates the area under the curve. So the integral from 0 to π gives us Q, and the integral from π to 2π gives us -Q. We don’t have to do the full calculation to see that Q and -Q will cancel each other out, resulting in 0.

Let’s calculate the first factor a1 for the cosine approximation over the interval from 0 to 2π.

a_1 = \frac{1}{\pi} \int^{2\pi}_{0} f(t) cos(\frac{1 \pi}{\pi} t)dt

Looking at the plot, we can see that the areas under the curve and above the curve will cancel each other out in the areas from 0 to π and π to 2π. If you resolve the following integrals, you’ll see that they both yield 0.

a_1 = \frac{1}{\pi} \left[ \int^{\pi}_{0} f(t) cos( t)dt +\int^{2\pi}_{\pi} f(t) cos( t)dt \right ]

= \frac{1}{\pi} \left[ 0 + 0 \right ] = 0

Next, we calculate the factor b1 for the sine approximation from 0 to 2π.

b_1 = \frac{1}{\pi} \int^{2\pi}_{0} f(t) sin(\frac{1 \pi}{\pi} t)dt

If we take the integrals again in two steps, from 0 to π and from π to 2π, we get two areas (Q' and -Q') that do not cancel each other.

b_1 = \frac{1}{\pi} \left[ \int^{\pi}_{0} Q' sin( t)dt +\int^{2\pi}_{\pi} (-Q')sin( t)dt \right ]

If you evaluate the integrals in these intervals, you should get the following result.

b_1 = \frac{1}{\pi} \left[ 2Q' + 2Q' \right] = \frac{4Q'}{\pi}

As expected and as we’ve seen above in our naive approximation, the cosine terms disappear, and the sine-containing terms are retained because sine does a better job at approximating the function in this case.

You can continue adding terms of increasing frequency to the series, which should get you closer and closer to the actual function.
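We can watch this convergence numerically. The sketch below sums the surviving sine terms of a unit-amplitude square wave (so b_n = 4/(nπ) for odd n, following the b_1 derivation above with Q' = 1) and shows the approximation error shrinking as N grows:

```python
import numpy as np

# Square wave with period 2*pi: +1 on (0, pi), -1 on (pi, 2*pi).
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
square = np.where(t < np.pi, 1.0, -1.0)

def partial_sum(N):
    """Fourier partial sum: only odd sine terms survive, with b_n = 4/(n*pi)."""
    y = np.zeros_like(t)
    for n in range(1, N + 1, 2):
        y += 4 / (n * np.pi) * np.sin(n * t)
    return y

for N in (1, 9, 49):
    err = np.mean(np.abs(partial_sum(N) - square))
    print(N, round(err, 3))  # the error shrinks as N grows
```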

Here is a little illustration of what the approximation should look like for N=10 and N=50.

N=10

N=50

I won’t continue doing the expansion by hand here, but you are welcome to do so yourself. Here is a great video series that goes through several steps of approximating the square wave. A large part of my explanation is based on this video series.
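If you want to reproduce these plots yourself, the following numpy sketch builds the partial sums. It uses the result (which you can verify by continuing the expansion) that for a unit-amplitude square wave all cosine terms vanish and only the odd sine terms survive, with b_n = 4/(nπ):

```python
import numpy as np

def square_wave_partial_sum(t, N):
    """Partial Fourier sum of the unit square wave up to frequency N.
    All cosine terms vanish; only odd sine terms survive with b_n = 4/(n*pi)."""
    y = np.zeros_like(t, dtype=float)
    for n in range(1, N + 1, 2):   # only odd harmonics contribute
        y += 4 / (n * np.pi) * np.sin(n * t)
    return y

t = np.linspace(0, 2 * np.pi, 1000)
approx_10 = square_wave_partial_sum(t, 10)
approx_50 = square_wave_partial_sum(t, 50)
```

Plotting approx_10 and approx_50 against t reproduces the N=10 and N=50 figures; at t = π/2 the N=50 sum is already within about two percent of the true value 1.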

One of the deeper mathematical reasons why the Fourier transform works is that the basis functions it relies on (sine and cosine) form an orthogonal basis. Once you’ve projected a function onto a new vector basis, you can represent it as a linear combination of the basis vectors. So the Fourier transform is basically a vector projection onto the orthogonal basis formed by sine and cosine.

If your memory of vector projections and orthogonal bases is a bit rusty or you don’t understand why this is significant, check out this post on vector projections I wrote.

We can show that the Fourier basis is indeed orthogonal by taking a Fourier series with period 1 and showing that the integral of the product of two different basis functions over one period equals zero. In linear algebra, this is equivalent to taking the dot product between two orthogonal vectors, which also equals 0.

f(t)=a_0+\sum_{n=1}^{N}a_n\cos(2\pi nt)+b_n\sin(2\pi nt)

\int_{[0,1]}\cos(2\pi nt)\sin(2\pi mt)dt=0

If you calculate the integral, you should see that it equals zero for any integers n and m. For products of two sines or two cosines, the corresponding integrals vanish whenever n is not equal to m.
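We can check this numerically. The sketch below integrates the product of two basis functions over one period (period 1, matching the formulas above) with a trapezoidal rule; n = 2 and m = 3 are arbitrary choices:

```python
import numpy as np

def integrate(y, t):
    # trapezoidal rule on a uniform grid
    return np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2

t = np.linspace(0, 1, 200001)
n, m = 2, 3

# "Dot product" of two different basis functions: zero
inner = integrate(np.cos(2 * np.pi * n * t) * np.sin(2 * np.pi * m * t), t)

# "Dot product" of a basis function with itself: nonzero (1/2)
same = integrate(np.cos(2 * np.pi * n * t) ** 2, t)
```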

In the previous section, we worked with the Fourier series in the real number domain. With Euler’s formula

e^{int} = \cos(nt) + i\,\sin(nt)

we can express the Fourier series much more succinctly using complex notation. Note that the sum now runs over negative as well as positive indices: the complex coefficients c_n for positive and negative n together encode the real coefficients a_n and b_n, and the n = 0 term plays the role of a_0.

y(t) = \sum_{n=1}^{N} \left[ a_n \cos(nt) + b_n \sin(nt) \right]

y(t) = \sum_{n=-N}^{N} c_n e^{int}

Or, if we use our initial expression of the Fourier series:

y(t) = \sum_{n=-N}^{N} c_n e^{ in \omega t } \; where \; \omega = \frac{2\pi}{L}

We’ve established before that the basis functions of the Fourier series form an orthogonal basis. Now we’ve managed to summarize the basis function in one complex coefficient. Accordingly, the complex expression e^{ in \omega t } represents a set of orthogonal basis functions.

Much like above, we can prove this by calculating the inner product between e^{ in \omega t } and the complex conjugate e^{ -im \omega t }. It should equal zero whenever n is not equal to m. We integrate over the interval from -L/2 to L/2 because this gives us a nice symmetric expression.

\frac{1}{L} \int^{L/2}_{-L/2} e^{ in \omega t } e^{ -im \omega t } dt = 0 \;where \;m\neq n
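Numerically, the same check works for the complex basis. In this sketch, L = 2, n = 1, and m = 3 are arbitrary stand-in values; the cross term vanishes, while pairing a basis function with its own conjugate gives 1:

```python
import numpy as np

def integrate(y, t):
    # trapezoidal rule on a uniform grid (works for complex values too)
    return np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2

L = 2.0
w = 2 * np.pi / L
t = np.linspace(-L / 2, L / 2, 200001)
n, m = 1, 3

cross = integrate(np.exp(1j * n * w * t) * np.exp(-1j * m * w * t), t) / L  # expected: 0
same = integrate(np.exp(1j * n * w * t) * np.exp(-1j * n * w * t), t) / L   # expected: 1
```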

If we think about the Fourier series as a vector projection, this orthogonality relationship allows us to express the coefficient c_m as a vector projection of y(t) onto the orthogonal basis vectors contained in e^{ in \omega t } .

c_m = \frac{\langle y(t),\, e^{ im \omega t } \rangle}{\langle e^{ im \omega t },\, e^{ im \omega t } \rangle} = \frac{\frac{1}{L} \int^{L/2}_{-L/2} y(t) e^{ -im \omega t } dt}{\frac{1}{L} \int^{L/2}_{-L/2} e^{ im \omega t } e^{ -im \omega t } dt }= \frac{1}{L} \int^{L/2}_{-L/2} y(t) e^{ -im \omega t } dt

Again, if the concept of vector projections and why this works is unclear, check out the post on vector projections.

I also recommend watching the following video, which offers a more in-depth explanation of the same concepts.

We’ve done a lot of groundwork in the preceding sections. Getting to the Fourier transform from the Fourier series is now just a small step. The Fourier series can only be used to approximate periodic functions and translate them from the time domain into the frequency domain. But how do we translate non-periodic functions into the frequency domain?

Here is the gist: If we let the period of a periodic function tend towards infinity, we can assume that even an aperiodic function ends at some undefined point and repeats itself. Thus, the function becomes periodic in the limit. Accordingly, we can represent aperiodic functions using a Fourier series with an infinite period. As a consequence, ω becomes an infinitesimally small quantity dω.

L \rightarrow \infty, \;\;\omega =\frac{2\pi}{L} \;therefore\;\omega \rightarrow d\omega

Since the frequency ω becomes infinitesimally small, the discrete frequencies turn into a continuous frequency spectrum. Therefore, we need an integral over the interval (-\infty, \infty).

The Fourier transform, which transforms a function y(t) from the original domain into the frequency domain, can now be expressed as follows:

\hat y(\omega) = F(y(t)) = \int^{\infty}_{-\infty} y(t) e^{-i\omega t} dt

For transforming a function from the frequency domain back to the original domain, we can use the inverse Fourier transform.

y(t) = F^{-1}(\hat y(\omega)) = \frac{1}{2 \pi} \int^{\infty}_{-\infty} \hat y(\omega) e^{i \omega t} d\omega
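To see the transform in action, we can approximate the forward integral numerically. The Gaussian y(t) = e^{-t^2/2} is a convenient test case (my own example, not part of the derivation above), since its transform under this convention is known to be \sqrt{2\pi}\,e^{-\omega^2/2}:

```python
import numpy as np

def integrate(y, t):
    # trapezoidal rule on a uniform grid (works for complex values too)
    return np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2

t = np.linspace(-20, 20, 400001)  # wide enough that the Gaussian tails are negligible
y = np.exp(-t ** 2 / 2)

def fourier_transform(y, t, w):
    # \hat y(w) = integral of y(t) * exp(-i*w*t) dt, truncated to the grid
    return integrate(y * np.exp(-1j * w * t), t)

w = 1.0
approx = fourier_transform(y, t, w)
exact = np.sqrt(2 * np.pi) * np.exp(-w ** 2 / 2)
```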

That’s it, we have finally arrived at the Fourier Transform. I know that the equations can be intimidating. I have a hard time myself wrapping my head around some of the concepts. If you find a mistake or something that is unclear in my explanation, make sure to let me know in the comments. I’ll try to address it as soon as possible.


In this post, we will develop a practical understanding of deep learning for image segmentation by building a UNet in TensorFlow and using it to segment images.

**Image Segmentation is a technique in digital image processing that describes the process of partitioning an image into sections. Often the goal is to identify and extract specific shapes of interest in an image.**

When extracting shapes from images, we distinguish between instance segmentation and semantic segmentation.

**Semantic segmentation attempts to find all objects of a certain category while treating them as one entity within the image. For example, a program that identifies the shapes of several dogs within an image without distinguishing between individual dogs is doing semantic segmentation.**

**Instance segmentation attempts to extract the shape of objects from an image while distinguishing between individual instances of multiple objects within the same category. A program that identifies several dogs within an image and treats each dog as a separate instance instead of as a part of one entity is performing instance segmentation.**

Neural networks have emerged as one of the most promising techniques in image segmentation. In the following tutorial, we will train a neural network to perform semantic segmentation by distinguishing between background, object boundary, and object of interest.

**The UNet is a convolutional neural network architecture developed for image segmentation. The first part of the network consists of an encoder that reduces the spatial information while extracting feature information through a series of convolutional layers and ReLU activation functions. The encoder is followed by a decoder that constructs a mask of the objects from the encoding in the original size of the input image.**

UNet was created by Olaf Ronneberger, Philipp Fischer, and Thomas Brox at the University of Freiburg in Germany.

To follow the tutorial, you’ll need to open a Jupyter Notebook and have TensorFlow installed. If you don’t already have these tools installed, I suggest creating a new virtual environment with anaconda or pip and installing the **jupyter-notebook, tensorflow, numpy, matplotlib, and scikit-image packages** into that environment (the os and random modules are part of the Python standard library).

The TensorFlow installation will automatically include Keras. I’ve used Keras 2.7.0 when writing this tutorial. If you use another version, you might run into errors. To be on the safe side, I suggest running “pip install keras==2.7.0” before you continue with the tutorial.

As an alternative, you can check out Google Colab. Colab is especially interesting if you don’t have a GPU available for training on your local machine. Up to a certain time limit, you can train on GPU for free on Colab.

Next, you can import these packages as follows.

```python
import os
import random

import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.image import load_img, array_to_img, img_to_array
from keras.models import Model, load_model
from keras.layers import Input, Activation, SeparableConv2D, BatchNormalization, UpSampling2D
from keras.layers.core import Dropout, Lambda
from keras.layers.convolutional import Conv2D, Conv2DTranspose
from keras.layers.pooling import MaxPooling2D
from keras.layers.merge import concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint
from skimage.io import imshow
import matplotlib.pyplot as plt
%matplotlib inline
```

We will rely on the Oxford Pets dataset to build our model. You’ll find the data here (on Kaggle). Hit the download button and move the folder into your working directory (you probably will have to sign in to Kaggle before you can download). Inside you’ll find several nested folders. We need the “images” folder that should contain 7390 jpg images of various pets and the “trimaps” folder that contains the segmentation masks corresponding to the pet images.

I’ve arranged the data folders as follows in my working directory:

```python
input_dir = 'pets/pets_images'
target_dir = 'pets/pets_annotations/trimaps'
```

Next, we create a sorted list with all the image paths so that we can import the images and their corresponding segmentation masks in parallel.

```python
input_img_paths = sorted(
    [
        os.path.join(input_dir, fname)
        for fname in os.listdir(input_dir)
        if fname.endswith(".jpg")
    ]
)
target_img_paths = sorted(
    [
        os.path.join(target_dir, fname)
        for fname in os.listdir(target_dir)
        if fname.endswith(".png") and not fname.startswith(".")
    ]
)
```

Before we can feed the data to the neural network, we need to ensure the images all have the same size. Therefore, we define a standard size of 256×256. The pictures of the animals are standard RGB images and have three channels, while the masks only have one channel. We are setting the segmentation task up as a classification problem in which the network should learn to classify the pixels into one of 3 classes (background, object boundary, object).

```python
IMG_WIDTH_HEIGHT = 256
IMG_CHANNELS = 3
classes = 3
```

We define two empty vectors of zeros that we then fill with the images and masks as we import and resize them.

```python
X = np.zeros((len(input_img_paths), IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT, 3), dtype=np.float32)
Y = np.zeros((len(input_img_paths), IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT, 1), dtype=np.uint8)

for i in range(len(input_img_paths)):
    img_path = input_img_paths[i]
    img = load_img(img_path, grayscale=False, target_size=(IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT))
    img = img_to_array(img)
    X[i] = img.astype('float32') / 255.0

    mask_path = target_img_paths[i]
    mask = load_img(mask_path, grayscale=True, target_size=(IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT))
    mask = img_to_array(mask)
    Y[i] = mask
```

Note that we divide the image data by 255. We do this because pixel values in a standard RGB image range from 0-255. To enable the network to learn faster, we want all the values to be on a scale between 0 and 1. To represent the many fractional values between 0 and 1, we convert the values to 32-bit floating-point numbers. In the case of the mask, we only have three values (0, 1, and 2). These correspond nicely to the three classes between which we want to distinguish. Accordingly, no division by 255 is necessary, and 8-bit integer encoding is sufficient.

As the last step in our data preparation, we split the data into a training and a testing set using 80% of the samples for training and 20% for testing.

```python
train_test_split = int(len(input_img_paths) * 0.8)
X_train = X[:train_test_split]
Y_train = Y[:train_test_split]
X_test = X[train_test_split:]
Y_test = Y[train_test_split:]
```

As a sanity check, let’s visualize a randomly selected sample and its corresponding mask from the training data.

```python
i = random.randint(0, len(X_train) - 1)
print(i)
imshow(X_train[i])
plt.show()
imshow(np.squeeze(Y_train[i]))
plt.show()
```

If we have a look at the UNet architecture diagram above, we see that the contracting path follows a pattern. There are always two convolutional layers with ReLU activation functions followed by a Max Pooling layer that reduces the vertical and horizontal image dimensions by half while expanding the third dimension.

We can implement this functionality in a Python function that takes the feature map from the previous layer as an input. If it is the first layer, we pass in the image. Furthermore, we pass the number of filters (you might want to change the number of filters if the dimensions of your input change). The dropout probability allows us to implement an additional dropout layer for regularization. Lastly, we have a boolean flag indicating whether we want to apply max pooling.

```python
def convolutional_block(inputs=None, n_filters=32, dropout_prob=0, max_pooling=True):
    conv = Conv2D(n_filters, kernel_size=3, activation='relu', padding='same',
                  kernel_initializer=tf.keras.initializers.HeNormal())(inputs)
    conv = Conv2D(n_filters, kernel_size=3, activation='relu', padding='same',
                  kernel_initializer=tf.keras.initializers.HeNormal())(conv)
    if dropout_prob > 0:
        conv = Dropout(dropout_prob)(conv)
    if max_pooling:
        next_layer = MaxPooling2D(pool_size=(2, 2))(conv)
    else:
        next_layer = conv
    # conv = BatchNormalization()(conv)
    skip_connection = conv
    return next_layer, skip_connection
```

Inside the function, we have two convolutional layers with a kernel size of 3. Then we optionally activate the dropout layer and the max-pooling layer depending on the parameters we’ve passed to it. The max-pooling layer uses a kernel size of 2×2, which halves the feature map along the horizontal and vertical dimensions. In a UNet, we employ skip connections, which means that we want to preserve the feature map before pooling and pass it to the corresponding upsampling block. Accordingly, we save the feature map, without the pooling applied to it, to a variable called skip_connection.

Similarly, we can summarize the steps in the expansive part in a function. The expansive part performs a series of upsampling operations that basically reverse the convolutions and reconstruct the original image. As input, we pass the output from the preceding layer and the skip connection from the corresponding downsampling block. The horizontal arrows in the diagram nicely illustrate how the skip connections travel. Lastly, we also pass the number of filters.

```python
def upsampling_block(expansive_input, contractive_input, n_filters=32):
    up = Conv2DTranspose(n_filters, kernel_size=3, strides=(2, 2), padding='same')(expansive_input)
    merge = concatenate([up, contractive_input], axis=3)
    conv = Conv2D(n_filters, kernel_size=3, activation='relu', padding='same',
                  kernel_initializer=tf.keras.initializers.HeNormal())(merge)
    conv = Conv2D(n_filters, kernel_size=3, activation='relu', padding='same',
                  kernel_initializer=tf.keras.initializers.HeNormal())(conv)
    return conv
```

First, we perform the inverse of the convolutions and max-pooling operations on the input we receive from the preceding block using the Conv2DTranspose layer with a filter size of 3×3 and a stride of 2×2. Now, the input should have the same dimensions as the skip connection we received from the contractive block. This means that we can concatenate them before pushing the combined output through 2 convolutional layers to extract the features for the final mask we want to produce.

Lastly, we can construct our UNet in a new function using the functions defined above. As input, we require the dimensions of the input image, along with the number of filters and the number of classes we want to distinguish between.

```python
def unet_model(input_size=(IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT, IMG_CHANNELS), n_filters=32, n_classes=3):
    inputs = Input(input_size)

    # contracting path
    cblock1 = convolutional_block(inputs, n_filters)
    cblock2 = convolutional_block(cblock1[0], 2 * n_filters)
    cblock3 = convolutional_block(cblock2[0], 4 * n_filters)
    cblock4 = convolutional_block(cblock3[0], 8 * n_filters, dropout_prob=0.2)
    cblock5 = convolutional_block(cblock4[0], 16 * n_filters, dropout_prob=0.2, max_pooling=None)

    # expanding path
    ublock6 = upsampling_block(cblock5[0], cblock4[1], 8 * n_filters)
    ublock7 = upsampling_block(ublock6, cblock3[1], 4 * n_filters)
    ublock8 = upsampling_block(ublock7, cblock2[1], 2 * n_filters)
    ublock9 = upsampling_block(ublock8, cblock1[1], n_filters)

    conv9 = Conv2D(n_classes, 1, activation='relu', padding='same',
                   kernel_initializer='he_normal')(ublock9)
    # conv10 = Conv2D(n_classes, kernel_size=1, padding='same', activation='softmax')(conv9)
    conv10 = Activation('softmax')(conv9)

    model = tf.keras.Model(inputs=inputs, outputs=conv10)
    return model
```

The standard UNet architecture cuts the horizontal and vertical dimensions in half while doubling the filter dimensions in every contractive block. In addition, we apply dropout to the last two blocks and suspend pooling in the last block since no further dimensionality reduction is required according to the architecture diagram. The upsampling blocks should be self-explanatory.

After the upsampling, we use a convolutional layer with a 1×1 kernel size and a softmax activation function to produce the classification of the pixels into one of the three classes. If we only had two classes you could also use the sigmoid activation function.

Lastly, we need to combine everything into a Keras model. Thanks to the Keras functional API, this is as simple as passing the input and the expected output.

Let’s initialize the network and look at the summary as a sanity check. You should get a nice summary that approximately looks like the following. Note that I’ve only printed the beginning and that your layer numbers will likely be different. The most important thing is that the dimensions in the output shape match your expectations.

```python
unet = unet_model((IMG_WIDTH_HEIGHT, IMG_WIDTH_HEIGHT, IMG_CHANNELS), n_classes=3)
unet.summary()
```

We are finally ready to compile our model and start to train it.

```python
EPOCHS = 30
unet.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
earlystopper = EarlyStopping(patience=5, verbose=1)
model_history = unet.fit(X_train, Y_train, validation_split=0.1,
                         batch_size=16, epochs=EPOCHS, callbacks=[earlystopper])
```

When compiling, make sure you use sparse categorical cross-entropy. If we had a binary classification problem, we could also use binary cross-entropy. Adam is generally a good default optimizer, but feel free to play around with others. I decided to train for 30 epochs with an early stopping callback in case the accuracy stops improving earlier. My validation accuracy tops out at approximately 83%.

That is enough for our purpose. Now, we can use the model to create segmentation masks for our test data.

```python
predictions = unet.predict(X_test)
```

As a sanity check and to see how our hard work has paid off, we can visualize a random image, its corresponding mask, and the prediction generated by the model.

```python
i = 1
imshow(X_test[i])
plt.show()
imshow(np.squeeze(Y_test[i]))
plt.show()
imshow(np.squeeze(predictions[i]))
plt.show()
```

**An autoencoder is a neural network trained to compress its input and recreate the original input from the compressed data. This procedure is useful in applications such as dimensionality reduction or file compression where we want to store a version of our data that is more memory efficient or reconstruct a version that is less noisy than the original data.**

Autoencoders consist of two main components to achieve the goal of reconstructing data:

- an encoder network that learns to create a compressed or encoded version of the input data.
- a decoder network that is trained to reconstruct the original data from the compressed version.

Mathematically, we can describe the encoder as a function f that takes an input x to create an encoding h.

h = f(x)

The decoder represents another function, d, that takes h as an input to produce a reconstruction r.

r = d(h)

Essentially, the autoencoder is trained to learn the following mapping:

r = d(f(x))

As you can see, the autoencoder does not learn to exactly reconstruct the original input x but r, which is an approximation to x.

An important feature of autoencoders is that they are unable to learn the identity function and reconstruct the original input perfectly. Instead, they are designed in a way that only allows for approximate recreations.

If we allowed an autoencoder to reconstruct the original input perfectly, it might default to just learning the identity function. Then the encoding would look exactly like the input, making the autoencoder essentially useless.

The hidden layers have fewer nodes than the input and output layers to prevent the networks from learning the identity function. The number of nodes in the output layer equals the number of nodes in the input layer to allow for a reconstruction that is similar to the original.

Reducing the number of hidden nodes available to fewer than the input forces the autoencoder to prioritize the features it wants to preserve. These features need to be useful for reconstruction, which induces the network to focus on useful properties and ignore the noise.

Let’s say we wanted to build an autoencoder to encode and reconstruct an image of 20×20 = 400 pixels. Since we don’t use convolutions, we require 400 neurons in the input and output layers. If we equipped the hidden layers with 400 neurons as well, the autoencoder could simply pass through the single pixels without learning anything about the relationship between the pixels. Let’s cut the number of neurons in the hidden layers down to 200. Now, the network is forced to learn a useful representation of the 400 pixels it receives from the input layer that can be processed by 200 neurons and which allows for a reconstruction of the full 400 pixels that is as accurate as possible.
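To make the idea concrete, here is a minimal numpy sketch of an undercomplete *linear* autoencoder (a deliberately simplified stand-in for the 400/200-neuron example; the sizes, learning rate, and iteration count are made up). The toy data lies on a 2-D subspace, so a 2-unit bottleneck suffices for a near-perfect reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 four-dimensional points that actually lie on a 2-D subspace
latent = rng.standard_normal((100, 2))
mixing = rng.standard_normal((2, 4))
X = latent @ mixing

# Undercomplete linear autoencoder: encoder (4 -> 2), decoder (2 -> 4)
W_enc = rng.standard_normal((4, 2)) * 0.1
W_dec = rng.standard_normal((2, 4)) * 0.1
lr = 0.02

losses = []
for _ in range(8000):
    H = X @ W_enc                # encoding h = f(x)
    R = H @ W_dec                # reconstruction r = d(h)
    err = R - X
    losses.append((err ** 2).mean())
    grad_R = 2 * err / err.size  # gradient of the mean squared error w.r.t. R
    g_dec = H.T @ grad_R
    g_enc = X.T @ (grad_R @ W_dec.T)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, the reconstruction loss drops to a small fraction of its starting value, even though the bottleneck has only half as many units as the input.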

We call networks, which do not have the capacity to learn the full mapping, under-complete. As we discuss later in this article, there are other ways to force autoencoders to not learn the full mapping.

Compared to deterministic methods for data compression, autoencoders are learned. This means they rely on features that are specific to the data that the autoencoder has been trained on. An autoencoder will perform poorly with data it hasn’t been trained on.

**A variational autoencoder, in contrast to an autoencoder, assumes that the data has been generated according to some probability distribution. The variational autoencoder tries to model that distribution and find the underlying parameters. Learning a probability distribution allows variational autoencoders to not only reconstruct data but to generate new data on the basis of that distribution.**

To build a variational autoencoder in practice, we still encode the data using an encoder network. But instead of creating an encoding straight from the output of the encoder, we use the output to parameterize a Gaussian distribution. To generate the encoding, we sample the data from that distribution. The decoder then learns to recreate the image from the encoding. Due to the stochasticity inherent in the sampling process and thus in the encoding, the reconstruction will be similar but different from the original. For example, if you feed an image of a dog to the autoencoder, the reconstruction should be a different dog.

Mathematically, instead of defining the encoder as a function mapping x to a lower-dimensional encoding h, we define it as a conditional probability distribution that produces h given x.

p_{encoder} (h |x)

Similarly, we define the decoder as producing r given the encoding h.

p_{decoder}(r|h)

In practice, you

- feed your input x (such as an image) to the encoder
- train your encoder to produce μ and σ, the parameters of a Gaussian distribution. A Gaussian distribution is parameterized by a mean (μ) and a standard deviation (σ)
- randomly sample a term ε from a standard Gaussian distribution (with mean 0 and standard deviation 1)

\epsilon \sim N(0, 1)

- create the latent encoding vector h by multiplying σ with the randomly sampled ε and adding it to the mean μ. This creates the stochasticity that enables the variational autoencoder to generate new data. Note that there are different versions of this formula.

h = \mu + \sigma \odot \epsilon

- Lastly, you pass h to the decoder which produces r in an attempt to recreate x.
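In code, the sampling step (the so-called reparameterization trick) looks like the following numpy sketch; μ and σ here are stand-in values rather than the outputs of a trained encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, rng):
    # h = mu + sigma * eps, with eps ~ N(0, 1) sampled independently of mu and sigma
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.array([1.0, -2.0])    # stand-in encoder outputs
sigma = np.array([0.5, 0.1])
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
```

Averaging many samples recovers μ, and their spread recovers σ; the randomness in ε is what makes each decoded reconstruction a little different.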

Just as other neural networks, autoencoders use cost functions to measure the reconstruction loss and reduce it through backpropagation with gradient descent.

The goal of the autoencoder is to reduce the loss between the original input x and the reconstruction r that has been generated by the encoder f and the decoder d.

L(x, r) = L(x, d(f(x)))

Several loss functions are applicable to variational autoencoders.

The mean squared error usually works pretty well in vanilla autoencoders. It measures the squared distances between an observation in x and in r.

L(x_i,r_i) = || x_i -r_i||^2

The cost across the entire dataset is the sum of the squared distances between the observations in x and r.

J(x, r) = \frac{1}{n}\sum^n_{i=1} (x_i - r_i)^2

If your data is binary or at least in the range [0,1], you can also use binary cross-entropy in vanilla autoencoders.

J(x, r) = - \sum^n_{i=1} \left[ x_i \log(r_i) + (1-x_i) \log(1-r_i) \right]

When training variational autoencoders, we do not aim to reconstruct the original input. Instead, the goal is to generate a new output that is reasonably similar to the input but nevertheless different. Accordingly, we need a way to quantify the degree of similarity rather than the exact difference. We can achieve this by measuring the difference between the probability distributions underlying the data generating process of the input and the output.

As the name implies, the Kullback-Leibler divergence measures the divergence between two probability distributions, P and Q.

D_{KL}(P || Q)

In a discrete data setting, the Kullback-Leibler divergence is calculated as follows:

D_{KL}(P||Q) = \sum_{i=1}^n P(x_i) \log\left(\frac{P(x_i)}{Q(x_i)}\right)

P(x) represents the distribution of the encoder p_{encoder} (h |x) that projects our input variable x into latent space to create the encoding h. As discussed in the section above, when training a variational autoencoder, we let the encoder produce the parameters σ and μ of a normal distribution.

The standard normal distribution is our prior distribution p(h) if we don’t know anything about the input data. To quantify the loss, we measure the divergence of p_{encoder} (h |x) from a standard normal distribution using the Kullback-Leibler divergence.

D_{KL}(p_{encoder} (h |x) || p(h)) = D_{KL}(N(\mu, \sigma) || N(0,1))

In practice, we plug the values generated by the encoder into a derivation of the Kullback Leibler divergence for Gaussian distributions.

D_{KL}(p_{encoder} (h |x) || p(h)) =

\frac{1}{2}\left[-\sum_i\left(\log\sigma_i^2 + 1\right) + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]

The derivation goes beyond the scope of this post. If you are interested in it, here is the derivation in a post on Stats Stackexchange.

To calculate the full variational autoencoder loss, we calculate the loss between the original input and the reconstruction. Then we add the Kullback Leibler divergence between the probability distributions underlying the data generating process of the input and the output.

L = \frac{1}{n}\sum^n_{i=1} (x_i - r_i)^2 + \frac{1}{2}\left[-\sum_i\left(\log\sigma_i^2 + 1\right) + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]
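We can verify the closed-form Gaussian KL term against a direct Monte Carlo estimate of D_KL(N(μ, σ) || N(0, 1)) for a single latent dimension (the values of μ and σ are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.5, 0.8

# Closed-form KL term from the loss above, for one latent dimension
kl_closed = 0.5 * (-(np.log(sigma ** 2) + 1) + sigma ** 2 + mu ** 2)

# Monte Carlo estimate: average of log p(h) - log q(h) under h ~ N(mu, sigma)
h = mu + sigma * rng.standard_normal(500000)
log_p = -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (h - mu) ** 2 / (2 * sigma ** 2)
log_q = -0.5 * np.log(2 * np.pi) - h ** 2 / 2
kl_mc = np.mean(log_p - log_q)
```

The two estimates agree to within Monte Carlo noise, which is a quick way to convince yourself that the closed form is correct.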

Autoencoders whose hidden layers have at least as many neurons as the input layer have too much capacity and are known as over-complete. As previously stated, we usually design autoencoders to be under-complete. By giving the hidden layers fewer neurons and thus less capacity, we prevent the autoencoder from simply copying the input to the output without learning anything about the underlying structure in the data.

In addition to the number of neurons in the hidden layers, the capacity of an autoencoder also depends on the number of hidden layers or depth of the network. Designing the capacity of autoencoders and variational autoencoders to fit the complexity of the data distribution is one of the major challenges in training those networks.

Regularization techniques can help us prevent an autoencoder from simply copying its input to its output, even in the over-complete case. Thus, regularization allows us to add more capacity to models by increasing their depth and breadth while enabling them to learn useful mappings.

Note that variational autoencoders are trained to maximize a probability distribution given the input data rather than learning to reconstruct the data directly. Remember that the generation of the encodings involves a random term. Accordingly, a variational autoencoder cannot learn a deterministic identity function, which is why regularization is not necessary.

Sparsity in the context of neural networks means that we deactivate neurons and thus prevent them from transmitting an impulse. The underlying idea is that if we enforce a sparsity constraint on the hidden layers in an autoencoder, it shouldn’t be able to learn the identity function even if the layer is large.

Standard activation functions such as the logistic sigmoid constrain the impulse to a value between 0 and 1. Generally speaking, if the output of a neuron is close to 0, it won’t transmit an impulse, while if it’s close to 1, it will. In a sparse layer, we want many of the neurons to have outputs close to 0 to prevent them from firing.

One way to enforce sparsity on the hidden layer is to penalize the average activations in the hidden layer through the cost function. Using the L1 norm, we can add a regularization term Ω that is a function of the encoding layer h and has the effect of shrinking many of the parameters in h to 0. With the hyperparameter λ, we can control the regularizing effect.

J(x, r) = \frac{1}{n}\sum^n_{i=1} (x_i - r_i)^2 + \lambda \Omega(h)

*If you don’t know why L1 regularization can set parameters equal to 0, check out my post on regularization in machine learning.*

An approach that is described in detail in this Stanford lecture enforces a penalty using the average activations in the hidden layer. The idea is that in a hidden layer you want to make sparse, the average activation \hat p across all neurons should be relatively small because most neurons produce no impulses. So you can set a relatively small target value p and use a measure of the difference as a penalty on the cost function. If many neurons in the hidden layer are active, the average activation will be relatively large, and thus the penalty on the cost is larger as well.
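One common form of this penalty (a sketch with made-up activation values) treats the target p and each unit's measured average activation p̂ as Bernoulli parameters and sums their KL divergence; the penalty is zero when p̂ matches the target and grows as units become more active:

```python
import numpy as np

def sparsity_penalty(p, p_hat):
    # KL(p || p_hat) summed over hidden units, treating activations as Bernoulli parameters
    return np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))

p = 0.05                            # target average activation
p_hat = np.array([0.05, 0.2, 0.5])  # hypothetical measured average activations
penalty = sparsity_penalty(p, p_hat)
```

The first unit matches the target exactly and contributes nothing, while the two overly active units drive the penalty up.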

Denoising autoencoders work by corrupting the input source of the data. The autoencoder is then trained to reconstruct the original data from the corrupted input. For example, you could train an autoencoder to reconstruct clean pictures from photos that have been blurred.

If x is our original input data, and \hat x is the corrupted version of x, the denoising autoencoder is trained to reduce the following loss:

L(x, r) = L(x, d(f(\hat x)))

The ability of neural networks to learn complex mappings can generally be improved either by increasing the number of neurons in the layers or by increasing the number of layers. Since autoencoders consist of feedforward networks, the same reasoning applies to them.

In general, increasing the number of layers and thus making the network deeper as opposed to simply adding more neurons, can dramatically reduce the computational cost and training data required for learning certain functions. If you are interested in why that is the case, check out this paper.

It has also been determined experimentally that deep autoencoders are better at compressing data than shallow ones.

In this post, we will develop a thorough understanding of skip connections and how they help in the training of deep neural networks. Furthermore, we will have a look at ResNet50, a popular architecture based on skip connections.

**In a nutshell, skip connections are connections in deep neural networks that feed the output of a particular layer to later layers in the network that are not directly adjacent to the layer from which the output originated.**

In the following sections, we are going to look at skip connections in detail and understand how they mitigate the problem of vanishing gradients and degrading accuracy that often accompany very deep neural network architectures.

As researchers designed neural networks with more and more layers, the problem of vanishing and exploding gradients became increasingly prominent.

Vanishing and exploding gradients are more problematic in deep networks because, during backpropagation, each additional layer adds values to the multiplication we need to perform to obtain the gradient.

If you don’t know why this is the case, check out my post on vanishing and exploding gradients.

If you multiply several values that are smaller than 1 together, the result will shrink exponentially with every term you add to the multiplication.

If you multiply several terms that are larger than 1 together, the result will grow exponentially with every term you add to the multiplication. The more layers we have, the more multiplications we have to perform.
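The effect is easy to verify numerically; this toy sketch multiplies 20 such factors together:

```python
# Multiplying many factors below or above 1 shrinks or grows the product
# exponentially with each additional factor -- the mechanism behind
# vanishing and exploding gradients.
product_small = 1.0
product_large = 1.0
for _ in range(20):       # think of 20 layers' worth of gradient factors
    product_small *= 0.5  # factors < 1 -> vanishing
    product_large *= 1.5  # factors > 1 -> exploding

print(product_small)  # ≈ 9.5e-07
print(product_large)  # ≈ 3325.26
```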

Furthermore, it has been empirically verified that the performance of very deep neural networks with traditional architectures tends to degrade, and accuracy saturates as the network converges.

To alleviate these problems, a skip connection bypasses one or more layers and their associated operations.

If the output of one of the earlier layers is x_0, a traditional neural network would perform the following operations in the next layers.

z_1 = w_1 x_0 + b_1

x_1 = ReLU(z_1)

z_2 = w_2 x_1 + b_2

x_2 = ReLU(z_2)

With a skip connection, you still perform these operations, but in addition, x_0 also bypasses them and rejoins the output before the second activation function through simple addition.

x_2 = ReLU(z_2 + x_0)

The ReLU activation function outputs the input directly if it is positive. Thus, if z_2 were equal to 0 and we only passed a (positive) x_0 through the ReLU activation, the output would be as follows.

x_2 = ReLU(0 + x_0) = x_0

In other words, the skip connection essentially calculates the identity function of x_0. This fact will come in handy in the next sections when we try to understand why residual networks work so well.

Taken together, these operations form a residual block.
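A toy residual block can be sketched in a few lines of NumPy; this is only an illustration of the equations above, not the actual ResNet implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Toy residual block: two linear layers with ReLU activations, where the
# input x0 skips both layers and is added back just before the final
# activation, exactly as in x_2 = ReLU(z_2 + x_0).
def residual_block(x0, w1, b1, w2, b2):
    z1 = w1 @ x0 + b1
    x1 = relu(z1)
    z2 = w2 @ x1 + b2
    return relu(z2 + x0)  # the skip connection rejoins here

# If the intermediate layers contribute nothing (all-zero weights),
# the block falls back to the identity mapping of x0.
x0 = np.array([1.0, 2.0])
w = np.zeros((2, 2))
b = np.zeros(2)
print(residual_block(x0, w, b, w, b))  # [1. 2.]
```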

Let’s summarize the intermediate calculations in the residual block without skip connections as a function.

F(x)

*Since we don’t need the intermediate x value anymore, x is equivalent to x_0 in the following explanations.*

When we introduce a skip connection that lets x_0 bypass all calculations in F(x), we express this new operation as H(x). Since x is added to the output of F(x), H(x) looks like this:

H(x) = F(x) + x

During the intermediate calculations, F(x) may become very small or otherwise change in a way that degrades the ability of the neural network to learn a complex mapping. In practice, the likelihood that this will happen increases as you add more layers to a network. By adding x, you are adding the identity mapping of the input to F(x). This gives the network a kind of baseline for the function it has to learn. Remember that the basic goal of a neural network is to find a mapping.

x \rightarrow y

In the worst case, when using a skip connection, the network could fall back to using the identity mapping.

x \rightarrow x

In deep networks, this gives the model the ability to ignore the mapping of specific layers. Imagine you built a network with 300 layers, and it turned out that 50 layers are just fine to learn your problem; skip connections and residual blocks would allow your network to bypass the remaining 250 layers.

During backpropagation, we differentiate the loss function with respect to the weights. To get to the weights in earlier layers, we have to differentiate through all the intermediate functions represented in the previous layers. Let’s imagine for a moment that the last activation function in the residual block produces the final output and that we calculate the cost J right afterward.

To propagate the gradient back to the layers before the residual block, we, therefore, need to calculate the gradient of the cost with respect to x.

\frac{\partial J}{\partial x}

Using the chain rule, the full operation with all the intermediate steps to be performed without the skip connection looks like this:

\frac{\partial J}{\partial x_0} = \frac{\partial J}{\partial x_2} \frac{\partial x_2}{\partial z_2} \frac{\partial z_2}{\partial x_1} \frac{\partial x_1}{\partial z_1} \frac{\partial z_1}{\partial x_0}

This is a long series of multiplications that, as mentioned above, makes neural networks susceptible to vanishing and exploding gradients.

If we replace the intermediate calculations by F(x), the gradient calculation now becomes:

\frac{\partial J}{\partial x} = \frac{\partial J}{\partial F(x)} \frac{\partial F(x)}{\partial x}

When adding the skip connections, we introduced the additional function H(x) = F(x) + x. To calculate the gradient of the cost function in a network with skip connections, we now need to differentiate through H(x).

\frac{\partial J}{\partial x} = \frac{\partial J}{\partial H(x)} \frac{\partial H(x)}{\partial x}

The derivative of x with respect to x is equal to 1. Thus, substituting F(x) + x for H(x) gives us the following expression:

\frac{\partial J}{\partial x} = \frac{\partial J}{\partial H(x)} \left( \frac{\partial F(x)}{\partial x} + 1 \right) = \frac{\partial J}{\partial H(x)} \frac{\partial F(x)}{\partial x} + \frac{\partial J}{\partial H(x)}

Even if the gradient of F(x) becomes vanishingly small due to the many multiplications performed when backpropagating through all the layers, we still have the direct gradient of the cost function with respect to H(x).

That way, the network can also skip some of the gradient calculations during backpropagation and prevent the gradient from vanishing or exploding.

Conceptually, the layers in a convolutional neural network learn features in increasing levels of abstraction. Skip connections allow a deep neural network to pass lower-level semantic information unchanged to later layers of the network. This helps preserve information that would have been abstracted away if it had been passed through too many layers.

The primary architectures that build on skip connections are ResNets and DenseNets. After AlexNet, ResNets constituted the next big breakthrough in deep learning for computer vision, winning the ImageNet classification challenge in 2015. As expected, the main advantage of ResNets was that they enabled the construction of much deeper neural networks without running into exploding and vanishing gradients or degradation of accuracy.

The first ResNet architectures ranged from 18 layers to 152 layers. Nowadays, some researchers rely on skip connections to construct networks with more than 1000 layers.

For illustration, let’s have a look at ResNet50, a common ResNet architecture with a total of 50 layers.

The architecture takes a 224x224x3 image as input, which first passes through a 7×7 convolutional layer with 64 filters, followed by a MaxPooling layer with a 3×3 filter. Then it has 16 residual blocks. Each residual block consists of 3 convolutional layers that transform the dimensionality of the data, as indicated in the illustration. The residual blocks form 4 groups. Within a group, the residual blocks have similar layers. For example, in the first group, we have three residual blocks that each consist of 3 convolutional layers that perform the following convolutions.

- 1x1x64
- 3x3x64
- 1x1x256

In the second group, we have 4 residual blocks with 3 convolutional layers that perform the following convolutions:

- 1x1x128
- 3x3x128
- 1x1x512

The third group consists of 6 residual blocks with 3 convolutional layers:

- 1x1x256
- 3x3x256
- 1x1x1024

The last group has 3 blocks, each of which has layers performing the following convolutions:

- 1x1x512
- 3x3x512
- 1x1x2048

Lastly, the architecture performs global average pooling before feeding the feature map to a fully connected layer that distinguishes between the 1000 classes used in the ImageNet database.
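Given this layout, a quick back-of-the-envelope count shows where the name ResNet50 comes from:

```python
# Layer count for ResNet50: one initial convolutional layer, 16 residual
# blocks (3 + 4 + 6 + 3 across the four groups) with 3 convolutional
# layers each, and one fully connected layer at the end.
blocks_per_group = [3, 4, 6, 3]
conv_layers_in_blocks = sum(blocks_per_group) * 3
total_layers = 1 + conv_layers_in_blocks + 1
print(total_layers)  # 50
```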

In this post, we will learn how to build a basic convolutional neural network in TensorFlow and train it to distinguish between cats and dogs. We start off with a simple neural network and gradually work our way towards more complex architectures, evaluating at each step how the results are changing.

To build this network, I’ve used Google Colab, which lets you train your network on a GPU for free.

The dataset we are going to use is called cats and dogs. While this dataset can be downloaded directly from TensorFlow, you will need to learn how to import external data if you are serious about deep learning. One of the best places to find training data for your projects is Kaggle. We will obtain the dataset through Kaggle and download it into Google Colab.

If you do not have a Kaggle account, sign up for one. Don’t worry. It is free. Once you’ve signed up, navigate to your account page. There is a small symbol for your profile in the upper right-hand corner from which you can navigate to your account. Once you are there, scroll down to the “API” section and click “Create New API Token”. This will download a kaggle.json file.

You need to place the kaggle.json file in your working directory. If you are in Colab, you have a section on the right that gives you an overview of your working directory. Just drag and drop the file in there (note that the file will be deleted if you shut down Colab, so you’ll need to re-upload it).

Once the file is in place, you need to install the Kaggle library and move the kaggle.json file into the directory structure required by Kaggle. This can be achieved with the following commands.

! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Next, you download the cats and dogs dataset into your working directory and unzip it.

! kaggle datasets download chetankv/dogs-cats-images
! unzip dogs-cats-images.zip

You should now have a directory called dataset in your working directory. Inside the dataset, we have a training set and a test set, each of which contains folders for the images of cats and dogs.

To read the data for our model, we need to define the paths to the images.

import os

train_dir = '/content/dataset/training_set'
train_dir_cats = os.path.join(train_dir, 'cats')
train_dir_dogs = os.path.join(train_dir, 'dogs')
test_dir = '/content/dataset/test_set'
test_dir_cats = os.path.join(test_dir, 'cats')
test_dir_dogs = os.path.join(test_dir, 'dogs')

To check that we can import the images correctly, we print the number of images in each directory.

print(len(os.listdir(train_dir_cats)))  # 4000
print(len(os.listdir(train_dir_dogs)))  # 4000
print(len(os.listdir(test_dir_cats)))   # 1000
print(len(os.listdir(test_dir_dogs)))   # 1000

Now it is finally time to import the images and prepare them for training. To do that, we will use the Keras Image data generator. Data generators let you import data from directories, as well as manipulate and preprocess it on the fly.

We first define the image data generators for the training and the test data and tell them what preprocessing steps to perform. In this case, we just want to scale the pixel values to a range from 0 – 1. Pixel values are on a scale from 0 – 255, which is why we divide by 255.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_generator = ImageDataGenerator(rescale=1./255)
validation_generator = ImageDataGenerator(rescale=1./255)

Now we can import the data through the generator using the “flow_from_directory” method. We also specify in the function call that we want the generator to resize the images to 300×300 pixels and feed them to the models in batches of 32. Since we distinguish between cats and dogs, we have a binary classification problem.

train_generator = ImageDataGenerator(rescale=1./255)
train_generator = train_generator.flow_from_directory(
    train_dir,
    target_size=(300, 300),
    batch_size=32,
    class_mode='binary')

validation_generator = ImageDataGenerator(rescale=1./255)
validation_generator = validation_generator.flow_from_directory(
    test_dir,
    target_size=(300, 300),
    batch_size=32,
    class_mode='binary')

When constructing a neural network for image classification, we need to gradually transform our images so that the network can ultimately decide between 2 or more classes.

In convolutional neural networks, you’ll usually achieve this by applying a convolutional layer with several filters to extract features, followed by a pooling layer to summarize features. Both convolutional filters and pooling filters can be used to reduce the dimensionality of the image. This gradually transforms the input image into a feature map where the required features become more prevalent.

In many convolutional neural networks, you’ll find a common pattern: several pairs of convolutional layers followed by pooling layers. The number of filters usually increases in the later convolutional layers while the input shrinks along the first two dimensions.

In the last part of the network, the input is flattened, and you’ll usually find a few fully connected layers that lead into the output layer, which performs the final classification. To learn more about common convolutional architectures, check out my introduction to convolutional neural network architectures, as well as my post on deep learning architectures for image classification.

We start out with a simple architecture consisting of 2 pairs of convolutional layers and pooling layers. We use a filter size of 3×3 and a ReLU activation for the convolutional layers and a filter size of 2×2 with max-pooling for the pooling layers. All these settings are pretty standard in modern convolutional architectures.

We increase the number of filters from 32 to 64 before flattening the feature map. Lastly, we have another fully connected layer with 512 neurons which leads directly into our output layer with 2 neurons and a sigmoid activation function since we have a binary classification problem (cats vs. dogs). The layers are tied together using Keras’ sequential API.

import tensorflow as tf
from tensorflow.keras import models, layers, optimizers

model = models.Sequential([
    layers.Conv2D(32, (3,3), activation="relu", input_shape=(300, 300, 3)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation="relu"),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(2, activation="sigmoid")
])

Then we compile the model using the Adam optimizer and sparse categorical cross-entropy as the loss function and sparse categorical accuracy as the metric.

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=optimizers.Adam(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

Using the summary function, we can print the model architecture. This is especially useful to check how the dimensionality of the input changes from layer to layer.

print(model.summary())

After investigating the architecture and making sure that it matches our expectations, we finally get to training the neural network. To do that, we call the fit method on the model passing the training and the validation generator for the datasets. In addition, we also tell the function for how many epochs we want to train and how many steps per epoch to use.

history = model.fit(
    train_generator,
    epochs=20,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=50
)

The training will take a while, depending on what hardware you are running the model on. TensorFlow will print the current progress to the console after each epoch.

When we look at the output, we can clearly see that something is wrong. The training loss steadily decreases, and the training accuracy increases as expected. However, the validation loss increases, while the validation accuracy first climbs to a relatively unimpressive 70% and then starts to decrease.

This looks like the model is overfitting the training data. To get a better overview of what is happening, I’ve defined the following functions to plot the training and validation accuracy.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def get_metrics(history):
    history = history.history
    acc = history['sparse_categorical_accuracy']
    val_acc = history['val_sparse_categorical_accuracy']
    loss = history['loss']
    val_loss = history['val_loss']
    return acc, val_acc, loss, val_loss

def plot_train_eval(history):
    acc, val_acc, loss, val_loss = get_metrics(history)

    acc_plot = pd.DataFrame({"training accuracy": acc, "evaluation accuracy": val_acc})
    acc_plot = sns.lineplot(data=acc_plot)
    acc_plot.set_title('training vs evaluation accuracy')
    acc_plot.set_xlabel('epoch')
    acc_plot.set_ylabel('sparse_categorical_accuracy')
    plt.show()

    loss_plot = pd.DataFrame({"training loss": loss, "evaluation loss": val_loss})
    loss_plot = sns.lineplot(data=loss_plot)
    loss_plot.set_title('training vs evaluation loss')
    loss_plot.set_xlabel('epoch')
    loss_plot.set_ylabel('loss')
    plt.show()

When calling the second function, we get a nice plot.

plot_train_eval(history)

Glancing at the plots, we see immediately that the validation accuracy flatlines right after the first few epochs while the validation loss increases.

Let’s see if we can improve on these results by building a slightly more complicated network. But before we do this, let’s wrap the construction of the first network into a function call so we can easily build it and distinguish it from the other models we are going to build.

def build_model_A():
    model = models.Sequential([
        layers.Conv2D(32, (3,3), activation="relu", input_shape=(300, 300, 3)),
        layers.MaxPooling2D((2,2)),
        layers.Conv2D(64, (3,3), activation="relu"),
        layers.MaxPooling2D((2,2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(2, activation="sigmoid")
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        optimizer=optimizers.Adam(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )
    return model

The network we’ll build next is essentially a copy of the first network, but we’ve duplicated the first two blocks, each consisting of a convolutional layer and a max-pooling layer. This means we have two convolutional layers with 32 filters followed by max-pooling layers and two convolutional layers with 64 filters followed by max-pooling layers.

def build_model_B():
    model = models.Sequential([
        layers.Conv2D(32, (3,3), activation="relu", input_shape=(300, 300, 3)),
        layers.MaxPooling2D((2,2)),
        layers.Conv2D(32, (3,3), activation="relu"),
        layers.MaxPooling2D((2,2)),
        layers.Conv2D(64, (3,3), activation="relu"),
        layers.MaxPooling2D((2,2)),
        layers.Conv2D(64, (3,3), activation="relu"),
        layers.MaxPooling2D((2,2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(2, activation="sigmoid")
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        optimizer=optimizers.Adam(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )
    return model

When we build the model and evaluate its architecture, we see that the last pooling layer right before we flatten the output produces a feature map of only 16x16x64.

model_b = build_model_B()
print(model_b.summary())

We can tell TensorFlow to stop training early if a certain metric such as the validation accuracy doesn’t improve any further. To stop early, we define a callback and pass it to the fit function before we start to train the model.

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_sparse_categorical_accuracy", patience=3)

history_b = model_b.fit(
    train_generator,
    epochs=50,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=[early_stopping]
)

We’ve set the metric we’d like to monitor to validation accuracy with a patience value of 3. The model will stop training if the validation accuracy hasn’t improved for 3 consecutive epochs.

The accuracy has improved by about 5-6 percentage points. It seems we are on the right path.

As the next step, we can add data augmentation to see if it improves our network. When working with images, we can easily multiply our training data by rotating, flipping, or shearing the images. With Keras’ inbuilt data generator, this is easy to implement by simply specifying the transformations we want to perform when instantiating the training data generator.

train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(300, 300),
    batch_size=32,
    class_mode='binary')

We don’t need to perform any augmentation on the validation set since the purpose of augmentation is to provide the network with more variety at training time.

model_b_2 = build_model_B()
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_sparse_categorical_accuracy", patience=3)

history_b_2 = model_b_2.fit(
    train_generator,
    epochs=30,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=[early_stopping]
)

It looks like data augmentation didn’t particularly help increase our validation accuracy or reduce our validation loss. It did decrease our training accuracy, which is probably due to the greater image diversity resulting from the augmentation.

Rather than building and training our convolutional neural network from scratch, we can import an established model architecture that has been pretrained on another dataset. TensorFlow and Keras make this easy.

We import the InceptionResNetV2 from Tensorflow. This is an advanced convolutional architecture that has been pretrained on more than a million images from the ImageNet database to distinguish between 1000 classes. We specify that we want the pretrained weights from imagenet. Furthermore, we need to pass the input shape and tell the network that we don’t need the top. We will attach the last few layers ourselves so that we can use the network for our dataset.

All of this can be achieved in a single line of code, and TensorFlow will download the model weights and instantiate InceptionResNet as a base model for us.

base_model = tf.keras.applications.InceptionResNetV2(weights='imagenet', include_top=False, input_shape=(300, 300, 3))

After instantiating the base model, we set its trainable attribute to False. This tells TensorFlow to train only the new layers we add and leave the pretrained layers untouched. If you have a lot of training data, you may also train the layers of your base model. But for a case as simple as distinguishing between cats and dogs, it would be overkill.

base_model.trainable = False

Now, we can add our own layers and build the model. Essentially, we just want to add the output layer that defines the distinction between two classes instead of 1000 and one or two dense layers to help the network learn. We use the Keras Sequential API and add the layers one by one starting with the layers of the base model. Instead of using a “Flatten” layer, we add a global average pooling layer to transition the output from the format provided by the pretrained model to one that we can use for our fully connected layers.

Global average pooling averages each feature map down to a single value: it collapses the spatial dimensions while keeping the last axis, which usually corresponds to the number of filters.
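A small NumPy sketch of the operation (the shapes here are chosen arbitrarily for illustration):

```python
import numpy as np

# Global average pooling: average a (height, width, channels) feature map
# over its spatial axes, leaving one value per channel (filter).
feature_map = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
pooled = feature_map.mean(axis=(0, 1))
print(feature_map.shape, '->', pooled.shape)  # (2, 2, 3) -> (3,)
print(pooled)  # [4.5 5.5 6.5]
```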

I’ve used global average pooling here to introduce the layer, but you can equally use the “Flatten” layer.

The remaining part of the code is very similar to what we’ve used in the previous models. In the last fully connected layer, I’ve used 128 instead of 512 neurons. This choice is due to experimentation. When building a neural network, you often have to find values such as the number of neurons through trial and error.

def build_model_C(base_model):
    model = tf.keras.models.Sequential([base_model])
    model.add(tf.keras.layers.GlobalAvgPool2D())
    model.add(tf.keras.layers.Dense(128, activation='relu'))
    model.add(tf.keras.layers.Dense(2, activation='sigmoid'))
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        optimizer=optimizers.Adam(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
    )
    return model

We build the model as before.

model_c = build_model_C(base_model)
history_c = model_c.fit(
    train_generator,
    epochs=20,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=50
)

As you may have noticed, I disabled early stopping to train for the entire 20 epochs. But that would not have been necessary. Both training and validation accuracy hit values >99% after just one epoch and stayed there until the end of training. It seems the pretrained weights were already pretty well attuned to distinguishing between cats and dogs.

There are only very few cases where it makes sense to build and train a neural network from scratch. Whenever you have a use case that involves standard images or objects, I recommend using a pretrained architecture. That way, you can get pretty good performance with only a few hundred or even dozens of pictures at much shorter training times.

In this post, we will develop a foundational understanding of deep learning for image classification. Then we will look at the classic neural network architectures that have been used for image processing.

**Image classification in deep learning refers to the process of getting a deep neural network to determine the class of an image on the basis of a set of predefined classes. Usually, this requires the network to detect the presence of certain objects or backgrounds without the need to locate these objects within the image.**

For example, a neural network trained to classify images by the type of pet they contain would be able to detect the presence of a cat in an image and distinguish a cat from other pets.

Early image processing techniques relied on understanding the raw pixels. The fundamental problem with this approach is that the same type of object can vary significantly across images. For example, two pictures might show cars of different colors, photographed from different angles. It is extremely difficult for a method that relies on understanding single pixels to generalize across the large variety inherent in pictures of objects.

Furthermore, it is extremely inefficient to try to understand single pixels without the surrounding context.

This inefficiency and lack of context are some of the main reasons why traditional fully connected networks are inappropriate for most image classification tasks.

To feed an image to a fully connected network, you would have to stack all the pixels in an image in one column.

Individual pixels are taken out of the context of the surrounding pixels and are fed individually to a neuron that will perform a classification. If you had a grayscale image of 512×512 pixels, you would need 512×512 = 262144 neurons just in your first layer to classify every pixel. Since you connect each pixel to each pixel in the next layer, your number of parameters will explode into millions or billions just for a simple image classification model.
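The arithmetic makes the problem obvious:

```python
# Weight count for a single fully connected layer on a 512x512 grayscale
# image, connecting every input pixel to every neuron of an equally
# sized next layer (biases ignored).
pixels = 512 * 512
weights = pixels * pixels
print(pixels)   # 262144 input neurons
print(weights)  # 68719476736 -> roughly 68.7 billion weights
```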

This approach works for simple tasks like distinguishing between 10 handwritten digits. But for anything more complex, you want a method that can incorporate the context, and that can also reuse information across pixels.

Instead of classifying individual pixels directly, convolutional layers slide filters over the image. These filters can understand pixels in the context of the other surrounding pixels to extract features such as edges. The network then only learns to detect the presence of the feature. Non-linear activation functions are used to make detection decisions resulting in a feature map that stores information about the presence of the feature in the image and its correspondence to the filter.

This process entails two main advantages. Firstly, filters are not limited to one group of pixels. They can be reused across the entire image to detect the presence of the same feature in other areas of the image. The result is that parameters are shared and reused.

Secondly, the output of a filter only connects to one field in the parameter map of the next layer as opposed to connecting every node in one layer to every node in the next layer. As a consequence, the number of parameters is dramatically reduced, resulting in sparse connections.

Note that convolutional neural networks do have fully connected layers. You usually find them in the latter part of the network because they ultimately lead into the final classification layer that produces the decision of what class the image belongs to. By the time you reach the later layers, the feature maps produced by the convolutional layers will have reduced the original image to a limited set of features so that a fully connected layer can reasonably be applied to distinguish between those features and decide whether they constitute the desired object.

LeNet is one of the earliest convolutional neural networks. It was trained in the 1990s to identify handwritten digits in the MNIST dataset.

It only consists of 5 layers (counting a pooling and a convolutional layer as one layer) and has been trained to classify grayscale images of size 32×32. Due to the comparatively small number of parameters and simple architecture, it is a great starting point for building an intuitive understanding of convolutional neural networks.

Here is an illustration of the network architecture from the original paper.

Let’s briefly walk through the architecture.

- We use a 32x32x1 image as input. Since the images are grayscale, we only have one channel. The first convolutional layer applies six 5×5 filters. As a result of sliding the six 5×5 filters over the input image, we end up with a feature map of 28x28x6. If you don’t understand why this is the case, I recommend reading my article on convolutional filters first. The convolutional layer uses a tanh activation function.
- After the first convolutional layer we have an average pooling layer with a filter size of 2×2 and a stride of 2. The resulting feature map has dimensions 14x14x6. Again, if you couldn’t follow that reasoning, check out my article on pooling.
- The network has another convolutional layer that applies 16 5×5 filters as well as a tanh activation function resulting in an output map of 10x10x16.
- The convolutional layer is followed by another pooling layer with a filter size of 2×2 and a stride of 2, producing a feature map of 5x5x16.
- We have a final convolutional layer with 120 filters of size 5×5 and a tanh activation resulting in a feature map of 1x1x120, which connects neatly to a fully connected layer of 120 neurons.
- Next, we have another fully connected layer of 84 neurons. The fully connected layers both have tanh activation functions.
- Finally, we have the output layer that consists of 10 neurons and a softmax activation function since we have a multiclass classification problem with 10 different classes (one for each digit).

As you can see, the network gradually reduces the size of the original input image by repeatedly applying a similar series of operations:

- Convolution with a 5×5 filter
- Tanh activation
- Average pooling with a 2×2 filter

These operations result in a series of feature maps until we end up with a 1x1x120 map. This allows us to make the transition to fully connected layers while the rest of the neural network operates like a traditional fully connected network.
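We can trace the spatial dimensions through this stack with a few lines of arithmetic: a 'valid' convolution with an f×f filter shrinks the map by f − 1, and 2×2 pooling with stride 2 halves it.

```python
# Track LeNet's feature-map size through its convolution/pooling stack.
def conv(size, f=5):
    return size - f + 1    # 'valid' convolution with an f x f filter

def pool(size, stride=2):
    return size // stride  # 2x2 average pooling with stride 2

size = 32                  # 32x32 grayscale input
size = pool(conv(size))    # conv -> 28, pool -> 14
size = pool(conv(size))    # conv -> 10, pool -> 5
size = conv(size)          # final conv -> 1 (the 1x1x120 feature map)
print(size)  # 1
```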

Note that the convention of using tanh as an activation function between layers and applying average pooling is not common in modern convolutional architectures. Instead, practitioners tend to use ReLU and max pooling.

AlexNet initiated the deep learning revolution in computer vision by pulverizing the competition in a 2012 computer vision contest. The network is significantly larger than LeNet, which partly explains why there was a gap of more than 20 years between LeNet and AlexNet. In the 1990s, computational power was so limited that training large-scale neural networks wasn’t feasible.

The number of classes AlexNet was able to handle compared to LeNet also increased significantly from a mere 10 to 1000. Consequently, it was also trained on a much larger dataset comprising millions of images.

AlexNet can process full RGB images (with three color channels) at a total size of 227x227x3.

AlexNet relies on similar architectural principles as LeNet. It uses 5 convolutional layers, three of which are followed by max pooling layers, to gradually reduce the size of the feature maps along the x and y axes while increasing the filter dimension. Ultimately, the last feature map connects to a series of 3 fully connected layers, ending in an output layer that distinguishes between 1000 classes using a softmax activation.

Compared to LeNet, the AlexNet architecture also featured some innovations:

- ReLU activation function, which accelerated learning significantly compared to tanh
- Max pooling instead of average pooling
- Overlapping pooling filters to reduce the size of the network, which also resulted in a decrease in error
- Dropout layers after the last pooling layer and the first fully connected layer to improve generalization and reduce overfitting

The full feature map transformation across the layers looks as follows.

For training, the architecture was split between two GPUs, with half of the layers being trained on one GPU and the other half on the other GPU.
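The AlexNet layer stack can likewise be sketched in Keras. This is a single-GPU simplification of the architecture as commonly described (local response normalization is omitted, and the dropout placement follows the list above):

```python
import tensorflow as tf

# A compact AlexNet-style sketch (illustrative, not the original code).
alexnet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 11, strides=4, activation='relu'),       # 227x227x3 -> 55x55x96
    tf.keras.layers.MaxPooling2D(3, strides=2),                         # -> 27x27x96 (overlapping pooling)
    tf.keras.layers.Conv2D(256, 5, padding='same', activation='relu'),  # -> 27x27x256
    tf.keras.layers.MaxPooling2D(3, strides=2),                         # -> 13x13x256
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(3, strides=2),                         # -> 6x6x256
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(1000, activation='softmax'),                  # 1000 classes
])

out = alexnet(tf.zeros((1, 227, 227, 3)))
print(out.shape)  # (1, 1000)
```

Building the model this way also makes the roughly 60 million parameters mentioned below visible via `alexnet.count_params()`.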

While AlexNet marked a quantum leap in the development of convolutional neural networks, it also suffered from drawbacks.

Firstly, the network had approximately 60 million parameters, which made it not only large but also extremely prone to overfitting.

Compared to modern neural network architectures, AlexNet was still relatively shallow. Modern networks achieve great performance through increasing depth (which comes with its own drawbacks, such as exploding and vanishing gradients).

AlexNet applied unusually large convolutional filters of up to 11×11 pixels. Most recent architectures tend to use smaller filter sizes such as 3×3, since most of the useful features are found in clusters of local pixels.

VGG 16 marked the next large advance in the construction of neural network architectures for image processing. Arriving in 2014, it achieved a 92.7% top-5 test accuracy in the same contest that AlexNet had won 2 years prior.

VGG 16 is based on AlexNet, but it made some critical changes. Most notable is the increase in depth to 16 layers and the simplification of the hyperparameter tuning process by using only 3×3 filters.

While there is no definitive answer to the question of why depth improves the performance of neural networks, several theories have been proposed.

The first explanation relies on the fact that neural networks attempt to model highly complex real-world phenomena. For example, when trying to recognize an object in an image, a convolutional neural network assumes that there must be some function that explains how the desired object differs from other objects and attempts to learn this function.

Each layer in a neural network represents a linear function followed by a non-linear activation to express non-linearities in the data. The more of these layers you combine, the more complex the relationships that the network can model.

Another argument holds that more layers force the network to break down the concepts it learns into more granular features. The more granular the features, the more the network can reuse them for other concepts. Thus, depth may improve generalizability. For example, when training a convolutional neural network to recognize cars, one layer in a relatively deep network may only learn abstract shapes such as circles, while a layer in a relatively shallow network may learn higher-level features that are specific to a car, such as its wheels.

For convolutional neural networks, another idea suggests that increasing depth results in a larger receptive field. Since you are stacking multiple small filters on top of each other, where each summarizes the content of the preceding filter, the layers later in the network have a broader, bird’s-eye view of the image. If you used a shallow network, the layers would need larger filters to capture the entire image.

In practice, most neural network architectures rely on 3×3 or 5×5 filter sizes. Larger filter sizes such as the 11×11 used in AlexNet have fallen out of favor.

Firstly, they result in a larger increase in the parameter space. A filter of size 11×11 adds 11×11 = 121 weights, whereas a 3×3 filter only adds 9. Stacking 2 or 3 3×3 filters results in a total of 18 or 27 weights, which is still much less than 121.

Secondly, each filter comes with a non-linear activation function of the corresponding layer. Stacking multiple filters with non-linear activation functions for each one enables the network to learn more complex relationships than a single larger filter with only one activation function.
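These two points can be made concrete with a small calculation. The parameter counts are per filter (biases and input channels ignored), and the receptive-field formula assumes stride-1 convolutions:

```python
# Weights added per filter, ignoring biases and input channels.
single_11x11 = 11 * 11   # one 11x11 filter -> 121 weights
two_3x3 = 2 * 3 * 3      # two stacked 3x3 filters -> 18 weights
three_3x3 = 3 * 3 * 3    # three stacked 3x3 filters -> 27 weights

def receptive_field(filter_sizes):
    """Receptive field of stacked stride-1 convolutions:
    each additional f x f filter widens the field by f - 1 pixels."""
    rf = 1
    for f in filter_sizes:
        rf += f - 1
    return rf

print(receptive_field([3, 3, 3]))  # 7: three 3x3 convs see a 7x7 region
```

So three stacked 3×3 layers cover the same 7×7 region as a single 7×7 filter would, with 27 instead of 49 weights and three non-linearities instead of one.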

The VGG 19 network takes an image of size 224x224x3 as input. The structure of the convolutional layers is the same throughout the network: they all use 3×3 filters and same padding with a stride of 1 to preserve the dimensions of the input.

The image is passed through 2 convolutional layers with 64 3×3 filters followed by a Max Pooling layer. The pattern is repeated another time with 2 convolutional layers with 128 filters each followed by a max-pooling layer.

Then we have 3 blocks with 4 convolutional layers and a max-pooling layer per block. The number of filters equals 256 in the first of these blocks and is increased to 512 in the latter two. Finally, we have a layer to flatten the input, followed by 3 fully connected layers. All convolutional and fully connected layers use the ReLU activation function except for the last one, which uses the softmax to distinguish between the 1000 classes.

The VGG16 Architecture is, in principle, the same as the VGG 19 architecture. The main difference is that the 3rd, 4th, and 5th blocks only have 3 convolutional layers.
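The size bookkeeping for VGG16 is simple enough to trace in a few lines. The block configuration below is an assumed summary of the standard VGG16 layout (pairs of conv-layer counts and filter widths), not code from the paper:

```python
# VGG16 conv blocks: (number of 3x3 conv layers, filters per layer).
# Each block ends in a 2x2 max pooling with stride 2.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size = 224  # input side length
for n_convs, n_filters in blocks:
    # 3x3 convs with same padding and stride 1 preserve the size;
    # only the pooling at the end of each block halves it.
    size //= 2

print(size)         # 7 -> the final feature map is 7x7x512
print(7 * 7 * 512)  # 25088 inputs to the first fully connected layer
```

Five halvings take 224 down to 7, which is why the flatten layer feeds 25,088 values into the fully connected head.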

In this post, we will cover how to build a simple neural network in Tensorflow for a spreadsheet dataset. In addition to constructing a model, we will build a complete preprocessing and training pipeline in Tensorflow that will take the original dataset as an input and automatically transform it into the format necessary for model training.

*Note: If you already have a jupyter notebook with Tensorflow and Python up and running, you can skip to the next section.*

Before we can build a model in TensorFlow, we need to set up our machine with Python and Tensorflow. I assume that you already have Python installed along with a package manager such as pip or conda. If that isn’t the case, you need to do that before you can continue with this tutorial. Here is a great tutorial on installing Python.

Navigate to the directory where you want to work and download the Titanic Dataset from Kaggle to your working directory. Unzip the package. Inside you’ll find three CSV files.

It is generally good practice to set up a new virtual Python environment and install Tensorflow and your other dependencies into that environment. That way, you can manage different projects that require different versions of your dependencies on the same machine. Furthermore, it is much easier to export the required dependencies.

I am using conda for managing my packages. Open a terminal window (I’m on macOS, on Windows, use an alternative like Putty) and navigate to your working directory.

cd path/to/your/working/directory

Next, I am telling conda to create an environment with Python 3.8.

conda create --name titanic python=3.8
conda activate titanic

Now we need to install our dependencies. Besides Tensorflow, we install some basic Python packages for data manipulation as well as jupyter to run Jupyter notebooks. The standard pip installation of Tensorflow will give you the latest stable version with both GPU and CPU support. For this tutorial, we won’t need GPU support, but once you are running larger workloads, I highly recommend it.

pip install jupyter pandas matplotlib scikit-learn seaborn
pip install tensorflow

I ran into an error while importing NumPy with the standard pip installation. At the time of this writing, the latest NumPy package seems incompatible with Python 3.8, so I did another pip install specifying an earlier version.

pip install numpy==1.21.5

Start the jupyter notebook server by typing the following command into your terminal.

jupyter notebook

Create a new jupyter notebook, open it, and import the libraries we’ve installed in the previous section.

import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

We will just import the train.csv file as a pandas data frame from the titanic data directory that we’ve downloaded before.

data = pd.read_csv('titanic/train.csv')

We first look at the data to understand its shape.

print(data.shape)
data.head()

This is a record of 891 people who traveled on the Titanic when it sank. We want to predict who survived based on the remaining features in the dataset.

Let’s plot the variable “Survived” to see how it is distributed (how many people perished and how many survived).

sns.countplot(data=data,x='Survived')

It seems like more than 500 people died and somewhat over 300 survived. It is important to keep in mind that the classes are not balanced: a model could achieve well over 60% accuracy by simply predicting that everyone died.

Next, we will use our own intuition to filter the data and select the relevant features. When looking at the data, I’ve guessed that class, fare, age, and sex are probably important predictors. People who traveled in the first class and paid a higher fare were given priority over people in the lower classes. Women and children were probably also more likely to be saved. Of course, this is just my intuition.

*If your goal is to find the best model possible, you wouldn’t just rely on your intuition but do more preliminary analysis, tinker with the data, and investigate whether there are correlations between survival and other features. But our goal here is to build a neural network and an input pipeline that automates the whole process. For this tutorial, we don’t care about achieving the highest possible accuracy.*

I’m going to filter the data for the selected variables.

data = data[['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]

Next, we want to get rid of all null values in the data. Pandas makes this easy.

data = data.dropna()
print(data.shape) # (714, 5)

*If you want to achieve the highest possible accuracy, you might not want to simply drop the values that contain a null entry since you are throwing away a lot of useful information in the other features.*

Before we can start building the input pipeline, we want to separate the predicted variable from the predictors.

target = data.pop('Survived')

A large part of any machine learning project is spent on data manipulation and preprocessing before the data is suitable for training. If you want to deploy a model into production, you need to automate the process of preprocessing and transforming the data into a format suitable for model training.

If we look at the remaining predictors, we see that Pclass is a categorical feature expressed in integer values. There are a total of three classes, and a passenger falls into one of them. Sex is another categorical variable because passengers are either classified as male or female, and the classification is expressed in strings. The age and fare columns contain continuous numeric values.

Before we feed the values to a neural network, we should normalize the numeric values, and one-hot encode the categorical variables.

Normalization ensures that all numeric features are on a similar scale from 0 to 1, which helps speed up gradient descent. If you are unclear why normalization helps gradient descent, check out my post on data normalization.

One hot encoding creates a new column for every category in a variable. It then fills all row entries of each column with zeros except for the rows that actually contain that category.

For example, if one row in the original column contained the entry “male” and the second contained “female”, we would end up with two columns named “male” and “female”, each with 1s in the rows that contained the original category.

This may look like an unnecessary complication. We one-hot encode data because neural networks need numeric data to learn. But why can’t we just map every string category to a corresponding integer? It is possible in cases where a natural ordering exists in the data. But the network may infer an order that doesn’t exist in this case. For example, we don’t want the network to conclude that males are preferable to females. Giving every category its own column essentially binarizes the problem, which removes the potential for implicit ordering.
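To make this tangible before we build the Keras version, here is a tiny pandas illustration with made-up rows (this uses `pd.get_dummies`, not the lookup layers of the pipeline below):

```python
import pandas as pd

# Hypothetical mini-dataset with a single categorical column.
df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# One column per category; a 1 marks the category each row contained.
encoded = pd.get_dummies(df['Sex'])
print(encoded)
```

Row 0 gets a 1 under “male”, rows 1 and 2 get a 1 under “female”, with 0s everywhere else; no ordering between the two columns is implied.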

The survival status, our predicted variable, is another categorical feature expressed in integer values. If the person has survived, it is indicated by the value 1. Death is indicated by a 0. To be precise, survival is a binary feature, and since it only contains 0s and 1s, we don’t need to apply any transformation.

So we split our feature columns between categorical features, numeric features, and the predicted variable.

categorical_feature_names = ['Pclass', 'Sex']
numeric_feature_names = ['Fare', 'Age']
predicted_feature_name = ['Survived']

To feed the data frame to TensorFlow specific functions, we need to process it into a data object that Tensorflow can work with. To achieve this, we define a Keras tensor for every column in our dataset that acts as a placeholder for the actual data we pass at runtime.

In the following function we:

- iterate through the columns,
- identify the datatype of the column and assign a corresponding tensorflow datatype,
- create the tensor with the datatype and a yet undefined shape,
- pull the tensors into a dictionary and uniquely identify each by the name of the column.

def create_tensor_dict(data, categorical_feature_names):
    inputs = {}
    for name, column in data.items():
        if type(column[0]) == str:
            dtype = tf.string
        elif name in categorical_feature_names:
            dtype = tf.int64
        else:
            dtype = tf.float32
        inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)
    return inputs

inputs = create_tensor_dict(data, categorical_feature_names)
print(inputs)

Before we normalize the numeric features, we define a helper function that converts the numeric columns from our pandas’ data frame into tensors of floats and stacks them together into one large tensor.

def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))
    return fun(values, axis=-1)

Next, we create the normalizer. We first filter the numeric feature columns from the dataframe. Then we instantiate Keras’ inbuilt normalizer, tell it to normalize along the last axis of our tensor, and adapt the normalizer to our features which we’ve converted to a dictionary, and then to a tensor using the helper function.

def create_normalizer(numeric_feature_names, data):
    numeric_features = data[numeric_feature_names]
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(stack_dict(dict(numeric_features)))
    return normalizer

Finally, we perform the actual normalization in a new function to which we pass the normalizer. In the previous function, we’ve told the normalizer to expect a stacked dictionary of tensors. Previously, we’ve defined a dictionary that contains placeholder tensors for the model. At runtime, the model will expect an input of that same format. Therefore, we need to bring the dictionary into a format that our normalizer can process.

If we filter the dictionary down to just the placeholders for the numeric features and stack them using the stackdict function, we can pass them to the normalizer.

def normalize_numeric_input(numeric_feature_names, inputs, normalizer):
    numeric_inputs = {}
    for name in numeric_feature_names:
        numeric_inputs[name] = inputs[name]
    numeric_inputs = stack_dict(numeric_inputs)
    numeric_normalized = normalizer(numeric_inputs)
    return numeric_normalized

Now we can create the normalizer and subsequently normalize the data using the two functions. This will result in a tensor in which the numeric features have been stacked along the y-axis. Since our data has two numeric feature columns, the y-axis will have 2 entries.

normalizer = create_normalizer(numeric_feature_names, data)
numeric_normalized = normalize_numeric_input(numeric_feature_names, inputs, normalizer)
print(numeric_normalized)

Ultimately, we want to collect all preprocessed features together so that we can feed them to the model as one dataset. We start by creating an empty list of preprocessed features, to which we gradually add the features as we process them.

preprocessed = []
preprocessed.append(numeric_normalized)

We’ve processed the numerical features. As a next step, we want to one-hot encode the categorical features. Remember that we have categorical features that are encoded either as strings or as integers. Specifically, we need to distinguish between 3 classes and 2 sexes.

In the following function, we iterate through the columns in our dataframe that contain categorical features. We then check if the data contained is of type string or of type integer. In the first case, we define a lookup function to convert the column into a one-hot encoding on the basis of the contained string values. In the second case, we define an integer lookup function to achieve the same thing.

We then retrieve the placeholder from the inputs dictionary corresponding to the column, apply the previously defined lookup function to it, and retrieve the resulting one-hot encoding of the column.

Ultimately, we collect all one hot encodings in a list that we return from the function.

def one_hot_encode_categorical_features(categorical_feature_names, data, inputs):
    one_hot = []
    for name in categorical_feature_names:
        value = sorted(set(data[name]))
        if type(value[0]) is str:
            lookup = tf.keras.layers.StringLookup(vocabulary=value, output_mode='one_hot')
        else:
            lookup = tf.keras.layers.IntegerLookup(vocabulary=value, output_mode='one_hot')
        x = inputs[name][:, tf.newaxis]
        x = lookup(x)
        one_hot.append(x)
    return one_hot

The result is a list of tensors of categorical one-hot encoded features. We can, therefore directly add it to the list of preprocessed tensors.

one_hot = one_hot_encode_categorical_features(categorical_feature_names, data, inputs)
preprocessed = preprocessed + one_hot
print(preprocessed)

We now have a list of 3 tensors. But we want to feed one large tensor to the model, which is why we concatenate the results along the y-axis.

preprocessed_result = tf.concat(preprocessed, axis=-1)
print(preprocessed_result)

*Tensorflow automatically numbers the layers. Depending on how often you have executed the functions, you will see different numbers in the layer names.*

We are now in a position to build a model that automates the preprocessing on the basis of the functions we’ve defined before.

To instruct Keras to perform the preprocessing, we construct a new model. Then we have to pass the input in the expected format (the dictionary of tensor placeholders defined in the beginning) and the output in the expected format. The model now automatically applies all the intermediate functions defined before that are necessary to go from the input to the expected output.

preprocessor = tf.keras.Model(inputs, preprocessed_result)

To test that the preprocessor works, we pass it the first row of our dataset converted to a dictionary. That way, it should have the expected format as expressed by the placeholder input dictionary. We should get a 1×9 dimensional tensor: 1 because we only passed it one row, and 9 because the preprocessing produces a total of 9 features (2 normalized numeric features plus the one-hot encodings, which each include an extra slot for out-of-vocabulary values).

preprocessor(dict(data.iloc[:1]))

We use Keras’ sequential API to define the neural network. Since the use case is spreadsheet data, a simple feedforward multilayer perceptron should be enough. We use two dense hidden layers with 10 neurons each and apply the conventional ReLU activation function. The output layer has a single neuron that produces a raw logit for our binary classification problem; because we compile the loss with from_logits=True below, the sigmoid is applied inside the loss function, which is numerically more stable than applying it in the layer.

network = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])

Next, we tie the preprocessor and the network together into one model.

x = preprocessor(inputs)
result = network(x)
model = tf.keras.Model(inputs, result)

Take note of the brilliant simplicity of the Keras API: The network and the preprocessor can each be defined as separate models. We can simply link them together into one unified model by passing the output from the preprocessor as an input to the network.

Finally, we compile the model using the Adam optimizer (we could also use another one, but Adam is generally a good default), binary cross-entropy as the loss function (because we have a binary optimization problem), and accuracy as the evaluation metric.

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

We finally convert the data to a dictionary, so it has the expected format expressed by the placeholder dictionary and fit the model. Feel free to use a different batch size and number of epochs.

history = model.fit(dict(data), target, epochs=20, batch_size=8)

We’ve achieved an accuracy of 80%, which isn’t so bad. We could probably nudge that up a little bit by some more sophisticated feature engineering and playing with the model architecture.

You may have realized that we only trained the model on the training set. In the following function, we split the data between a training and a validation set based on a specified proportion. We choose an 80/20 split and fit the model.

def create_train_val_split(data, split):
    msk = np.random.rand(len(data)) < split
    train_data = data[msk]
    val_data = data[~msk]
    train_target = target[msk]
    val_target = target[~msk]
    return (train_data, val_data, train_target, val_target)

(train_data, val_data, train_target, val_target) = create_train_val_split(data, 0.8)
history = model.fit(dict(train_data), train_target,
                    validation_data=(dict(val_data), val_target),
                    epochs=20, batch_size=8)

The validation accuracy is a few percentage points below the training accuracy. This dataset is fairly small, which is why you should expect some fluctuation in the results when you randomly split the data.

Instead of passing data and outcome separately, you can also pull them together into a TensorFlow dataset. This has the advantage that your model fit call is much more concise, as you can specify the batch size and shuffle the data directly on the TensorFlow dataset before you call model.fit.

#Alternatively
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_data), train_target))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_data), val_target))
train_ds = train_ds.batch(8)
val_ds = val_ds.batch(8)
history = model.fit(train_ds, validation_data=val_ds, epochs=20)

As part of process automation, we also summarize the data manipulation we’ve performed in disparate steps during the initial analysis in a function, preprocess_dataframe. It selects the columns on the basis of the chosen feature names, removes the rows containing null, and splits the predicted variable from the predictors. The remaining functions are those that we have already defined before.

def preprocess_dataframe(data, categorical_feature_names, numeric_feature_names, predicted_feature_name):
    all_features_names = categorical_feature_names + numeric_feature_names + predicted_feature_name
    data = data[all_features_names]
    data = data.dropna()
    target = data.pop(predicted_feature_name[0])
    return (data, target)

def create_tensor_dict(data, categorical_feature_names):
    inputs = {}
    for name, column in data.items():
        if type(column[0]) == str:
            dtype = tf.string
        elif name in categorical_feature_names:
            dtype = tf.int64
        else:
            dtype = tf.float32
        inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)
    return inputs

def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))
    return fun(values, axis=-1)

def create_normalizer(numeric_feature_names, data):
    numeric_features = data[numeric_feature_names]
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(stack_dict(dict(numeric_features)))
    return normalizer

def normalize_numeric_input(numeric_feature_names, inputs, normalizer):
    numeric_inputs = {}
    for name in numeric_feature_names:
        numeric_inputs[name] = inputs[name]
    numeric_inputs = stack_dict(numeric_inputs)
    numeric_normalized = normalizer(numeric_inputs)
    return numeric_normalized

def one_hot_encode_categorical_features(categorical_feature_names, data, inputs):
    one_hot = []
    for name in categorical_feature_names:
        value = sorted(set(data[name]))
        if type(value[0]) is str:
            lookup = tf.keras.layers.StringLookup(vocabulary=value, output_mode='one_hot')
        else:
            lookup = tf.keras.layers.IntegerLookup(vocabulary=value, output_mode='one_hot')
        x = inputs[name][:, tf.newaxis]
        x = lookup(x)
        one_hot.append(x)
    return one_hot

def create_train_val_split(data, split):
    msk = np.random.rand(len(data)) < split
    train_data = data[msk]
    val_data = data[~msk]
    train_target = target[msk]
    val_target = target[~msk]
    return (train_data, val_data, train_target, val_target)

Using these functions, we can load our data and build our training pipeline that ends with the neural network.

#Load data and specify desired columns
data = pd.read_csv('titanic/train.csv')
categorical_feature_names = ['Pclass', 'Sex']
numeric_feature_names = ['Fare', 'Age']
predicted_feature_name = ['Survived']

#Preprocess dataframe
(data, target) = preprocess_dataframe(data, categorical_feature_names, numeric_feature_names, predicted_feature_name)

#Create tensorflow preprocessing head
inputs = create_tensor_dict(data, categorical_feature_names)
preprocessed = []
normalizer = create_normalizer(numeric_feature_names, data)
numeric_normalized = normalize_numeric_input(numeric_feature_names, inputs, normalizer)
preprocessed.append(numeric_normalized)
one_hot = one_hot_encode_categorical_features(categorical_feature_names, data, inputs)
preprocessed = preprocessed + one_hot
preprocessed_result = tf.concat(preprocessed, axis=-1)
preprocessor = tf.keras.Model(inputs, preprocessed_result)

#Define the model
network = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])
x = preprocessor(inputs)
result = network(x)
model = tf.keras.Model(inputs, result)
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

Now we can split the data frame into a training and a validation set and directly fit the model on the data.

(train_data, val_data, train_target, val_target) = create_train_val_split(data, 0.8)
history = model.fit(dict(train_data), train_target,
                    validation_data=(dict(val_data), val_target),
                    epochs=20, batch_size=8)

In this post, we understand the basic building blocks of convolutional neural networks and how they are combined to form powerful neural network architectures for computer vision. We start by looking at convolutional layers, pooling layers, and fully connected layers. Then, we walk step by step through a simple CNN architecture.

Layers are the basic building block of neural network architectures. Convolutional neural networks primarily rely on three types of layers. These are convolutional layers, pooling layers, and fully connected layers.

Let’s have a look at each of them.

Fully connected layers are the most elementary layers. They consist of a string of neurons stacked on top of each other.

Each neuron takes x as an input, multiplies it with a weight, and adds a bias.

wx + b

The result is sent through a non-linear activation function such as the ReLU to calculate an output that is sent to the next layer.

relu(wx+b)
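A single neuron is easy to compute by hand. The weight, bias, and input below are made-up illustrative numbers:

```python
import numpy as np

def relu(z):
    # ReLU passes positive values through and clamps negatives to zero.
    return np.maximum(0.0, z)

w, b = 2.0, -1.0  # weight and bias (illustrative values)
x = 3.0           # input

a = relu(w * x + b)
print(a)  # relu(2.0 * 3.0 - 1.0) = relu(5.0) = 5.0
```

Had the pre-activation wx + b come out negative, the ReLU would have output 0.0, which is exactly the non-linearity the layer contributes.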

**A convolutional layer is a layer in a neural network that applies filters to detect edges and structures in images. By using multiple convolutional layers in succession, a neural network can detect higher-level objects, people, and even facial expressions.**

The main reason for using convolutional layers is their computational efficiency on higher-dimensional inputs such as images. If we wanted to train a fully connected layer to classify images, we would have to roll out the 2D image into a one-dimensional stack of pixels. For example, an image with a dimension of 200×200 would become a stack of 40,000 pixels. Our fully connected layer would have to contain 40,000 units to learn each pixel value.

In a convolutional layer, we replace the multiplication of x with a weight w with a convolution operation. Instead of w, we use a 2-dimensional filter k that we convolve with our input image x.

For an explanation of how the convolution operation works, check out the post on convolutional filters.

Then, like in a fully connected layer, we add a bias to the result of the convolution c and send the final result through a non-linear activation function such as the ReLU.

a= relu(c+b)

You may have realized that the original image we fed to the convolutional layer had a dimensionality of 6×6, but the output was 4×4. Understanding how the convolution operation changes the dimensions of your input is crucial to getting your convolutional neural networks to work. The change of dimensions in a convolutional layer depends on the size of the filter, the stride, and the padding. The filter changes the image dimensions by a factor of

(m \times m) * (f \times f) \rightarrow (m-f+1) \times (m-f+1)

where m is the image length, and f is the filter length. The stride manipulates the output size by changing the step size with which we move the filter over the image, while the padding enables us to add a border around the image to prevent shrinkage due to the filter size. To understand exactly how these operations influence the size of the output, check out this post on padding, stride, and kernel sizes.
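The formula above, extended with stride and padding, fits in a one-line helper (this assumes the division works out evenly, i.e. the filter tiles the padded image exactly):

```python
def conv_output_size(m, f, stride=1, padding=0):
    """Side length of the output when an m x m input is convolved
    with an f x f filter, given a stride and symmetric zero padding."""
    return (m + 2 * padding - f) // stride + 1

print(conv_output_size(6, 3))             # 4: the 6x6 -> 4x4 example above
print(conv_output_size(32, 5))            # 28: LeNet's first convolutional layer
print(conv_output_size(6, 3, padding=1))  # 6: padding preserves the input size
```

Plugging in LeNet's numbers (m = 32, f = 5) reproduces the 28×28 feature map from the walkthrough below.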

Furthermore, convolutional layers usually slide multiple filters over the image. As a result, their output will contain three dimensions even if you only feed it a 2-d image.

**A pooling layer is a layer in a convolutional neural network that abstracts the features extracted by a convolutional layer and helps make the features invariant to translations. To achieve this, it applies the pooling operation on the output produced by the convolutional layer.**

To learn more about the pooling operation, check out this post on pooling.

The pooling layer is fairly simple. It only slides a filter over the output of the convolutional layer, selects the highest pixel value under the filter (provided we use max-pooling), and produces the output in a multidimensional map p.

The pooling layer also shrinks the output depending on the size of the pooling filter and the stride or step size with which we move the filter. Commonly, the pooling filter has a size of 2 and a stride of 2, resulting in shrinking the image by 50%. So a 6×6 input will result in a 3×3 output.
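Assuming TensorFlow is available, this shrinking is quick to verify on a toy 6×6 input:

```python
import tensorflow as tf

# A 6x6 single-channel "image" holding the values 0..35, row by row.
x = tf.reshape(tf.range(36, dtype=tf.float32), (1, 6, 6, 1))  # batch, h, w, channels

# 2x2 max pooling with stride 2 halves each spatial dimension.
p = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)

print(p.shape)                # (1, 3, 3, 1): 6x6 -> 3x3
print(p[0, 0, 0, 0].numpy())  # 7.0: the max of the top-left 2x2 block {0, 1, 6, 7}
```

Each output pixel is the maximum of one non-overlapping 2×2 block, so the 36 input values collapse to 9.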

A convolutional neural network architecture usually consists of a couple of convolutional layers, each of which is followed by a pooling layer. These layers have the purpose of extracting features and shrinking the dimensionality of the output. Towards the end, you will usually find a fully connected layer that rolls out the multidimensional input into one dimension and finally feeds it to the final output layer that is used to return the final classification.

To make this a bit more concrete, let’s have a look at one of the earliest convolutional neural network architectures, the LeNet5.

Compared to modern architectures, the LeNet5 was fairly simple, using 2 convolutional layers, 2 pooling layers, 2 fully connected layers, and an output layer.

LeNet takes a 32×32 grayscale image as an input, which means it does not have multiple color channels. The network was trained to recognize handwritten digits on the MNIST dataset. The image is passed to a convolutional layer with 6 filters, a filter size of 5×5, a stride of 1, and no padding. Due to the six 5×5 filters, the result is a feature map with dimensions of 28×28×6 (32 − 5 + 1 = 28). We also call these multidimensional data objects that are passed between layers tensors.

The feature map is passed to a pooling layer. The pooling filter has a dimension of 2×2 and is slid across the 6 channels produced by the convolutional layer using a stride of 2. Accordingly, the feature map shrinks by half along the vertical and horizontal dimensions to 14×14×6. LeNet applied average pooling, while most modern implementations mainly rely on max pooling.

The next convolutional layer applies 16 filters with a size of 5×5, a stride of 1, and no padding. The output along the horizontal and vertical dimensions shrinks from 14×14 to 10×10 (14 − 5 + 1 = 10), while the number of channels of the output map goes up to 16.

The convolutional layer is immediately followed by another average pooling layer using a filter size of 2×2 and a stride of 2. The result is a feature map with a dimensionality of 5×5×16.

The pooling layer is followed by a fully connected layer of 120 neurons. Each of these neurons connects to each of the 5×5×16 = 400 nodes in the previous layer.

Next, we have another traditional fully connected layer that reduces down to 84 nodes.

The fully connected layer finally connects to the output layer containing 10 nodes. Remember that LeNet was trained to distinguish between 10 different digits, which is why we have 10 output nodes.
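The shape arithmetic of the LeNet5 walkthrough above can be traced with two tiny helper functions. The helper names are my own illustration; the sizes follow the standard convolution and pooling formulas:

```python
def conv(size: int, f: int, n_filters: int, stride: int = 1, padding: int = 0):
    """Spatial size and channel count after a convolutional layer."""
    out = (size - f + 2 * padding) // stride + 1
    return out, n_filters

def pool(size: int, channels: int, f: int = 2, stride: int = 2):
    """Spatial size after a pooling layer (channel count unchanged)."""
    return (size - f) // stride + 1, channels

size, ch = 32, 1              # 32x32 grayscale input
size, ch = conv(size, 5, 6)   # first convolution  -> 28x28x6
size, ch = pool(size, ch)     # first pooling      -> 14x14x6
size, ch = conv(size, 5, 16)  # second convolution -> 10x10x16
size, ch = pool(size, ch)     # second pooling     -> 5x5x16
flat = size * size * ch       # flattened input to the 120-neuron dense layer
print(size, ch, flat)  # 5 16 400
```

Running the trace reproduces the 400 inputs that feed the first fully connected layer.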

**Pooling in convolutional neural networks is a technique for generalizing features extracted by convolutional filters and helping the network recognize features independent of their location in the image.**

Convolutional layers are the basic building blocks of a convolutional neural network used for computer vision applications such as image recognition. A convolutional layer slides a filter over the image and extracts features resulting in a feature map that can be fed to the next convolutional layer to extract higher-level features. Thus, stacking multiple convolutional layers allows CNNs to recognize increasingly complex structures and objects in an image.

A major problem with convolutional layers is that the feature map produced by the filter is location-dependent. This means that during training, convolutional neural networks learn to associate the presence of a certain feature with a specific location in the input image. This can severely degrade performance. Instead, we want the feature map and the network to be translation invariant (a fancy expression that means that the location of the feature should not matter).

In the post on padding and stride, we discussed how a larger stride in convolution operations could help the network focus on higher-level features. Focusing on the higher-level structures makes the network less dependent on granular details that are tied to the location of the feature. Pooling is another approach for getting the network to focus on higher-level features.

In a convolutional neural network, pooling is usually applied on the feature map produced by a preceding convolutional layer and a non-linear activation function.

The basic procedure of pooling is very similar to the convolution operation. You select a filter and slide it over the output feature map of the preceding convolutional layer. The most commonly used filter size is 2×2 and it is slid over the input using a stride of 2. Based on the type of pooling operation you’ve selected, the pooling filter calculates an output on the receptive field (the part of the feature map under the filter).

There are several approaches to pooling. The most commonly used approaches are max-pooling and average pooling.

In max pooling, the filter simply selects the maximum pixel value in the receptive field. For example, if you have 4 pixels in the field with values 3, 9, 0, and 6, you select 9.

Average pooling works by calculating the average value of the pixel values in the receptive field. Given 4 pixels with the values 3, 9, 0, and 6, the average pooling layer would produce an output of 4.5. In practice, this fractional value is kept as-is, since feature maps store floating-point numbers rather than integer pixel values.
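The two operations on the 4-pixel example above boil down to a max and a mean over the receptive field:

```python
import numpy as np

field = np.array([[3, 9],
                  [0, 6]])  # one 2x2 receptive field from the example

print(field.max())   # max pooling     -> 9
print(field.mean())  # average pooling -> 4.5
```

A full pooling layer simply repeats this reduction for every receptive field the filter visits.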

You can think of the numbers that are calculated and preserved by the pooling layers as indicating the presence of a particular feature. If the neural network only relied on the original feature map, its ability to detect the feature would depend on the location in the map. For example, if the number 9 was found only in the upper left quadrant, the network would learn to associate the feature connected to the number 9 with the upper left quadrant.

By applying pooling, we pull this feature out into a smaller, more general map that only indicates whether a feature is present in that particular quadrant or not. With every additional layer, the map shrinks, preserving only the important information about the presence of the features of interest. As the map becomes small, it becomes increasingly independent of the location of the feature. As long as the feature has been detected in the approximate vicinity of the original location, it should be similarly reflected in the map produced by the pooling layers.

Due to its focus on extreme values, max pooling is attentive to the more prominent features and edges in the receptive field. Average pooling, on the other hand, creates a smoother feature map because it produces averages instead of selecting the extreme values. In practice, max pooling is applied more often because it is generally better at identifying prominent features. Average pooling is mostly reserved for collapsing feature maps to a particular size.

Due to its ability to collapse feature maps, pooling can also help classify images of varying sizes. The classification layer in a neural network expects to receive inputs in the same format. Accordingly, we normally feed images in the same standard size. By varying the filter size and stride of the pooling operation, we can summarize differently sized images and still produce feature maps of the same size.
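One way to realize this is to derive the pooling windows from the target output size rather than fixing the filter and stride, so differently sized inputs collapse to the same map. The sketch below is my own simplified illustration of this idea (the function name and the `linspace`-based window split are assumptions, not a standard API):

```python
import numpy as np

def adaptive_max_pool(feature_map: np.ndarray, out_size: int) -> np.ndarray:
    """Max pooling whose window boundaries are derived from the target size,
    so inputs of different sizes collapse to the same out_size x out_size map."""
    h, w = feature_map.shape
    # Split rows and columns into out_size roughly equal, non-overlapping bands.
    rows = np.linspace(0, h, out_size + 1).astype(int)
    cols = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = feature_map[rows[i]:rows[i + 1],
                                    cols[j]:cols[j + 1]].max()
    return out

# A 10x10 and a 14x14 feature map both collapse to 3x3:
print(adaptive_max_pool(np.ones((10, 10)), 3).shape)  # (3, 3)
print(adaptive_max_pool(np.ones((14, 14)), 3).shape)  # (3, 3)
```

Deep learning frameworks ship this as "adaptive" or "global" pooling layers; the windows above are a minimal stand-in for those implementations.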

In general, pooling is especially helpful when you have an image classification task where you just need to detect the presence of a certain object in an image, but you don’t care where exactly it is located.

The fact that pooling filters use a larger stride than convolutional filters and result in smaller outputs also supports the efficiency of the network and leads to faster training. In other words, location invariance can greatly improve the statistical efficiency of the network.