First come the questions: what is an RNN? What is a CNN? By the way, isn’t that an American news channel? What on earth is a Support Vector Machine? What is business intelligence? What is deep learning? Why is it ‘deep’? For that matter, is there such a thing as ‘shallow learning’?
Then comes the resolution: ‘OK. Let me see what this artificial intelligence thing is all about.’ And that brings you here. Perhaps you have tried reading some material, found the content overwhelming, and ended up no wiser than when you started. This post intends to give you an idea of the foundations of machine learning. Hopefully it provides a basic picture of what is going on and which direction to follow if you are keen to try your hand at it.
GIGO
The foundation of machine learning, as for everything else in the world, is Garbage-In Garbage-Out (GIGO), also known as ‘there are no stupid answers, only stupid questions’. In other words, if you provide rubbish, you will receive rubbish in return. Please do not expect to magically discover gold instead. It is not going to happen. Gold production takes a lot of hard work, and a lot of rubbish has to be sorted through before you can even begin the extraction process… and you may not get much gold in the end. So, if you are happy to spend most of your time fighting rubbish and figuring out what use can be made of it, that is a good start. There will be no miracles: you get out what you put in.
Data is Dirty
Real-world data is dirty. Why? Because no one thought that any further use would be made of it beyond what it was designed for. For instance, many years ago, no judge passing judgements in court ever imagined that one day someone might want to trawl through all of their cases, and those of every other judge in history, looking for patterns in the decisions. To the best of their knowledge, their reports were meant for human consumption, not machines. So off they went, creating documents in whichever format made sense to them, and others followed in their footsteps. For a long time this worked well at a human level; however, the documents were unsuitable for computers, which are far more efficient than humans at processing and analysing data. This is where we are currently. It is our job to wrestle with this data and convert it into a format that computers can understand. Hopefully, some day in the near future, all documentation will be stored and communicated in a standardised format, like JSON or XML. However, that day hasn’t yet arrived.
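To make this concrete, here is a minimal sketch of what ‘converting dirty data into a machine-friendly format’ can look like, using only Python’s standard library. The court-record line below is entirely made up (a hypothetical format, not any real registry’s), but it illustrates the idea: pull the fields out of free-form text and emit structured JSON.

```python
import json
import re

# A hypothetical free-form record line (invented for illustration only).
raw = "Case No. 1042 -- Smith v. Jones, decided 12 March 1998: appeal DISMISSED"

# Pull out the fields with a regular expression using named groups.
match = re.match(
    r"Case No\. (?P<case_no>\d+) -- (?P<parties>.+?), "
    r"decided (?P<date>.+?): appeal (?P<outcome>\w+)",
    raw,
)

# Build a clean, machine-readable record from the captured groups.
record = {k: v.strip() for k, v in match.groupdict().items()}
record["outcome"] = record["outcome"].lower()

print(json.dumps(record, indent=2))
```

Real documents are, of course, far messier than one tidy line, but the workflow is the same: identify the structure hiding in the text, extract it, and store it in a standard format.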
Data is Biased
As mentioned in the GIGO section above, if the input has issues, so will the output. You may have come across instances where algorithms have made seriously bad decisions. Some of these, like recommending terrible movies or promoting poor-quality videos, are at worst harmless and hilarious. However, when the same algorithms are used to make biased and unfair decisions about people, such as determining student exam results, rejecting visa applications or predicting violent crime, they can have serious repercussions on people’s lives and careers. How can this be resolved? One way is by identifying the biases in advance and selecting the data as judiciously as possible. This is a very difficult task, as the real world is not perfect, although ‘data scientists’ are expected to treat it as if it were. That is a hard job. Unforeseen issues and challenges show up in the results as time goes by, forcing us to retrace our steps and go back to basics to minimise the damage caused.
Data Types
The cleaning of data depends on the data itself and each type of data needs a specific type of cleaning. The most common data types include:
- Numerical Data: The most common data type, typically found in spreadsheets and usually provided in tabular format.
- Image Data: Visual format like photographs and videos.
- Textual Data: The data is usually alphanumeric and in natural human language format. These include textual documents, social media feeds, webpages among others.
- Audio Data: Data in audio format. These are usually provided as audio files although they can also be converted into visual format and processed as image data instead.
Tools Required
The following is a list of tools that I, personally, feel are required by anyone wishing to learn the foundations of machine learning:
1. Flowcharts
Before you start with anything, a thorough knowledge of how to create flowcharts is highly recommended. Flowcharts allow you to logically map out the steps required to execute a process. Once you have clearly mapped it out, you simply have to convert it into code. Coding in any language becomes much easier when you know exactly what you are trying to do at every stage of the process.
Any time I have to create a script, I start with a blank sheet of paper and a pen and draw out a flowchart. Flowcharts are an indispensable tool for data cleaning.
2. Computer Languages
The cleaning of data requires knowledge of a computer language. If you know any computer language at all, you are ready to go. In my case, I only knew BASIC, which I learned in the 1980s and which was useless in a modern context. However, I was familiar with Bash, which is not a programming language as such but the command-line shell I used for managing the Linux operating system on my computers. As I did not have time to learn a new language when working on my first machine learning project, I did the data cleaning and processing with Bash scripting alone. Although not recommended, this shows that data processing can be done in whatever language you are comfortable with.
In case you do not know any coding, both Python and R are good options. Although these languages are not ideal for software development, they are easy enough to read and understand for people without a coding background. There is also a general movement in the artificial intelligence community towards doing all coding in Python in order to lower the barrier to entry. So, if you look at other people’s code, there is a very high likelihood that it will be in Python. People in the open source community have worked tremendously hard to create packages for that language. Several of these packages are based on far more efficient languages like C, wrapped inside a Python interface. So with Python we get the best of both worlds: fast processing and easy comprehension.
3. Packages
There is an excellent number of packages available for both languages. These packages can be used for data cleaning, processing and modelling. Here, I am recommending Python packages in order to keep things simple; perhaps I may do an R-specific blog in the future. Note that R has several of these functionalities built in. The following Python packages are recommended for the various types of data mentioned earlier. Links are provided for each of them.
- Numeric data: Pandas, NumPy, statsmodels and scikit-misc.
- Image data: torchvision, pillow
- Textual data: regular expressions, via Python’s built-in re module, are your one-stop shop.
- Audio data: wave, scipy particularly the scipy.io package, librosa
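As a small taste of what these packages do, here is a minimal sketch of cleaning a numerical table with pandas. The table itself is made up for illustration: inconsistent capitalisation, a missing value, and a numeric column stored as text — three of the most common problems you will meet in real spreadsheets.

```python
import pandas as pd

# A hypothetical messy table (invented data for illustration).
df = pd.DataFrame({
    "city": ["london", "PARIS", "London", None],
    "temperature": ["12.5", "15.0", "11.8", "14.2"],
})

# Standardise the text, fill the gap, and fix the column type.
df["city"] = df["city"].str.title().fillna("Unknown")
df["temperature"] = pd.to_numeric(df["temperature"])

print(df)
print("Mean temperature:", df["temperature"].mean())
```

Three lines of cleaning here; on a real dataset, expect this stage to take up most of your project time.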
4. Platforms
Now comes the time to do some actual data mining. This has two major approaches:
4.1 Shallow Learning
In most cases, the knowledge of flowcharts, Python and its packages will be sufficient to start building machine learning models. The scikit-learn package, in particular, contains all the tools to start building ‘shallow learning’ models including Support Vector Machine, Random Forest, Naive Bayes, k-Nearest Neighbours, k-Means clustering amongst many others. Shallow learning is usually used for analysing numerical data.
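To show how little code a first ‘shallow’ model takes, here is a minimal scikit-learn sketch. It uses a synthetic dataset (generated on the fly, standing in for your real numerical data) and fits a Random Forest, one of the models named above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate a small synthetic numerical dataset as a stand-in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a quarter of the data to evaluate the model honestly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a Random Forest and measure accuracy on the unseen test set.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in a Support Vector Machine or k-Nearest Neighbours is a one-line change, which is precisely why scikit-learn is such a good place to start.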
4.2 Deep Learning
As you progress you will want to start investigating Artificial Neural Network-based models, including the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN). These models, also known as ‘deep learning’, require a lot of computational power and a serious upgrade to your computer hardware. Using an NVIDIA Graphics Processing Unit (GPU) is highly recommended. If you wish to analyse image, sound or text data, deep learning is the way to go.
As of the present day, the following platforms are available for this purpose:
- PyTorch: created by Facebook.
- TensorFlow: created by Google.
Both are equally good; however, in recent times PyTorch development has overtaken that of TensorFlow. Between the two, I personally find PyTorch code easier to follow, but that is my personal preference. TensorFlow does allow a lot of flexibility and tweaks. Keras, a high-level API that now ships as part of TensorFlow, is equally popular and easy to follow. However, when it comes to making refinements, I often find myself dropping back down to TensorFlow code.
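Whichever platform you choose, under the hood both boil down to the same linear algebra. As a rough illustration of what a framework computes for you, here is the forward pass through a single fully connected layer, sketched in plain NumPy (the sizes are arbitrary; real frameworks add automatic differentiation, GPU support and much more on top of this):

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 inputs, each with 3 features.
x = rng.normal(size=(4, 3))

# One fully connected layer: a weight matrix, a bias and a ReLU activation.
W = rng.normal(size=(3, 5))  # 3 input features -> 5 hidden units
b = np.zeros(5)

# Forward pass: ReLU(xW + b).
hidden = np.maximum(0, x @ W + b)

print(hidden.shape)
```

A deep network is essentially many such layers stacked together; what PyTorch and TensorFlow add is the machinery to compute gradients through the stack and train it efficiently on a GPU.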
5. Cloud Providers
A familiarity with cloud-based platforms is highly recommended. Personally I do most of my work on the cloud now. Why? It is a lot cheaper than buying hardware and you can always cut down usage and save money when heavy processing is not required. Finally, cloud-based platforms are also easier to deploy when working for a client in the industry.
Key cloud providers include:
- Google Colaboratory: An excellent option for learning. Google provides free access to their resources for all Google account holders. They also provide a GPU for free which is very useful for deep learning, particularly with PyTorch. The only issue is that your data and the models you create will not be stored by default. So you will have to link up your Google Drive and save over there.
- Kaggle: Kaggle also provides a free cloud platform and allows you to create and run notebooks. They too provide a GPU for free.
- Amazon Web Services: They have some excellent tools for developing, training and deploying machine learning models. However, this platform, and the ones that follow in this list, are mostly paid options. Go for them if you wish to spend money: you will develop familiarity with the platform, which will make things easier if you later pursue a career in this field. Alternatively, if you can get access to any of these platforms through your workplace, that is even better. Ask your management and take up the opportunity, if available.
- Microsoft Azure: Microsoft has been aggressively developing their cloud platform in recent years. They also have some excellent tools, and a nice user interface where you can drag and drop the services you want in order to build your system. The advantage with Azure is that most companies are still tied to Microsoft products and are happy to stay within that ecosystem. They provide some free artificial intelligence services for up to a year, including storage and some analytical tools. Azure tutorials are available here.
- Digital Ocean: They have some good-quality cloud services. Their advantage lies in their very competitive pricing. However, at the time of writing, they did not appear to offer GPUs. Although this can make model training time-consuming, model deployment remains easy, as a deployed model does not need as much processing power as training does.
- Google Cloud Platform: Google’s cloud platform offers a plethora of services and good integration with existing Google products.
There are a few others you may choose from. However, if you are just starting out and do not have access to any of the paid services, I do not recommend spending money. Google Colab and Kaggle are perfect for learning the ropes.
6. Data Visualisation & Dashboarding
Finally, we come to data visualisation and dashboarding. This is where your marketing skills kick in. Why are these so important? We must remember that any technological solution, or academic research for that matter, is ultimately created for the consumption of non-technical people. These include company directors, customers, operational staff and the general public, among others. I am in no way suggesting that these people lack the knowledge or skills you have; many of them certainly have both. However, in many cases they do not. Since it is crucial to be able to sell them our ideas and findings, visualisations and dashboards have to be designed with the user in mind.
6.1 Data Visualisation
Most people like looking at simple, easy-to-understand graphics. Even though complicated three-dimensional graphs may be your thing, many people find them off-putting; they would rather see a simple two-dimensional bar graph. I am aware that we may have a strong desire to use the latest and fanciest visualisation tools; however, we must remember that the audience, in most cases, are the ones with the money and/or decision-making authority. Both shiny graphics and plain academic charts have their audiences. It is prudent to work out who your audience is and design for them. As a simple rule of thumb, keep the narrative simple and focused on the point you are presenting.
The following are some popular Python packages which will give you a head start:
- matplotlib: The most popular and standardised data visualisation package.
- seaborn: Based on matplotlib but with the ability to render complex visualisations.
- plotly: Allows for the creation of high quality and interactive visualisations. Very good for giving your visualisations that additional polish.
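Following the rule of thumb above, here is a minimal matplotlib sketch of the kind of simple two-dimensional bar graph most audiences prefer. The figures are invented for illustration, and the chart is written straight to a file so it can run headlessly:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no display required
import matplotlib.pyplot as plt

# Illustrative (made-up) quarterly figures.
categories = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]

# A plain two-dimensional bar graph: clear labels, no 3D effects.
fig, ax = plt.subplots()
ax.bar(categories, sales)
ax.set_xlabel("Quarter")
ax.set_ylabel("Sales")
ax.set_title("Quarterly Sales (illustrative data)")

fig.savefig("sales.png")
```

The same data could be rendered with seaborn or plotly in a couple of lines each; the point is that the chart, not the tool, should be chosen for the audience.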
6.2 Dashboarding
Dashboarding is the process of presenting regularly changing data and results through an interactive interface. Dashboards provide the front-facing interface for the users of your deployed model. As with data visualisation above, dashboards should be designed with the user and the use case in mind.
In my view, a good grasp of visualisation and dashboarding techniques is just as important as, if not more important than, the creation of machine learning models. So experiment with a few of these tools and see which ones you like and which works best for a given audience.
Free Online Courses
There are plenty of free resources you can use to begin your learning journey. However, the following are a few that I personally liked:
- Kaggle: It has some excellent free tutorials covering a lot of the topics listed on this page. This will be a good place to start.
- FastAI: Practical Deep Learning for Coders is an excellent free course provided by FastAI. They also have a very good deep learning library for Python which allows for the creation of high-quality models. The only issue I have with this library is that it focuses heavily on minimising the number of lines of code. Although this may be useful for beginners, on most occasions the models need tweaking to improve their performance; the relevant code, however, may be buried deep inside the packages, and tracking down the source wastes a lot of my time. The second issue I have with the fastai package is that its code does not follow the PEP 8 style, which makes it difficult to read. However, my nitpicky views have nothing to do with Jeremy Howard’s course, which is excellent and highly recommended.
- Machine Learning by Stanford University: The free machine learning course by Andrew Ng at Stanford University is an excellent hands-on course for a deeper study of the fundamentals of deep learning. You will build the algorithms yourself, implementing them from their linear algebra formulae. The course is time-consuming, but you will develop a clear understanding of the simplicity and beauty of the maths. It will be useful if you are inclined towards academic literature and wish to learn the basics needed to develop your own algorithms. Top-quality stuff, but for advanced learners. Let this not be the first course you register for.
Summary
So, to summarise, I have provided a simple list above with working links (as of the date of publication) for some of the tools that you need when you begin your journey in the field of data mining. Hope you find it useful. Did you find any excellent sources that you feel others need to know about? Do provide your views and opinions below. I look forward to reading them.
Image courtesy: “Artificial Intelligence & AI & Machine Learning” by mikemacmarketing is licensed under CC BY 2.0