the-moose-machine

Pandas vs Apache Spark vs Power BI Desktop: big data performance on a single machine

Selecting the right tool for the job takes experience and knowledge, and knowing which tool fits can save a lot of time, energy and cost. This post aims to provide some insight in this area, which should allow for the selection of a tool that is appropriate for…

Create your own Kafka cluster: Data streaming

At some point in your data science/engineering career, you will need to handle streaming data. The challenge with data streams is the velocity and volume of the incoming data, which leaves you no choice but to manage it in some way. Apache Kafka is a great tool…

Create your own Hadoop cluster: Living in a parallel dimension

At some point during your data science/data analytics journey, you will need to process huge amounts of data. At such times, it is useful to be able to harness the power of parallel processing. In simple words, parallel processing relates to the harnessing of the processing power of…

Data Extraction – Text and Numeric

Once you have located the data, you have to begin the task of data extraction. Data extraction refers not only to extracting data from online sources; it also includes offline and non-digital sources. This post will make an effort to cover some of the methodologies I find useful in this process. It…

Data Acquisition

Any machine learning pipeline starts with the Extraction, Transformation and Loading (ETL) of data into the system, which is by far my favourite part of data science. ETL Basics: As the name suggests, ETL comprises the following parts: Data Extraction: This part deals with the acquisition of…

Where do I start? Where do I begin?

First come the questions: what is an RNN? What is a CNN? By the way, isn’t that an American news channel? What on earth is a Support Vector Machine? What is business intelligence? What is deep learning? Why is it ‘deep’? For that matter, is there anything called ‘shallow learning’? Then comes the resolution: ‘Ok. Let me…