Any machine learning pipeline starts with the Extraction, Transformation and Loading (ETL) of data into the system, which is by far my favourite part of data science.
ETL Basics
As the name suggests, ETL comprises the following parts:
- Data Extraction: This part deals with the acquisition of the data from the source. Data sources can vary from offline paper-based to online digital.
- Data Transformation: This part deals with the transformation of the original data into a format that is suitable for analysis.
- Data Loading: This deals with the loading of the transformed data into the system. This stage is followed by the model training process.
Both extraction and transformation are time-consuming processes, and a lot of the data scientist’s time is spent fighting with the data. The challenge and the enjoyment of ETL lie in the knowledge that every crafty issue and challenge has a solution. It is up to us to find the most suitable and efficient method of arriving at a solution that works.
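To make the three stages concrete, here is a minimal sketch in Python using only the standard library. The sample data, the column names, the `people` table and the function names are all made up for illustration; a real pipeline would read from an actual source and would likely need far more cleaning than this.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a real source (file, API, scanned form).
RAW_CSV = """name,age,city
Alice, 34 ,London
Bob,,Paris
Carol,29,berlin
"""


def extract(text):
    """Extraction: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim whitespace, fix types and drop incomplete rows."""
    clean = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # skip rows where the age is missing
        clean.append((row["name"].strip(), int(age), row["city"].strip().title()))
    return clean


def load(records, conn):
    """Loading: store the transformed records where later stages can read them."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory SQLite database plays the role of "the system" here.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, age, city FROM people ORDER BY name").fetchall()
print(rows)
```

Running this prints the two complete rows, Alice and Carol, with the stray spaces removed and the city names capitalised; Bob is dropped because his age is missing. Whether to drop or fill in such rows is a judgement call that depends on the analysis.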
Challenges with Data Acquisition
As mentioned earlier, the goal of data extraction is to obtain the data. This sounds simpler than the process actually is. The following are some of the major issues surrounding data acquisition:
- Access: Although a lot of data is available in the public domain, the vast majority is hidden either behind closed doors or copyright licences. Therefore, gaining access to these sources may not be simple.
- Legality: Currently, most webpages on the Internet display copyright notices. These notices can make unauthorised usage of the data within those pages illegal. Therefore, best practice dictates that you proactively seek out the organisation’s data usage guidelines and apply for authorisation from the owners before extracting data for usage.
- Ethics: Data analysis ethics are a complex issue. This is discussed in some detail in a later section.
A great example can be seen with the MIMIC data set provided by the Massachusetts Institute of Technology. Due to the sensitive nature of the data, people wishing to access it are required to complete a free online certification course where a lot of the ethical and legal concerns are addressed. More details on this can be viewed on their website.
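For web sources, one small, machine-readable part of a site's usage guidelines is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a URL against such a file. The robots.txt content, the example.org URLs and the `may_fetch` helper are hypothetical. Passing this check is not authorisation: it only tells you what the site asks automated clients to avoid, and the copyright and licence terms still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own file at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "https://example.org/data/table.csv"))
print(may_fetch(ROBOTS_TXT, "https://example.org/private/records.csv"))
```

The first URL is allowed and the second is not, because it falls under the disallowed `/private/` path.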
Sources of Data
Data sources may vary depending on the type of data. Sources may fall under one of the following categories among others:
- Open Online Sources: The most commonly used sources when learning data science. Open and online sources are also commonly used in academic studies.
- Closed Online Sources: These sources may be available online. In several cases they may also be easy to access; however, their usage will be restricted by strict copyright laws.
- Closed Offline Digital Sources: These sources are only available to those who are granted access by the organisation. They may be stored offline somewhere within the organisation or on the cloud.
- Closed Offline Non-Digital Sources: These sources are amongst the most difficult to process as they are in manual format, that is, physical, usually paper-based, documents. Such documents first have to be scanned or otherwise transformed into digital format before the processing can take place. A scanner that is capable of recognising text (Optical Character Recognition) is highly recommended for documents.
For your own personal learning and experimentation purposes, the usage of open access data repositories, like the UCI machine learning repository and Kaggle, is recommended. Links to additional datasets and repositories are provided by the American Psychological Association, by the Carnegie Mellon University library, at Visualdata.io and at Google’s dataset search application, among others. Please do not forget to cite the source when reporting your findings.
Ethics of Data Analysis
Even if the data are available, their analysis and presentation may have ethical issues attached to them. For instance, even if identifying information is removed, it may still be possible to identify individuals from the information that remains within the data. Care must also be taken when presenting the results, particularly if they contain information about people or groups of people. Research findings have the potential to be misused to discriminate against individuals and communities for political gain.
If you are doing academic research, you will be covered by the rigorous ethics reviews that take place before, during and after data analysis. On the other hand, when working for a client, such insight may well be desirable, as the organisation may need it to develop its marketing strategies. In this case, however, the results will not be accessible to the general public, so the potentially harmful effects on society may be contained. Bear in mind that organisations are prone to hacking and other attacks on their information, so care has to be taken to ensure the security and safety of the findings. Limited access and data encryption software are highly recommended.
Meanwhile, if the study is being undertaken independently, the responsibility for the findings and their safety rests completely on your shoulders. Do remember that we do not live in a utopia; therefore, when in doubt, it would be prudent to have an independent and, hopefully, qualified person look at the findings before you decide to go ahead with their publication.
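The re-identification risk mentioned above can be checked roughly before sharing a data set. The sketch below counts how often each combination of indirectly identifying columns occurs; a combination that appears only once points to a single person even though no name is stored. The records, the column names and the choice of `postcode`, `birth_year` and `sex` as quasi-identifiers are made up for illustration, and this simple count is only a first check, not a guarantee of anonymity.

```python
from collections import Counter

# Hypothetical records with names already removed.
records = [
    {"postcode": "AB1", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"postcode": "AB1", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"postcode": "CD2", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Columns assumed to be indirectly identifying when combined.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def unique_combinations(rows, keys):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


at_risk = unique_combinations(records, QUASI_IDENTIFIERS)
print(at_risk)
```

Here two of the four records are unique on those three columns, so anyone who knows a person's postcode, birth year and sex could look up their diagnosis. Coarsening the columns, for example by replacing the birth year with an age band, is one common way to reduce that risk.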
The next post will cover some techniques of data extraction that I find useful.
Image courtesy: “Knabbel went shopping.” by jpockele is licensed under CC BY 2.0