Once you have located the data, the next task is data extraction. Data extraction does not refer only to extracting data from online sources; it includes offline and non-digital sources as well. This post covers some of the methods I find useful in this process.
It is likely that most of the data you find will be in a digitised format. However, the formats may vary from web-based Hypertext Markup Language (HTML) to the Portable Document Format (PDF). The following are some techniques for handling common formats:
Text Data
Text data are usually extracted either from websites or PDF documents.
1. Web Extraction
The extraction of text data from webpages can be done in one of two ways:
- Application Programming Interface (API): Several companies provide an API that users can employ to access their data easily. APIs are usually customised by each organisation, so their usage varies according to company policy. In most cases, accessing the API requires the user to register with the company and gain their approval. Various social media organisations, including Twitter and Facebook, provide APIs for this purpose. However, some organisations explicitly refuse access to their API resources for data analysis, for example Tripadvisor. Be careful when looking for APIs, as there are several dishonest websites out to cheat you; always verify that the website used for accessing the API is the one officially provided by the company. A minimal, purely hypothetical sketch of an API request is shown after this list.
- Web Crawler: If an API is not available, another method would be to use a web crawler or spider to download data from public web pages. Note: accessing publicly available data is usually subject to copyright laws as was discussed in my previous post on Data Acquisition. It is highly recommended that you seek out the organisation’s copyright policy before starting to avoid any potential trouble in the future. Only continue down this path once you are clear about your legal position.
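To illustrate the API route, the small sketch below requests data from a purely hypothetical endpoint. Note that api.example.com, the query string and the token header are placeholders, not a real service; each organisation's real API has its own URL structure, authentication scheme and rate limits, documented on its official developer pages.

# Hypothetical example only: the endpoint and token are placeholders
import json
from urllib.request import Request, urlopen

request = Request(
    "https://api.example.com/v1/posts?query=data+extraction",
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},  # issued after registration
)
with urlopen(request) as response:
    data = json.loads(response.read().decode("utf-8"))
print(data)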
1.1 Creation of Web Crawler
Unfortunately, no standardised web crawlers exist, so you will have to create one yourself. Several tutorials are available to assist you with this; however, not all of them explain how to crawl ethically and legally. It is highly recommended that your crawler respect the privacy and copyright of the website and follow the guidelines in the robots.txt file provided by the website's developers.
1.2 Working with robots.txt
robots.txt is a text file served at the root of the website you wish to access (for example www.facebook.com/robots.txt and www.google.com/robots.txt). It provides important information about access to the website. The file uses simple pattern rules (with wildcards reminiscent of regular expressions), and it is polite to incorporate it into the crawler, as it sets out which parts of the website may be accessed by crawlers and search engines. The crawler will still work if robots.txt is ignored; however, doing so may have consequences regarding copyright infringement. Some tools, such as the Python package Scrapy, respect robots.txt out of the box (projects generated with its default template enable this behaviour) and spare you this additional effort. Some websites can detect when they are being accessed by crawlers and will automatically block your access as soon as they are triggered.
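If you are writing the crawler in Python, the standard library already includes a parser for this file. A minimal sketch, using www.google.com purely as an example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()  # download and parse the file

# True only if the rules allow a generic crawler ("*") to fetch this URL
print(rp.can_fetch("*", "https://www.google.com/search?q=test"))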
1.3 URL manipulation
One of the key aspects of creating a web crawler is investigating the schema of the Uniform Resource Locator (URL) of the targeted website. The data you seek will be spread across numerous pages of this website, and you will need to develop a program that constructs each URL in a loop, downloading the page and saving it.
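As a sketch of what this looks like in Python, assume a purely hypothetical schema in which results are spread over numbered pages; example.com and the page parameter stand in for whatever your own investigation reveals:

# Hypothetical schema: https://www.example.com/articles?page=1, ?page=2, ...
base_url = "https://www.example.com/articles?page={}"

for page_number in range(1, 101):
    url = base_url.format(page_number)
    # ... download `url` and save it, as described in the following sections
    print(url)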
1.4 Webpage extraction
My personal favourite method is to use a simple text-based web browser, such as w3m, and redirect its output into a text file. I have also had success with the Python 2 package urllib2 (whose functionality now lives in urllib.request in Python 3).
The extraction can be as simple as the following Bash command:
$ w3m www.google.com > 1.txt
The command above uses the w3m browser to open the desired webpage (here: www.google.com) and redirects its output (the ‘>’ operator) into a text file called 1.txt. Note the use of a number to name the output file: numbering the output files will be useful for looping purposes during data manipulation later on. It is recommended that webpages be saved as plain text instead of HTML to reduce storage requirements; plain text files are also easier to work with during data manipulation.
1.5 Catching URL errors
Your crawler must account for errors, as not all the URLs you develop will be legitimate. The crawler must therefore be built to catch such errors at runtime rather than stopping each time it comes across one. For instance:
# This code runs on Python 2, which is now unsupported
import urllib2

try:
    website_output = urllib2.urlopen(predetermined_url)
    print "success"
except urllib2.HTTPError, e:
    print "failure"
Here, urllib2 is the Python 2 package that must first be imported. try indicates the start of a block that runs in ‘safe mode’, and except specifies what happens if the given error occurs. predetermined_url is the URL you have developed based on your investigation of the website's schema, and website_output is a variable that stores the webpage you opened. The line below the except clause runs only if the error occurs (HTTPError is raised when the server returns an error response, such as 404 Not Found); otherwise the program continues to the line after this block. This block can be followed by writing the content of website_output into a plain text file (e.g. 1.txt) if the attempt was successful, and then looping back to develop the next predetermined_url regardless of the success or failure of this attempt.
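Since Python 2 has reached end of life, here is a rough Python 3 equivalent of the same idea; urllib2's functionality now lives in urllib.request and urllib.error, and predetermined_url is still assumed to come from your URL-construction step:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    website_output = urlopen(predetermined_url)
    print("success")
except (HTTPError, URLError):
    # HTTPError: the server returned an error code (e.g. 404)
    # URLError: the server could not be reached at all
    print("failure")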
1.6 Web extraction summary
So, to summarise, your web crawler should have the following functionality built in:
1. Access the robots.txt file at the root of the website.
2. Develop URLs according to the schema that you have studied in advance, while respecting the guidelines provided by robots.txt.
3. Extract each webpage and save it as a plain text file, catching URL errors when an attempt fails.
4. Output the success or failure of each attempt in order to keep track of the crawler's performance.
5. Loop back to step 2 and develop the address of the next URL.
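The following is a minimal skeleton that ties these steps together in Python 3. It is a sketch rather than a production crawler: the page-numbering schema of example.com is purely hypothetical, there is no politeness delay or user-agent handling, and it saves the raw HTML (you could instead pipe each URL through w3m, as shown earlier, to keep only the rendered text).

from urllib import robotparser
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

BASE = "https://www.example.com"        # hypothetical website
SCHEMA = BASE + "/articles?page={}"     # hypothetical URL schema

# Step 1: read robots.txt
rp = robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

for count in range(1, 101):
    # Step 2: develop the next URL, respecting robots.txt
    url = SCHEMA.format(count)
    if not rp.can_fetch("*", url):
        print(count, "skipped (disallowed by robots.txt)")
        continue
    # Step 3: extract the page and save it as a numbered plain text file
    try:
        page = urlopen(url).read().decode("utf-8", errors="replace")
        with open("{}.txt".format(count), "w") as output_file:
            output_file.write(page)
        # Step 4: report the outcome of this attempt
        print(count, "success")
    except (HTTPError, URLError):
        print(count, "failure")
    # Step 5: the loop moves on to the next URL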
2. PDF Extraction
Extracting text out of PDF files is considerably easier than developing a successful web crawler. The objective is still the same though: extract the data and save it as a plain text file. If the PDF contains a text layer, either because it was generated digitally or because it has already been processed with Optical Character Recognition (OCR) technology, the process should be fairly straightforward.
One of the easiest tools for achieving this is pdftotext, which is provided by the poppler-utils package and can be installed and used on Ubuntu and its derivatives as follows (check the installation process for your system):
$ sudo apt-get install poppler-utils
$ pdftotext test.pdf 1.txt
Here, the first line installs the poppler-utils package, whereas the second calls the pdftotext application and directs it to convert test.pdf into plain text, saving the result under the name 1.txt. Note, again, the use of numbers for naming the plain text files, which will be useful for looping during future data manipulation.
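If you have a whole folder of PDFs, the same tool can be driven from Python using the subprocess trick described later in this post. A small sketch, assuming pdftotext is installed and the PDFs sit in the current directory:

import glob
import subprocess

# Convert every PDF in the current directory to a numbered plain text file
for count, pdf_path in enumerate(sorted(glob.glob("*.pdf")), start=1):
    subprocess.run(["pdftotext", pdf_path, "{}.txt".format(count)])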
3. Proprietary Formats like .docx
Microsoft Word being the most popular word processor of its kind, most legacy documents are created in Microsoft's proprietary formats such as .doc and .docx. These formats are not easy to work with for Natural Language Processing (NLP), so it becomes necessary to extract their content into a plain text file. LibreOffice can be used on the command line to convert documents for this purpose on the fly. Its usage is as follows:
$ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" 'Sample File.docx'
$ mv 'Sample File.txt' 1.txt
The first line uses LibreOffice's command-line mode (--headless) to extract all the text from the .docx file (‘Sample File.docx’) and saves it under the same name as the original file, replacing the suffix .docx with .txt (‘Sample File.txt’). The file then needs renaming to follow our numbering convention, which the second line, starting with mv, does for us (converting ‘Sample File.txt’ to ‘1.txt’ as desired). LibreOffice can handle a wide variety of file types, so it can be used for most common legacy formats. It is my hope that the world will soon move to XML-based open document standards such as .odt and .fodt for office documents.
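The same two steps, converting and then renaming to the numbering convention, can be scripted. A rough sketch, assuming LibreOffice is installed and the .docx files sit in the current directory:

import glob
import os
import subprocess

for count, docx_path in enumerate(sorted(glob.glob("*.docx")), start=1):
    # Convert to plain text; LibreOffice keeps the original base name
    subprocess.run(["libreoffice", "--headless", "--convert-to",
                    "txt:Text (encoded):UTF8", docx_path])
    txt_path = os.path.splitext(docx_path)[0] + ".txt"
    os.rename(txt_path, "{}.txt".format(count))  # apply the numbering convention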
Sometimes the data you need will be online but in PDF or DOC/DOCX format. In that case, you may have to combine the first method above with the second or third as applicable.
Running Bash Commands under Python
Before closing this section, I would also like to mention a neat trick for running Linux commands from within a Python script. It uses a Python package known as subprocess: you pass a list containing the command and its parameters to its run function. A simple ‘hello world’ example can be viewed below:
>>> import subprocess
>>> subprocess.run(['echo','hello','world!'])
hello world!
CompletedProcess(args=['echo', 'hello', 'world!'], returncode=0)
Here, echo is the shell command that prints its arguments, and hello and world! are the parameters passed to it. The final CompletedProcess line is simply the return value of run echoed by the interactive shell and can be ignored. The older subprocess.call function also works but is not recommended for new code.
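This is also how the earlier w3m step can be folded into a Python script instead of relying on shell redirection. A small sketch, assuming w3m is installed (capture_output and text require Python 3.7 or later):

import subprocess

# Render the page with w3m and capture its plain-text output
result = subprocess.run(["w3m", "https://www.google.com"],
                        capture_output=True, text=True)
with open("1.txt", "w") as output_file:
    output_file.write(result.stdout)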
Numerical Data
The extraction of numerical data is very similar to that of text data, though usually easier.
1. Web Extraction
Numeric data found online is usually pre-formatted into rows and columns. It is typically available as comma-separated values (CSV), tab-separated values (TSV) or in Microsoft formats (XLS, XLSX), while larger datasets may come in compressed formats like .ZIP, .RAR and .TAR.GZ. Web extraction usually does not require a web crawler, unless you are collecting a number of different files from various locations, which is uncommon. Numerical data, particularly time-series data, can be huge, so it is recommended that a command-line download tool such as cURL or wget be employed for this purpose.
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
or
$ curl -# -o iris.data https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Both the above commands download the Iris dataset from the UCI Machine Learning Repository into the same folder where the command was run.
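If you prefer to stay inside Python rather than shell out to wget or curl, the standard library can perform the same download. A minimal sketch:

from urllib.request import urlretrieve

url = ("https://archive.ics.uci.edu/ml/"
       "machine-learning-databases/iris/iris.data")
urlretrieve(url, "iris.data")  # saves the file in the current folder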
It is recommended that data be stored in CSV format to save disk space.
2. PDF Extraction
Sometimes numeric data can only be found in a PDF document instead of a spreadsheet. In that case, the simplest approach is usually to manually copy and paste each table into a spreadsheet. Sample tutorials for Microsoft Excel have been provided in this WallStreetMojo article (for LibreOffice, press CTRL+Shift+V to bring up the dialogue box and select the ‘Use text import dialog’ option to proceed). The method works reasonably well; however, if the columns contain space-separated text, the structure gets corrupted and is not identified accurately. This means you will have to go row by row fixing the errors, which is highly time-consuming. Still, this method may be faster and more accurate than entering all the data manually.
3. Other Formats like .xlsx
Formats like XLS and XLSX may not need to be converted, as most languages and data-analysis software can read such files directly. However, if you wish to save disk space, it is recommended that the spreadsheets be converted into CSV format. As mentioned earlier, LibreOffice is very capable of converting formats from the command line itself. The usage is as follows:
$ libreoffice --headless --convert-to csv ./iris.xlsx
The above converts the Iris dataset into CSV format and saves it in the same location. An output parameter can also be passed if you wish to save it in another location, as follows:
$ libreoffice --headless --convert-to csv --outdir ~/Documents ./iris.xlsx
This saves the CSV file in the ‘Documents’ folder. The headless mode of LibreOffice is excellent for use within scripts.
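For example, the same conversion can be triggered from inside a Python script via subprocess, which is handy when many spreadsheets need converting. A brief sketch, assuming iris.xlsx exists in the current directory:

import subprocess

# check=True raises an error if LibreOffice reports a failure
subprocess.run(["libreoffice", "--headless", "--convert-to", "csv",
                "--outdir", ".", "iris.xlsx"], check=True)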
Image courtesy: “File:Alambic à double fond pour la distillation des marcs de raisin.jpg” by El Bibliomata is licensed under CC BY 2.0