AWS S3: Simple Data Mining Techniques

Recently, I was involved in a piece of work to move some fairly large on-site database tables to AWS S3.

Part of the post-upload verification included reconciling record counts and visual inspection of sample data to ensure format was as expected.

Ideally, AWS Athena would have been the user-friendly way of achieving this, however, there were some organisational constraints on accessing additional AWS services, including Athena.

This article focuses on a few alternative methods that can be used to perform simple mining/verification of data in S3 buckets.

Methods Covered

AWS CLI command:

Examples of executing this command from different OS shells ( Windows Powershell, Command Prompt and Linux Bash) will also be included, with the aim of demonstrating some subtle differences that cover:

Prerequisites

Windows 10

Install the AWS CLI, which can be downloaded from here.

Some examples using Windows 10 make use of gzip. If you are interested in running these, then install Git Bash for Windows, which includes the binaries required.

Linux

Use the procedure for your specific Linux distribution to install the awscli tools. Ubuntu is used for examples in this article. Installation on Ubuntu can be accomplished via pip :

pip3 install awscli

After the AWS CLI is installed, configuration with AWS security credentials is required. Quick config instructions are provided in the AWS Documentation.

Creating the Test File (.gz)

A test file, used for the examples, was generated using Mockaroo.

The file contains 1001 mocked-up records (including header), with the fields shown below.

first_name last_name email gender ip_address

Fields are delimited and the file is saved as MOCK_DATA.txt

After saving, the file was gzipped with a final name MOCK_DATA.txt.gz. The final .gz can be downloaded from here.

Filed was upload to the target S3 bucket using:

aws s3 cp "MOCK_DATA.txt.gz" s3://test-bucket/

Input File Field Delimiters

Windows Command Prompt

Given that the fields in the test file are tab delimited, trying to enter a tab as part of command from the Windows Command Prompt triggers command Auto-Completion.

To get around this, some options are:

Read Delimiter from a File into a Variable

Create a text file fielddelim.txt containing a tab enclosed in double quotes

The contents of the file can be read and saved into a variable as follows:

c:\>set /P fielddelim=<fielddelim.txt

Variable fielddelim will resolve to a tab literal and can be referenced using %fielddelim% at the Windows Command prompt

Disable cmd.exe File and Directory Name Completion

At the command prompt, you can disable file and directory name completion by entering the following:

Powershell

In Powershell, a tab literal can be represented by `t

Linux Bash

In Bash, a tab can be echoed via echo -e t or \t

Continuation of Long Commands

For Windows Command, a long command can be continued onto the next line by entering ^ at the end of the line.

For Powershell, command can be continued by ` at the end of a line.

In Bash, a backslash \ is used at the end of a line for continuation.

AWS CLI: select-object-content

With the file now in at target s3 bucket s3://test-bucket/MOCK_DATA.txt.gz, we can start carrying some verification.

The AWS CLI provides the select-object-content command for running simple SQL against bucket objects.

Example 1 : Get Record Count

Windows Command Prompt

c:\>set /P fielddelim=<fielddelim.txt 
c:\>aws s3api select-object-content ^ --bucket test-bucket ^
--key MOCK_DATA.txt.gz ^ --expression-type SQL ^
--expression "select count(*) from s3object" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

Powershell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket ` --key MOCK_DATA.txt.gz `
--expression-type SQL ` --expression "select count(*)from s3object" ` --input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select count(*)from s3object" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

All the above return a record count of 1001.

Example 2 : Display 10 Records

Windows Command Prompt

c:\>set /P fielddelim=<fielddelim.txt 
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select * from s3object limit 10" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

Powershell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket ` --key MOCK_DATA.txt.gz `
--expression-type SQL ` --expression "select * from s3object limit 10" `
--input-serialization ` "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select * from s3object limit 10" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \ --output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \ /dev/stdout 2>&1

All above commands return:

Example 3: List first_name, email, gender where gender is equal to “Male” and first_name starts with “Ja”.

Windows Command Prompt

aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

Powershell Command

aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" `
--input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

The output of each of the above commands is identical:

Note: The fields referenced in the select statements are the corresponding references to the CSV columns in the input file:

_1 = first_name 
_2 = last_name
_3 = email
_4 = gender
_5 = ip_address

Further Reading

AWS Documentation

Limitations

The following is quoted from the AWS Documentation:

Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins

Other limits/requirements are mentioned in the documentation, which can be found at the links above.

Originally published at http://github.com.

Learner. Interests include Cloud and Devops technologies.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store