AWS S3: Simple Data Mining Techniques

Recently, I was involved in a piece of work to move some fairly large on-premises database tables to AWS S3.

Part of the post-upload verification included reconciling record counts and visually inspecting sample data to ensure the format was as expected.

Ideally, AWS Athena would have been the user-friendly way of achieving this; however, there were organisational constraints on accessing additional AWS services, including Athena.

This article focuses on a few alternative methods that can be used to perform simple mining/verification of data in S3 buckets.

Methods Covered

  • aws s3api select-object-content

Examples of executing this command from different OS shells (Windows PowerShell, Command Prompt and Linux Bash) are also included, with the aim of demonstrating some subtle differences covering:

  • Use of non-displayable command parameters (such as tabs)
  • Piping commands and limitations

Prerequisites

Windows 10

Some examples using Windows 10 make use of gzip. If you are interested in running these, then install Git Bash for Windows, which includes the binaries required.

Linux

pip3 install awscli

After the AWS CLI is installed, configuration with AWS security credentials is required. Quick config instructions are provided in the AWS Documentation.
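The quickest route is the interactive aws configure command, which prompts for four values (the key values below are placeholders, not real credentials):

ubuntu:~$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: eu-west-2
Default output format [None]: json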

Creating the Test File (.gz)

The file contains 1001 mocked-up records (including the header), with the fields shown below.

first_name last_name email gender ip_address

Fields are tab-delimited and the file is saved as MOCK_DATA.txt

After saving, the file was gzipped, giving the final name MOCK_DATA.txt.gz. The final .gz can be downloaded from here.
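For anyone recreating the file, the compression step is a one-liner on Linux (or in Git Bash on Windows); gzip replaces the original file with its compressed counterpart:

ubuntu:~$ gzip MOCK_DATA.txt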

The file was uploaded to the target S3 bucket using:

aws s3 cp "MOCK_DATA.txt.gz" s3://test-bucket/
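A quick post-upload sanity check is to list the bucket and confirm the object and its size (test-bucket is the example bucket name used throughout this article):

aws s3 ls s3://test-bucket/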

Input File Field Delimiters

Windows Command Prompt

The test file is tab-delimited, and a literal tab is awkward to type at the Windows Command Prompt: pressing the Tab key triggers file and directory name completion rather than inserting a tab character. To get around this, some options are:

Read Delimiter from a File into a Variable

Create a text file fielddelim.txt containing a single tab character enclosed in double quotes.
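Any editor that preserves tab characters can be used to create the file. As a sketch, it can also be generated from PowerShell, where the `t escape avoids having to type a literal tab:

PS C:\>Set-Content -Path fielddelim.txt -Value "`"`t`"" -NoNewline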

The contents of the file can be read and saved into a variable as follows:

c:\>set /P fielddelim=<fielddelim.txt

The variable fielddelim resolves to the tab literal read from the file and can be referenced as %fielddelim% at the Windows Command Prompt.
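A quick sanity check; this should print the quoted tab read from the file:

c:\>echo %fielddelim%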

Disable cmd.exe File and Directory Name Completion

At the command prompt, you can disable file and directory name completion by entering the following:
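c:\>cmd /f:off

This starts a new Command Prompt session with completion switched off, so pressing the Tab key inserts a literal tab character into the command line instead of completing a file name.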

PowerShell

In PowerShell, a tab can be embedded directly in a double-quoted string using the escape sequence `t, as the commands later in this article demonstrate.

Linux Bash

In Bash, a tab can be produced with echo -e "\t"; the commands later in this article use command substitution, $(echo -e "'\t'"), to splice it into the serialization parameters.

Continuation of Long Commands

At the Windows Command Prompt, long commands are continued by ending a line with a caret ^.

In PowerShell, a command can be continued by placing a backtick ` at the end of a line.

In Bash, a backslash \ is used at the end of a line for continuation.

AWS CLI: select-object-content

The AWS CLI provides the select-object-content command (the CLI interface to Amazon S3 Select) for running simple SQL queries against bucket objects.

Example 1: Get Record Count

Command Prompt Command

c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select count(*) from s3object" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select count(*) from s3object" `
--input-serialization "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select count(*)from s3object" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

All of the above return a record count of 1001 (the Bash variant streams the result to the terminal via /dev/stdout rather than writing a results file).

Example 2: Display 10 Records

Command Prompt Command

c:\>set /P fielddelim=<fielddelim.txt 
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select * from s3object limit 10" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select * from s3object limit 10" `
--input-serialization "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select * from s3object limit 10" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \ --output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \ /dev/stdout 2>&1

All of the above commands return the first 10 records of the file.

Example 3: List first_name, email, gender where gender is equal to “Male” and first_name starts with “Ja”.

Command Prompt Command

aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" `
--input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

The output of each of the above commands is identical: the matching first_name, email and gender values, tab-delimited.

Note: The fields referenced in the select statements are positional references to the CSV columns in the input file:

_1 = first_name 
_2 = last_name
_3 = email
_4 = gender
_5 = ip_address
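As an aside, positional references can be avoided altogether: S3 Select can read the header row by setting FileHeaderInfo=USE in the input serialization, allowing columns to be referenced by name. A sketch of Example 3 rewritten this way (not run as part of the original exercise):

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select s.first_name, s.email, s.gender from s3object s where s.gender = 'Male' and s.first_name LIKE 'Ja%'" \
--input-serialization \
"CSV={FileHeaderInfo=USE,FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1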

Further Reading

AWS Documentation

Limitations

Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.

Other limits/requirements are mentioned in the documentation, which can be found at the links above.

Originally published at http://github.com.
