AWS S3: Simple Data Mining Techniques
--
Recently, I was involved in a piece of work to move some fairly large on-site database tables to AWS S3.
Part of the post-upload verification included reconciling record counts and visually inspecting sample data to ensure the format was as expected.
Ideally, AWS Athena would have been the user-friendly way of achieving this; however, there were organisational constraints on accessing additional AWS services, including Athena.
This article focuses on a few alternative methods that can be used to perform simple mining/verification of data in S3 buckets.
Methods Covered
AWS CLI command:
- aws s3api select-object-content
Examples of executing this command from different OS shells (Windows PowerShell, Command Prompt and Linux Bash) will also be included, with the aim of demonstrating some subtle differences that cover:
- Use of non-displayable command parameters (such as tabs)
- Piping commands and limitations
Prerequisites
Windows 10
Install the AWS CLI, which can be downloaded from here.
Some examples using Windows 10 make use of gzip. If you are interested in running these, then install Git Bash for Windows, which includes the binaries required.
Linux
tools. Ubuntu is used for the examples in this article. Installation on Ubuntu can be accomplished via pip:
pip3 install awscli
After the AWS CLI is installed, configuration with AWS security credentials is required. Quick config instructions are provided in the AWS Documentation.
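For example, credentials and a default region can be set up interactively (the values below are placeholders, not real keys):
aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: eu-west-1
Default output format [None]: json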
Creating the Test File (.gz)
A test file, used for the examples, was generated using Mockaroo.
The file contains 1001 mocked-up records (including header), with the fields shown below.
first_name last_name email gender ip_address
Fields are tab-delimited and the file is saved as MOCK_DATA.txt
After saving, the file was gzipped, giving a final name of MOCK_DATA.txt.gz. The final .gz can be downloaded from here.
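For reference, the compression step is a plain gzip call (assuming gzip is available on the PATH), which replaces the original file with the compressed version:
ubuntu:~$ gzip MOCK_DATA.txt
ubuntu:~$ ls MOCK_DATA*
MOCK_DATA.txt.gz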
The file was uploaded to the target S3 bucket using:
aws s3 cp "MOCK_DATA.txt.gz" s3://test-bucket/
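If required, the upload can be confirmed by listing the bucket contents:
aws s3 ls s3://test-bucket/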
Input File Field Delimiters
Windows Command Prompt
Given that the fields in the test file are tab-delimited, trying to enter a tab as part of a command at the Windows Command Prompt triggers command auto-completion.
To get around this, some options are:
Read Delimiter from a File into a Variable
Create a text file fielddelim.txt containing a tab enclosed in double quotes.
The contents of the file can be read and saved into a variable as follows:
c:\>set /P fielddelim=<fielddelim.txt
The variable fielddelim will resolve to a tab literal and can be referenced using %fielddelim% at the Windows Command Prompt.
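A quick check that the variable holds the expected value; the output should be a tab wrapped in double quotes:
c:\>echo %fielddelim%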
Disable cmd.exe File and Directory Name Completion
At the command prompt, you can disable file and directory name completion by entering the following:
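One way of doing this (assuming default cmd.exe behaviour) is to start a new Command Prompt session with completion switched off:
c:\>cmd.exe /f:off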
PowerShell
In PowerShell, a tab literal can be represented by `t
Linux Bash
In Bash, a tab can be echoed via echo -e '\t' (the -e option enables interpretation of the \t escape sequence).
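As a quick check, piping through cat -A (GNU coreutils) shows the expanded tab as ^I, confirming that the escape sequence was interpreted:
ubuntu:~$ echo -e "a\tb" | cat -A
a^Ib$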
Continuation of Long Commands
For the Windows Command Prompt, a long command can be continued onto the next line by entering ^ at the end of the line.
For PowerShell, a command can be continued with a backtick (`) at the end of a line.
In Bash, a backslash \ is used at the end of a line for continuation.
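As a trivial Bash illustration, the following two lines are executed as a single command:
ubuntu:~$ aws s3 ls \
s3://test-bucket/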
AWS CLI: select-object-content
With the file now in the target S3 bucket at s3://test-bucket/MOCK_DATA.txt.gz, we can start carrying out some verification.
The AWS CLI provides the select-object-content command for running simple SQL against bucket objects.
Example 1: Get Record Count
Windows Command Prompt
c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select count(*) from s3object" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"
PowerShell Command
PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select count(*) from s3object" `
--input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}
Bash Command
ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select count(*)from s3object" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1
All of the above return a record count of 1001.
Example 2: Display 10 Records
Windows Command Prompt
c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select * from s3object limit 10" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"
PowerShell Command
PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select * from s3object limit 10" `
--input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}
Bash Command
ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select * from s3object limit 10" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \ --output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \ /dev/stdout 2>&1
All of the above commands return the same 10 records.
Example 3: List first_name, email, gender where gender is equal to “Male” and first_name starts with “Ja”.
Windows Command Prompt
aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"
PowerShell Command
aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" `
--input-serialization `
"CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}
Bash Command
ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select _1,_3,_4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1
The output of each of the above commands is identical.
Note: The fields referenced in the select statements (_1, _3, _4) are positional references to the CSV columns in the input file:
_1 = first_name
_2 = last_name
_3 = email
_4 = gender
_5 = ip_address
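As an aside, if positional references feel error-prone, S3 Select can also take column names from the header row by adding FileHeaderInfo=USE to the input serialization. The following Bash sketch (untested against this exact file) rewrites Example 3 using named columns:
ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select s.first_name, s.email, s.gender from s3object s where s.gender = 'Male' and s.first_name LIKE 'Ja%'" \
--input-serialization \
"CSV={FileHeaderInfo=USE,FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1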
Further Reading
AWS Documentation
Limitations
The following is quoted from the AWS Documentation:
Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins
Other limits/requirements are mentioned in the documentation, which can be found at the links above.
Originally published at http://github.com.