Introduction to Querying S3 Data with Teradata READ_NOS
Native Object Store (NOS) is a Teradata feature that enables data-in-place operations against a variety of external store types, including S3. Using built-in SQL functions and operators, end users can perform read/write operations. For example, the table operator READ_NOS can be used to read and search the contents of a store.
This article aims to:
• cover the Teradata activities required for accessing READ_NOS table operator
• provide an overview of a publicly accessible Teradata-supplied S3 store, which could be used to run code samples (when access to create an S3 store is not possible)
• describe the procedure for creating an AWS S3 data store
• present some examples of querying the store using READ_NOS
Teradata Setup Activities
All code samples presented throughout the article have been tested using a local instance of Teradata Vantage Express 17.10 for Virtual Box, which can be downloaded after signing up for a free account at https://downloads.teradata.com. Due to variations in operator syntax between Teradata versions, be sure to check the official documentation if running a version other than 17.10.
Once the local instance is running, an NOS user with appropriate access rights should be created by logging onto the RDBMS with username/password dbc/dbc and executing the statements outlined below.
NOS User
Create a Teradata username/password NOS_DEMO_USER/NOS_DEMO_USER, to be used when running code samples.
CREATE USER NOS_DEMO_USER FROM DBC AS PASSWORD=NOS_DEMO_USER, PERM=50e8;
Grant Access
Grant the new user permission to execute the READ_NOS operator.
GRANT EXECUTE FUNCTION on TD_SYSFNLIB.READ_NOS to NOS_DEMO_USER;
Teradata-supplied Sample Store
External stores can be classified as either public or private. Authentication is not required to access a public store; private stores, however, are secured and require credentials.
The S3 store supplied by Teradata is a public store containing sample river flow data sourced from USGS Surface-Water Data Sets.
Official documentation references the store as td-usgs-public.s3.amazonaws.com. This S3 virtual-hosted-style address takes the form bucket-name.s3.Region.amazonaws.com/key-name.
The omission of the Region from the store name reflects how the S3 service endpoint is interpreted when the Region is us-east-1, i.e. s3.amazonaws.com.
Understanding the store addressing style is important when working with Teradata NOS operators. This will become apparent when working through the examples.
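To make the addressing rule concrete, the sketch below (illustrative Python only; the function name is an assumption, not part of any Teradata or AWS tooling) builds the virtual-hosted-style host name for a bucket, dropping the Region segment when it is us-east-1:

```python
def s3_virtual_host(bucket: str, region: str) -> str:
    """Build an S3 virtual-hosted-style host name for a bucket.

    For us-east-1 the Region segment is omitted, which is why the
    Teradata store appears as td-usgs-public.s3.amazonaws.com rather
    than td-usgs-public.s3.us-east-1.amazonaws.com.
    """
    if region == "us-east-1":
        return f"{bucket}.s3.amazonaws.com"
    return f"{bucket}.s3.{region}.amazonaws.com"

# Teradata public store (us-east-1, so no Region segment)
print(s3_virtual_host("td-usgs-public", "us-east-1"))
# A bucket in any other region keeps the Region segment
print(s3_virtual_host("my-private-surface-water-data-store", "us-west-1"))
```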
Store Contents
Below is a visual representation of the store folder hierarchy and file naming convention:
----------------------------------------------
* Teradata public store
* Contains river flow data sourced
from USGS Surface-Water Data Sets
(https://waterdata.usgs.gov/nwis/sw).
----------------------------------------------
s3://td-usgs-public
└───
├── /<data-file-format>
│ ├── /<river-site-id>
│ │ └── /<yyyy>
│ │ ├── /<mm>
│ │ │ ├── /<dd>.<ext>
....
....
Files/folders listed above are named according to the descriptions below.
Top level folder: <data-file-format>
Common names are:
• CSVDATA: CSV files
• PARQUETDATA: Parquet files
• JSON: JSON files
Second level folder: <river-site-id>
• folder name is the 8 digit site identifier where river flow data was collected
• for example, a value of 09394500 represents LITTLE COLORADO RIVER AT WOODRUFF, AZ
• the USGS website (https://waterdata.usgs.gov/nwis/sw) contains an interactive map with river site details
Third level folder: <yyyy>
• folder name is a 4 digit data collection year
Fourth level folder: <mm>
• folder name is the 2 digit month of year when data was collected
Filename: <dd>.<ext>
• <dd> is the 2 digit day of month when data was collected
• <ext> is the file extension, which describes the content format; possible values are csv, parquet and json
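The naming convention above can be captured in a small helper. This is an illustrative Python sketch; the function and field names are assumptions chosen to mirror the hierarchy description, not part of any Teradata API:

```python
def parse_store_key(key: str) -> dict:
    """Split a store key such as 'CSVDATA/09380000/2018/06/27.csv'
    into the components described by the folder hierarchy above."""
    fmt, site_id, yyyy, mm, filename = key.split("/")
    dd, ext = filename.rsplit(".", 1)
    return {
        "data_file_format": fmt,   # e.g. CSVDATA, PARQUETDATA, JSON
        "river_site_id": site_id,  # 8 digit USGS site identifier
        "year": yyyy,              # 4 digit collection year
        "month": mm,               # 2 digit month
        "day": dd,                 # 2 digit day of month
        "ext": ext,                # csv, parquet or json
    }

print(parse_store_key("CSVDATA/09394500/2018/06/27.csv"))
```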
Authentication
Public stores can be accessed anonymously. From a Teradata-to-S3 authentication context, this translates to supplying empty values for AccessKeyId and SecretAccessKey:
AccessKeyId: ''
SecretAccessKey: ''
Inputs to Teradata READ_NOS when Working with Teradata Store
Summing up, the following store details will be required as input to READ_NOS.
Bucketname: td-usgs-public
S3 Endpoint: s3.amazonaws.com
AccessKeyId: ''
SecretAccessKey: ''
Creating an S3 Store
The steps below outline the general procedure for setting up an S3 private store.
Step 1 — Create Store
Below is an example for creating a store named s3://my-private-surface-water-data-store
in AWS region us-west-1
using the AWS CLI:
$ aws \
--profile aws_acct_admin \
--region us-west-1 \
s3 mb s3://my-private-surface-water-data-store
The region specified is us-west-1, which has a corresponding S3 service endpoint of s3.us-west-1.amazonaws.com. The significance of the S3 service endpoint will become apparent when running the NOS examples presented later on.
Step 2 — Create IAM User
In order to create credentials (an IAM access key) for accessing the store, an IAM user is required. Below is an example of creating a user with the friendly name my-private-surface-water-data-user:
$ aws \
--profile aws_acct_admin \
iam create-user --user-name my-private-surface-water-data-user
Output:
{
"User": {
"Path": "/",
"UserName": "my-private-surface-water-data-user",
"UserId": "uuuuuuuuuuuuuuu",
"Arn": "arn:aws:iam::xxxxxxxxxxxx:user/my-private-surface-water-data-user",
....
}
}
Step 3 — Credentials for IAM User
Generate an IAM access key (AccessKeyId/SecretAccessKey) for the new user:
$ aws \
--profile aws_acct_admin \
iam create-access-key --user-name my-private-surface-water-data-user
Output should contain the AccessKey details:
{
"AccessKey": {
...
"AccessKeyId": "AKIATZSC8AVA4PNC6VLC",
...
"SecretAccessKey": "qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw",
...
}
}
Step 4 — Grant IAM User Access to S3 Store
To execute read operations using the Teradata operator READ_NOS, the new user requires the following access to the store:
s3:GetObject
s3:ListBucket
s3:GetBucketLocation
Attach an inline policy named my-private-surface-water-data-policy to the user. The policy includes the above access:
$ aws \
--profile aws_acct_admin \
iam put-user-policy \
--user-name my-private-surface-water-data-user \
--policy-name my-private-surface-water-data-policy \
--policy-document \
'{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::my-private-surface-water-data-store"
},
{
"Sid": "Stmt2",
"Action": [
"s3:GetObject"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::my-private-surface-water-data-store/*"
}
]
}'
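The two policy statements follow a common S3 pattern: bucket-level actions (ListBucket, GetBucketLocation) target the bucket ARN, while object-level actions (GetObject) target the keys beneath it via `/*`. The sketch below generates the same policy document programmatically (illustrative Python using only the stdlib; the function name is an assumption):

```python
import json

def read_nos_policy(bucket: str) -> dict:
    """Build the minimal IAM policy READ_NOS needs for read access:
    bucket-level actions on the bucket ARN, object reads on its keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Stmt1",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Effect": "Allow",
                # bucket-level actions: no trailing /* on the resource
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "Stmt2",
                "Action": ["s3:GetObject"],
                "Effect": "Allow",
                # object-level actions: /* matches every key in the bucket
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }

print(json.dumps(read_nos_policy("my-private-surface-water-data-store"), indent=2))
```

The JSON printed above could be fed to `aws iam put-user-policy --policy-document` in place of the inline literal.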
Step 5 — Upload Sample CSV Data
Sample data from the Teradata public store is low volume (~12 MB) and serves as a good starting point for NOS discovery. Copy the contents over to the new bucket using the AWS CLI:
$ aws \
--profile aws_acct_admin \
s3 sync s3://td-usgs-public \
s3://my-private-surface-water-data-store
Step 6 — Note Down Private Store Details
The final details for the new store are summarised below. These will be required as inputs to the Teradata READ_NOS operator.
Bucketname: my-private-surface-water-data-store
S3 Endpoint: s3.us-west-1.amazonaws.com
AccessKeyId: AKIATZSC8AVA4PNC6VLC
SecretAccessKey: qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw
Teradata READ_NOS Operator
A brief introduction to the basic READ_NOS syntax elements follows, before moving on to practical usage examples.
READ_NOS Syntax
The basic syntax elements, followed by descriptions, are shown below. These should serve as a good starting point for exploring an S3 store.
A complete list of elements is included in the official SQL Operators and User Defined Functions — READ_NOS user guide.
SELECT *
FROM READ_NOS (
USING
LOCATION ('/connector/bucket.endpoint/[key_prefix]')
AUTHORIZATION (
'{
"Access_ID":"<ACCESS-KEY>"
,"Access_Key":"<SECRET-ACCESS-KEY>"
}'
)
RETURNTYPE ( { 'NOSREAD_RECORD' | 'NOSREAD_KEYS' } )
) AS D
;
LOCATION ('/connector/bucket.endpoint/[key_prefix]')
• connector — external storage system type identifier (for S3 data stores, the value is s3)
• bucket — S3 bucket name
• endpoint — S3 service endpoint based on the bucket location constraint (Region), as listed at S3 Service Endpoints. For example, buckets created in region us-east-1 have an endpoint of s3.amazonaws.com
• key_prefix — key/object qualifier
AUTHORIZATION ('{"Access_ID":"<ACCESS-KEY>", "Access_Key":"<SECRET-ACCESS-KEY>"}')
• IAM key details required to access the S3 bucket
• NOTE: for public buckets, <ACCESS-KEY> and <SECRET-ACCESS-KEY> are passed as empty strings
RETURNTYPE ( { 'NOSREAD_RECORD' | 'NOSREAD_KEYS' } )
• NOSREAD_KEYS — indicates that only records describing object paths/keys are returned in the result
• NOSREAD_RECORD — indicates that records are returned from the contents of the file(s)
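The LOCATION and AUTHORIZATION strings can be assembled mechanically from the store details. The sketch below is illustrative Python only (the helper names are assumptions, not a Teradata API); it mirrors the /connector/bucket.endpoint/[key_prefix] form and the empty-string convention for public buckets:

```python
import json

def nos_location(bucket: str, endpoint: str,
                 key_prefix: str = "", connector: str = "s3") -> str:
    """Assemble a READ_NOS LOCATION string:
    /connector/bucket.endpoint/[key_prefix]"""
    return f"/{connector}/{bucket}.{endpoint}/{key_prefix}"

def nos_authorization(access_id: str = "", access_key: str = "") -> str:
    """Assemble the AUTHORIZATION JSON; defaults model a public bucket,
    where both credentials are empty strings."""
    return json.dumps({"Access_ID": access_id, "Access_Key": access_key})

# Teradata-supplied public store (us-east-1 endpoint, anonymous access)
print(nos_location("td-usgs-public", "s3.amazonaws.com"))
print(nos_authorization())
# A private store in us-west-1 would instead pass its region endpoint
# and the IAM access key pair generated for the store's user.
```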
READ_NOS Examples
This section presents a few examples of querying the private store created in the section "Creating an S3 Store".
The inputs to the READ_NOS examples are as noted during sample private store creation. The values are listed again for convenience.
bucketname: my-private-surface-water-data-store
S3 service Endpoint: s3.us-west-1.amazonaws.com
AccessKeyId: AKIATZSC8AVA4PNC6VLC
SecretAccessKey: qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw
These will need to be substituted into the READ_NOS syntax expression described in the previous section:
connector: s3
bucket: my-private-surface-water-data-store
endpoint: s3.us-west-1.amazonaws.com
<YOUR-ACCESS-KEY>: AKIATZSC8AVA4PNC6VLC
<YOUR-SECRET-ACCESS-KEY>: qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw
Values have been substituted in examples that follow.
Alternatively, if using the Teradata-supplied public store, the values to be substituted are:
connector: s3
bucket: td-usgs-public
endpoint: s3.amazonaws.com
<YOUR-ACCESS-KEY>: ''
<YOUR-SECRET-ACCESS-KEY>: ''
Example 1: Listing Store Objects
Below, RETURNTYPE ('NOSREAD_KEYS') is used to return the TOP 5 objects/keys.
SELECT TOP 5 location
FROM
READ_NOS (
USING
LOCATION ('/s3/my-private-surface-water-data-store.s3.us-west-1.amazonaws.com/')
AUTHORIZATION ('{"Access_ID":"AKIATZSC8AVA4PNC6VLC" ,"Access_Key":"qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw"}')
RETURNTYPE ('NOSREAD_KEYS')
) AS D
ORDER BY 1;
Output:
Visual representation of the above would show the following folder structure:
s3://my-private-surface-water-data-store
└───
├── /CSVDATA
│ ├── /09380000
│ │ └── /2018
│ │ ├── /06
│ │ │ ├── /27.csv
│ │ │ ├── /28.csv
│ │ │ ├── /29.csv
│ │ │ ├── /30.csv
│ │ ├── /07
│ │ │ ├── /01.csv
...
...
The structure is derived from the Location values returned in the previous query.
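The folder view can be reconstructed from the returned Location values themselves. Below is an illustrative Python sketch (the function name is an assumption) that nests a flat list of keys into a dict-of-dicts tree like the one drawn above:

```python
def build_tree(locations: list) -> dict:
    """Nest a flat list of object keys into a dict-of-dicts folder tree.
    Each path segment becomes a key; leaves are empty dicts."""
    tree = {}
    for loc in locations:
        node = tree
        for part in loc.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree

# Keys shaped like the NOSREAD_KEYS Location values from Example 1
keys = [
    "CSVDATA/09380000/2018/06/27.csv",
    "CSVDATA/09380000/2018/06/28.csv",
    "CSVDATA/09380000/2018/07/01.csv",
]
print(build_tree(keys))
```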
Example 2: Get the Total Number of Store Objects
A count of objects/files can be retrieved using:
SELECT
COUNT(*) AS FILE_COUNT
FROM
(
SELECT *
FROM READ_NOS (
USING
LOCATION ( '/s3/my-private-surface-water-data-store.s3.us-west-1.amazonaws.com/' )
AUTHORIZATION ('{"Access_ID":"AKIATZSC8AVA4PNC6VLC"
,"Access_Key":"qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw"}')
RETURNTYPE ('NOSREAD_KEYS') ) AS D
) T;
Output:
Example 3: View Sample Records in a File
Recall the first Location observed in the result set from Example 1, which represents a file/object.
To retrieve the top 5 records from the file, use RETURNTYPE ('NOSREAD_RECORD') as follows:
SELECT TOP 5 *
FROM
READ_NOS (
USING
LOCATION ('/s3/my-private-surface-water-data-store.s3.us-west-1.amazonaws.com/CSVDATA/09380000/2018/06/27.csv')
AUTHORIZATION ('{"Access_ID":"AKIATZSC8AVA4PNC6VLC"
,"Access_Key":"qpGYpDMFiKgPczxEJgKgxFO9cUhiiuqJFESl61Gw"}')
RETURNTYPE ('NOSREAD_RECORD')
) AS D;
Output:
Note the following from the results:
- READ_NOS returned the result set including column names
- this is a direct result of the implicit value of the READ_NOS parameter HEADER being TRUE, combined with the presence of a header record within the file
Verifying READ_NOS Results
Using the AWS CLI, the output from Example 3 can be verified by piping the first 5 records of the file to stdout.
$ aws s3 \
--profile aws_acct_admin \
cp s3://my-private-surface-water-data-store/CSVDATA/09380000/2018/06/27.csv - \
| head -n5
Output:
Conclusion
In summary:
- NOS is a Teradata capability that allows end users to interact with an external data store, such as an S3 bucket
- stores can be set up as publicly accessible or secured
- READ_NOS is a Teradata operator that allows a user to run select statements against store contents
- when authenticating to a public S3 store using this operator, credentials are supplied as empty strings, i.e. Access_ID = '' and Access_Key = ''
- an example of a public S3 store supplied by Teradata can be found at s3://td-usgs-public
- if attempting to execute the READ_NOS examples in this article on a database version other than 17.10, be sure to check version-specific documentation