Teradata UIFs with Anaconda 3
Data Science Readiness
--
I’ve previously posted on the capabilities of Python and Teradata UIFs (User-Installed-Files). Up until now, the focus has been on working with UIFs using the pre-installed version of Python (2.6.9), which is the default version installed on the Teradata Node/s. That was the case when demonstrating examples using Teradata VMWare Express 16.20.
Python 2.6.9 is an “older” version with support and security patches likely to cease in the near future, so it makes sense to upgrade the Teradata Node/s to support Anaconda Python v3.7. This upgrade will make way for extending the capabilities of UIFs.
This article is an add-on/extension to my initial article and attempts to illustrate the process for upgrading the Teradata VMWare Express Edition (16.20) to support Anaconda 3.7.
Note, the upgrade process will be described for a SMP/Single Node install. Accomplishing this on a live Production system would require a slightly modified process. Most sites run MPP (Multiple Nodes) — not just single Node systems. Consult with your Teradata Site representative/engineer for further details.
The Host/Guest network still stands, i.e. the Teradata Node will not have Internet access. Files required will downloaded to the host before being copied to the Teradata Node.
Why Anaconda 3?
Anaconda 3 is an ideal Python distribution due to its increasing popularity in the Data Science space. The default installation comes with additional packages that aren’t available in the Official Python releases.
If you find that your favorite Python package is not installed by default, then you can simply add it using the respective Python package manager (conda or pip).
Sample Usage Scenarios for Python UIFs
Sample Scenario 1: Machine Learning Module (sklearn)
I want to use scikit-learn with my Teradata UIF, what are my options?
If we check out the Teradata Node, we can determine if sklearn is part of the existing Python 2.6.9 install:
TDExpress1620_Sles11:~ # pythonPython 2.6.9 (unknown, Mar 14 2018, 09:27:15)[GCC 4.3.4 [gcc-4_3-branch revision 152973]] on linux2
Type "help", "copyright", "credits" or "license" for more information.>>> import sklearnTraceback (most recent call last):
File "<stdin>", line 1, in <module>ImportError: No module named sklearn>>>
From the above, we can see that the module is not installed by default.
Our options are:
- Install a Python 2.6.9 compatible sklearn package
OR
- Upgrade to Anaconda 3, which already includes sklearn by default
Sample Scenario 2: Parsing URL & User-Agent Strings
I have raw URLs and User-Agent strings, stored in a Teradata table. I need to parse these using a Python UIF
Our options are:
- Install Python 2.6.9 compatible package that supports User-Agent parsing
OR
By default both Python 2.6.9 and Anaconda come pre-installed with packages that support URL parsing capabilities.
Preparing for the Upgrade to Anaconda 3
Enable Shared Folders Between Host and VM/Guest
We will need a way of sharing files between the Host and Guest. We can do this by enabling Shared Folders:
Configure Teradata 16.20 VM Express to support shared folders by:
- Selecting “Player” -> “Manage” -> “Virtual Machine Settings”
- Click the “Options” tab.
- Click on “Shared Folders” and change setting to “Enabled”
- Click on “Add” to set the location of the shared folder on your host.
The above settings show what’s required for adding host folder :
S:\Virtual_Machines\Teradata\TDExpress16.20_Sles11\share
as a shared folder between the host and guest OS.
Check Existing Environment on the Teradata Node
Find default Python/Pip binaries and their location.
TDExpress1620_Sles11:~ # which pip ; which python/usr/bin/pip
/usr/bin/python
The following shows that the paths are actually symbolic links.
TDExpress1620_Sles11:~ # ls -l /usr/bin/pip /usr/bin/python/usr/bin/pip -> /etc/alternatives/pip
/usr/bin/python -> python2.6
Our aim is to leave the existing underlying 2.6.9 binaries intact and update the symbolic links to point to the Anaconda 3, once the installation is complete.
Performing the Upgrade to Anaconda 3
On your host machine, download the latest Anaconda 3 Linux installer from https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh.
As of now, version 3.7 is the latest. Save the file into the shared folder that we enabled earlier on.
From within the VM, you’ll be able to access the installer contained in the shared folder.
By default, shares are mounted at the following location within the VM:
/mnt/hgfs
TDExpress1620_Sles11:~ # cd /mnt/hgfs
TDExpress1620_Sles11:/mnt/hgfs # ls
shareTDExpress1620_Sles11:/mnt/hgfs # cd shareTDExpress1620_Sles11:/mnt/hgfs/share # ls
Anaconda3–2019.10-Linux-x86_64.sh
Copy the Anaconda installer (Anaconda3–2019.10-Linux-x86_64.sh
) over to /root/apps
:
TDExpress1620_Sles11:/mnt/hgfs/share # cd /rootTDExpress1620_Sles11:~ # mkdir apps
TDExpress1620_Sles11:~ # cd appsTDExpress1620_Sles11:~/apps # cp /mnt/hgfs/share/Anaconda3–2019.10-Linux-x86_64.sh .
Run the installer and once prompted read and agree to the licence agreement.
TDExpress1620_Sles11:~/apps # ./Anaconda3–2019.10-Linux-x86_64.sh
Choose the target location for the installation as /usr/local/anaconda3
and choose “no” when prompted to initialize by running conda init
:
[/root/anaconda3] >>> /usr/local/anaconda3Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
[no] >>> no
Wait for the install to complete, and then check the target directory as shown below to confirm the 3.7 binaries have been installed:
TDExpress1620_Sles11:cd /usr/local/anaconda3/binTDExpress1620_Sles11:/usr/local/anaconda3/bin # ls pip python3.7
pip python3.7
Update symbolic link /usr/bin/python
to reference the Anaconda install:
TDExpress1620_Sles11~#: rm /usr/bin/python
TDExpress1620_Sles11:~ # ln -s /usr/local/anaconda3/bin/python3.7 /usr/bin/python
Update symbolic link /usr/bin/pip
to reference the Anaconda install:
TDExpress1620_Sles11:~ # rm /usr/bin/pipTDExpress1620_Sles11:~ # ln -s /usr/local/anaconda3/bin/pip /usr/bin/pip
Run the following to grant permissions for executing Anaconda .py scripts:
chmod -R 755 /usr/local/anaconda3
Check that default Python/Pip versions now reference the Anaconda install:
TDExpress1620_Sles11:~ # pythonPython 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type “help”, “copyright”, “credits” or “license” for more information.
>>>exit()TDExpress1620_Sles11:~ # pip --version
pip 19.2.3 from /usr/local/anaconda3/lib/python3.7/site-packages/pip (python 3.7)
Edit root
’ profile to include Anaconda binaries in the $PATH. This will allow us to use the conda
command for package maintenance.
/root/.bashrc:
export PATH=$PATH:/opt/teradata/jvm64/jdk8/bin:/usr/local/anaconda3/bin
Solution for “Sample Scenario 2: User-Agent UIF Parser”
Getting back to “Sample Scenario 2” User-Agent parsing discussed earlier.
To build our solution, we need to add Python package ua-parser
to the Anaconda package library.
Installing ua_parser using conda
Download ua_parser
(file name: ua-parser-0.8.0-py_0.tar.bz2
)
and save to the shared directory on the host.
Go back to the Guest VM, and install the package.
cd /mnt/hgfs/shareconda install ua-parser-0.8.0-py_0.tar.bz2 --use-local
After the install, check to ensure that the package can be imported using Anaconda:
TDExpress1620_Sles11:~ # pythonPython 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type “help”, “copyright”, “credits” or “license” for more information.
>>> import ua_parser
>>>
Generate URL/User-Agent Sample Test Data
The below image shows sample test data that we’ll use as input for testing our UIF. The data was generated at Mockaroo.
We can load the data into a test table MYPROJ.UA_URL_DATA
as follows:
--CREATE TABLE TO HOLD TEST DATACREATE SET TABLE MYPROJ.UA_URL_DATA
,NO FALLBACK
,NO BEFORE JOURNAL
,NO AFTER JOURNAL (
ID SMALLINT,
URL VARCHAR(400) CHARACTER SET LATIN NOT CASESPECIFIC,
USER_AGENT VARCHAR(400) CHARACTER SET LATIN NOT CASESPECIFIC
)
PRIMARY INDEX ( ID )
;--POPULATE WITH TEST DATAINSERT INTO MYPROJ.UA_URL_DATA
VALUES
(1,'http://wikimedia.org/interdum/mauris.aspx?maecenas=consectetuer&rhoncus=adipiscing'
,'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17');INSERT INTO MYPROJ.UA_URL_DATA
VALUES
(2,'http://trellian.com/phasellus/in/felis/donec/semper.png?id=enim'
,'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.04 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11');INSERT INTO MYPROJ.UA_URL_DATA
VALUES
(3,'https://columbia.edu/accumsan.html?maecenas=justo'
,'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36');INSERT INTO MYPROJ.UA_URL_DATA
VALUES
(4,'https://histats.com/augue/vestibulum/rutrum/rutrum/neque/aenean/auctor.png?rutrum=magnis'
,'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21');INSERT INTO MYPROJ.UA_URL_DATA
VALUES
(5,'http://cargocollective.com/quis/augue/luctus/tincidunt/nulla/mollis/molestie.jsp?ridiculus=a'
,'Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.699.0 Safari/534.24');
Our UIF will read this table, parse the URL, URL Query String, User-Agent and return these in JSON format. The expected output column formats are as follows:
Write Python URL/User-Agent Parser
A sample UIF Python script would look like this:
#!/usr/bin/pythonimport sys
from urllib.parse import parse_qs
from urllib.parse import urlparse
from ua_parser import user_agent_parser
import jsonfor rec in sys.stdin:rec = rec.strip().split("\t")raw_url = urlparse(rec[1])
raw_user_agent = rec[2]
url_elements = (raw_url.scheme \
, raw_url.netloc \
, raw_url.path \
, raw_url.params \
, raw_url.fragment \
, raw_url.username \
, raw_url.password \
, raw_url.hostname \
, raw_url.port)
qs = parse_qs(raw_url.query)print(rec[0] + "\t" \
+ json.dumps(url_elements) \
+ "\t" + json.dumps(user_agent_parser.Parse(raw_user_agent)) \
+ "\t" + json.dumps(qs))
Install the UIF
You can use BTEQ to install the UIF using the following commands:
Teradata BTEQ 16.20.00.07 (32-bit) for WIN32. PID: 3012
Copyright 1984-2018 Teradata. All rights reserved.
Enter your logon or BTEQ command:
.logon dbc;.logon dbc
Password:***DATABASE MYPROJ;SET SESSION SEARCHUIFDBPATH = MYPROJ;CALL SYSUIF.INSTALL_FILE('url_ua_parser','url_ua_parser.py',
'cz!url_ua_parser.py');*** Procedure has been executed.
*** Total elapsed time was 1 second.
Run the UIF Against Test Table
To test if the UIF works as expected against our test data, we can run a test as follows:
SET SESSION SEARCHUIFDBPATH = MYPROJ;SEL
SCR_OPER.id
,SCR_OPER.url
,SCR_OPER.user_agent
,SCR_OPER.url_qs
FROM SCRIPT (ON (
SEL ID, URL, USER_AGENT
FROM MYPROJ.UA_URL_DATA)
SCRIPT_COMMAND('./MYPROJ/url_ua_parser.py')
RETURNS ('id VARCHAR(10)'
,' url VARCHAR(400)'
,' user_agent VARCHAR(400)'
, ' url_qs VARCHAR(400)' )
) AS SCR_OPER
;
Sample output is shown below.
Note: Internally, the UIF formatted the last 3 columns as JSON. In our Result Set, they are returned as character fields.
Convert these fields to a Teradata JSON by casting as JSON data type, i.e
SEL
SCR_OPER.id
,cast(SCR_OPER.url as JSON)
,cast(SCR_OPER.user_agent as JSON)
,cast(SCR_OPER.url_qs as JSON)FROM SCRIPT (ON (
SEL ID, URL, USER_AGENT
FROM MYPROJ.UA_URL_DATA)
SCRIPT_COMMAND('./MYPROJ/url_ua_parser.py')
RETURNS ('id VARCHAR(10)' ,' url VARCHAR(400)'
,' user_agent VARCHAR(400)'
,' url_qs VARCHAR(400)' )
) AS SCR_OPER
Extracting Items from JSON Elements
Now that we have the URL, User-Agent and Query strings all parsed and stored cast as JSON, we can go ahead and extract what we need from the JSON schemas.
Let’s say that we wanted to return the following fields:
id, url protocol, url domain, browser family, os family.
Running this gives us the following output:
The JSONPath Request syntax, is described in the Teradata Documentation.
The JSONPath syntax specification can be found at
Another good resource for testing and gaining an understanding of the JSONPath elements can be found at http://jsonpath.com/