Hadoop 3.x ships with native support for using Microsoft Azure Data Lake Store as a file system alongside HDFS.
In this blog, we will discuss how to integrate Azure Data Lake Store with Hadoop. Before reading this blog, we recommend going through our previous blogs in this series:
Introduction to Microsoft Azure
Introduction to Azure Data Lake Store
Hadoop 3.x installation guide
Azure Data Lake uses OAuth 2.0 to authenticate requests. For that purpose, you need to create a user (service principal) in Azure Active Directory and grant it access to your Data Lake Store.
OAuth2 Support
Every request to Azure Data Lake Store must carry an OAuth 2.0 bearer token in its HTTPS headers, as required by the OAuth 2.0 specification. A valid bearer token is obtained from Azure Active Directory for users who have access to the Azure Data Lake Store account.
Azure Active Directory (Azure AD) is Microsoft's multi-tenant, cloud-based directory and identity management service.
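As a rough sketch of what happens under the hood, a bearer token can be requested from Azure AD with the client-credentials grant. The tenant ID, client ID, and client secret below are placeholders for the values you will generate in the next section; the Hadoop connector performs this exchange for you automatically once configured.

```shell
# Placeholder values -- substitute the ones generated for your service principal.
TENANT_ID="your-tenant-id"
CLIENT_ID="your-client-id"
CLIENT_SECRET="your-client-secret"

# Request an OAuth 2.0 bearer token from Azure AD (client-credentials grant).
curl -s -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=${CLIENT_ID}" \
  -d "client_secret=${CLIENT_SECRET}" \
  -d "resource=https://datalake.azure.net/"
```

The JSON response contains an `access_token` field, which is the bearer token attached to subsequent requests.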
Creating a service principal using Azure Active Directory
- Open your Azure portal and click on Azure Active Directory
So, to summarize, here is how what you have generated so far maps to the Hadoop 3 configuration:
- Application ID → Client ID
- OAuth 2.0 token endpoint → OAuth 2.0 refresh URL
- Key value → OAuth 2.0 credential (client secret)
Now add the following properties to your core-site.xml for the changes to take effect.
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>YOUR TOKEN ENDPOINT</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR CLIENT ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR CLIENT SECRET</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
After adding these properties, save and close the file. Now open your hadoop-env.sh file and add the Hadoop tools library to the classpath (the Azure support classes ship in the Hadoop tools library):
export HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*
After adding the class path, save and close the hadoop-env.sh file
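To sanity-check the setup, you can confirm that the tools jars are on the classpath and that the new properties are visible to Hadoop. A minimal check, assuming `$HADOOP_HOME` is set and the commands run on the configured node, might look like:

```shell
# List classpath entries and confirm the tools library is included.
hadoop classpath | tr ':' '\n' | grep 'tools/lib'

# Confirm a core-site.xml property was picked up; this should print your client ID.
hdfs getconf -confKey dfs.adls.oauth2.client.id
```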
Now, without starting the HDFS daemons, you can interact with your Azure Data Lake Store directly. Here is the data present in my ADL storage.
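For example, the store can be browsed with the usual `hadoop fs` commands over the `adl://` scheme. The account name below is a placeholder; replace it with your own Data Lake Store account.

```shell
# Placeholder account name -- replace youradlsaccount with your own store.
hadoop fs -ls adl://youradlsaccount.azuredatalakestore.net/

# Copy a local file into the store and read it back.
hadoop fs -put localfile.txt adl://youradlsaccount.azuredatalakestore.net/
hadoop fs -cat adl://youradlsaccount.azuredatalakestore.net/localfile.txt
```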