
Big Data Python

Python and Big Data are being widely adopted together by the IT industry for several reasons.


In this blog, we will see how to establish a connection between Python and a big data (Hadoop) environment.

Note: Here are the tool versions we will be using to establish the connection:

CentOS: 6.9

Hadoop: 2.6

Python: 3.6.9

pip: 3

The system runs Linux and must have Hadoop installed.

Most Linux distributions come with Python pre-installed.

To install Hadoop, you can visit https://acadgild.com/blog/key-configurations-in-hadoop-installation.

My system already has Hadoop 2.6 and Python 2.6.

Below are the commands we need to run to install the Python and pip versions mentioned above. First, re-point the default python symlink at the existing Python 2.6, then build Python 3.6.9 from source:

sudo rm -f /usr/bin/python
sudo ln -s /usr/bin/python2.6 /usr/bin/python

cd /usr/local/bin/

sudo yum install -y gcc openssl-devel bzip2-devel sqlite-devel zlib-devel
wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
tar xzf Python-3.6.9.tgz

cd Python-3.6.9
./configure --enable-optimizations
sudo make altinstall
#altinstall avoids overwriting the system python binary

ls /usr/local/bin/
#check that python3.6 is present

curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
/usr/local/bin/python3.6 get-pip.py

ls /usr/local/bin/
#check that pip3.6 is present


cd
vi ~/.bashrc

Add the following aliases at the end of the file:

alias python='/usr/local/bin/python3.6'
alias pip='/usr/local/bin/pip3.6'

Save and quit with :wq

source .bashrc


python -V

pip -V

Once the versions are verified, we can connect Python to HDFS. First install the hdfs client library:

pip install hdfs

Now open the Python shell:

from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')  # 50070 is the default NameNode WebHDFS port in Hadoop 2.x

#The client is now set up. Below are a few commands to read and write to HDFS.

content = client.content('/')
content
#in Python 2, string values appear with a u'' (unicode) prefix

for k,v in content.items():
    print("{} = {}".format(k,v))
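The content dictionary reports sizes (keys such as length and spaceConsumed) in raw bytes. A small helper can make those values readable; this function is purely illustrative and not part of the hdfs library:

```python
def human_readable(num_bytes):
    """Convert a raw byte count into a human-readable string."""
    for unit in ('B', 'KB', 'MB', 'GB', 'TB'):
        if num_bytes < 1024.0:
            return "{:.1f} {}".format(num_bytes, unit)
        num_bytes /= 1024.0
    return "{:.1f} PB".format(num_bytes)

# Example with a dict shaped like the one client.content() returns
sample_content = {'length': 5242880, 'fileCount': 3, 'directoryCount': 2}
print(human_readable(sample_content['length']))  # 5.0 MB
```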

fnames = client.list('/')
fnames

status = client.status('/')
status
for k,v in status.items():
    print("{} = {}".format(k,v))
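The modificationTime and accessTime fields in the status dictionary are epoch timestamps in milliseconds, as WebHDFS returns them. Assuming that format, they can be converted to readable dates with the standard library:

```python
from datetime import datetime

def to_datetime(epoch_ms):
    """Convert a WebHDFS epoch-milliseconds timestamp to a UTC datetime."""
    return datetime.utcfromtimestamp(epoch_ms / 1000.0)

# Example with a timestamp like the ones in a status dict
print(to_datetime(1500000000000))  # 2017-07-14 02:40:00
```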

#Rename (move) a file; both paths below are placeholders
client.rename('/user/data/old_name', '/user/data/new_name')

#Download from HDFS to the local file system
client.download('/user/data/','/home/acadgild/Desktop/',n_threads=5)


#Upload from the local file system to HDFS (note: the HDFS path comes first)
client.upload('/user/data/upload/Salary_Data.csv','/home/acadgild/Desktop/Salary_Data.csv',n_threads=5)

#Filter lines from a local file and write the result to HDFS
r_file = open('/home/acadgild/Desktop/Salary_Data.csv','r')
temp = []
for line in r_file:
    if line.startswith('p'):
        temp.append(line)
r_file.close()
client.write('/user/filtered_data.csv', data='\n'.join(temp))  # the target must be a file path; the name here is illustrative
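The filtering step above is easier to test if it is pulled out into a small function. This is just a refactoring sketch; client.write works the same either way:

```python
def filter_lines(lines, prefix):
    """Keep only the lines that start with the given prefix, stripping newlines."""
    return [line.rstrip('\n') for line in lines if line.startswith(prefix)]

# Sample lines standing in for the contents of a local file
lines = ['python,1\n', 'java,2\n', 'pig,3\n']
filtered = filter_lines(lines, 'p')
print(filtered)  # ['python,1', 'pig,3']
# The joined result can then be handed to client.write, e.g.:
# client.write('/user/filtered_data.csv', data='\n'.join(filtered))
```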

client.delete('/user/data/',recursive=True)

Dataset link:

https://acadgildsite.s3.amazonaws.com/wordpress_images/datasets/big_data_python/ratings.csv
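Once a dataset like ratings.csv has been pulled down with client.download, it can be processed with plain Python. A minimal sketch, assuming a userId,movieId,rating,timestamp layout (the column names are an assumption; check the actual header of the file):

```python
import csv
import io

# Inline sample standing in for a downloaded ratings.csv
sample = ("userId,movieId,rating,timestamp\n"
          "1,10,4.0,964982703\n"
          "2,10,3.0,964981247\n")

reader = csv.DictReader(io.StringIO(sample))
ratings = [float(row['rating']) for row in reader]
print(sum(ratings) / len(ratings))  # 3.5
```

For the real file, replace io.StringIO(sample) with open('/home/acadgild/Desktop/ratings.csv').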

This is how we connect to HDFS using Python. I hope you liked this blog. Do leave us a comment for any queries or suggestions.

Keep visiting our website for more blogs.

Prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.

