Quick guide to Bash commands for Big Data Analysis

In this post, we are going to explore some basic Bash/Linux commands which are very useful in data analysis. Bash is a command-line interpreter (shell) for the GNU operating system (a free, UNIX-like OS), and it typically runs in a terminal window. It reads the commands submitted by the user, interprets them, and asks the kernel to run the corresponding programs. If we want to execute a batch of Bash commands in one go, we can save them in a text file, conventionally with a .sh extension, and then run that file as a script.
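As a minimal sketch of the script idea above, the example below writes a tiny script and runs it. The file name report.sh and the temporary directory are hypothetical, chosen only for illustration.

```shell
# Hypothetical demo: create a small script in a temporary directory and run it.
workdir=$(mktemp -d)

# Write the script file; the quoted 'EOF' keeps $# from being expanded here.
cat > "$workdir/report.sh" <<'EOF'
#!/usr/bin/env bash
# Print a greeting followed by the number of arguments received.
echo "Hello from report.sh"
echo "Arguments: $#"
EOF

# Passing the file to bash runs it; no execute permission is needed this way.
script_output=$(bash "$workdir/report.sh" one two)
echo "$script_output"

# Clean up the temporary directory.
rm -r "$workdir"
```

Running the file through bash explicitly avoids having to mark it as executable first.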

Because Hadoop was developed primarily for Linux, production environments mostly use Linux-based machines. Therefore, in order to interact with Hadoop clusters, we must have a good understanding of Bash commands. Bash is the default command-line interpreter on most Linux-based machines and can be very useful for exploring text files located on a Hadoop cluster.

Let’s have a look at the basic bash commands which are useful for a Big Data developer or for a Data analyst.

Bash commands and their uses:

ls command

ls command is used to list the contents of a directory. For example:

ls /d/BashTest/


ls -l /d/BashTest/

-l option is used to list the contents of a directory in long format which also includes Unix file types, permissions, number of hard links, owner, group, size, last-modified date along with filename.


ls -lh /d/BashTest/

-lh option is used to list the contents of a directory and prints the sizes in a human readable format. (e.g. 10K, 100M, 1G, etc.)


ls -lS /d/BashTest/

-lS option is used to list the contents in long format, sorted by file size with the largest file first.
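The ls options above can also be combined. The sketch below uses a temporary directory and made-up file names (assumptions for illustration only) to show a long listing with human readable sizes, sorted by size.

```shell
# Hypothetical demo: three files of different sizes in a temporary directory.
workdir=$(mktemp -d)
printf 'a\n' > "$workdir/small.txt"            # 2 bytes
printf 'aaaaaaaaaa\n' > "$workdir/medium.txt"  # 11 bytes
printf '%0500d\n' 0 > "$workdir/large.txt"     # 501 bytes (500 zeros and a newline)

# Long format (-l), human readable sizes (-h), sorted by size (-S).
ls -lhS "$workdir"

# With -S alone, the first name listed is the largest file.
largest=$(ls -S "$workdir" | head -n 1)
echo "Largest file: $largest"

rm -r "$workdir"
```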

pwd command

This is an abbreviation for “print working directory”. It is used to get the full path name of the current working directory.

pwd


mkdir command

This is an abbreviation for “make directory” and it is used to create a new directory on the file system. Below command will create a new folder named “Test” at D:/BashTest location.

mkdir /d/BashTest/Test

mv command

mv command is used to rename a file or directory or to move it to a different location. If the destination is a new name in the same directory, the file is renamed. If the destination is a different directory, the file is moved there and keeps its name. The syntax is:

mv old_location new_location

So, if we have a file 01.txt located at D:\BashTest and need to move it to a new location at D:\BashTest\Test, we can use below command.

mv /d/BashTest/01.txt /d/BashTest/Test
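The difference between renaming and moving can be sketched as follows. The temporary directory and the file name 02.txt are assumptions made for this illustration.

```shell
# Hypothetical demo: rename a file, then move it into a subdirectory.
workdir=$(mktemp -d)
mkdir "$workdir/Test"
touch "$workdir/01.txt"

# Rename: source and destination are in the same directory.
mv "$workdir/01.txt" "$workdir/02.txt"

# Move: the destination is an existing directory, so the file keeps its name.
mv "$workdir/02.txt" "$workdir/Test/"

# Only the moved file should now be present in the Test directory.
moved_listing=$(ls "$workdir/Test")
echo "$moved_listing"

rm -r "$workdir"
```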

rm command

rm command is used to remove files or directories. If we need to delete a file, we can use rm command as:

rm /d/BashTest/Test/01.txt

-r option is used to remove directories and their contents recursively.

rm -r /d/BashTest/Test

-i option is used to delete files in interactive mode. Below command will remove all the text files from D:\BashTest\Test folder in an interactive mode.

rm -i /d/BashTest/Test/*.txt 

cd command

cd command is used to change the current working directory.

To change the current directory to the root directory:

cd /

To change the current directory to the parent directory:

cd ..

To change the current directory to the home directory:

cd ~

cp command

cp command is used to create a copy of files and directories.

cp /d/BashTest/TestFile.txt /d/BashTest/TestFolder/

-r option is used to copy directories recursively.

cp -r /d/BashTest/Test /d/BashTest/TestFolder/
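As a small sketch of a recursive copy, the example below copies a directory into another existing directory. All paths are created in a temporary directory and are hypothetical.

```shell
# Hypothetical demo: copy a directory and its contents into another directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/Test" "$workdir/TestFolder"   # -p creates missing parent directories
echo "sample" > "$workdir/Test/a.txt"

# Because TestFolder already exists, the copy ends up at TestFolder/Test.
cp -r "$workdir/Test" "$workdir/TestFolder/"

# The copy has the same contents, and the original is left in place.
copied=$(cat "$workdir/TestFolder/Test/a.txt")
original=$(cat "$workdir/Test/a.txt")
echo "copy: $copied, original: $original"

rm -r "$workdir"
```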

cat command

cat command is used to print the contents of a file to the standard output (the command line window). We can also use it to concatenate several text files or to append a file to an existing one.

cat /d/BashTest/TestFile.txt
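Besides printing a file, cat is often used to join files. The sketch below, using hypothetical files in a temporary directory, concatenates two files into a new one and then appends one more copy.

```shell
# Hypothetical demo: concatenate and append files with cat.
workdir=$(mktemp -d)
printf 'first\n' > "$workdir/a.txt"
printf 'second\n' > "$workdir/b.txt"

# Concatenate a.txt and b.txt into a new file.
cat "$workdir/a.txt" "$workdir/b.txt" > "$workdir/combined.txt"

# Append a.txt once more to the end of the combined file.
cat "$workdir/a.txt" >> "$workdir/combined.txt"

combined=$(cat "$workdir/combined.txt")
echo "$combined"

rm -r "$workdir"
```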


cut command

cut command is used to extract sections from each line of the input files, selected by fields, characters or bytes, and to write the result to the standard output. Fields are separated by a delimiter, which is a tab character by default.
So, if we have a pipe (|) delimited file (as displayed in the cat command's output) and we need to extract the first column from it, we can use the command below:

cut -d '|' -f1 /d/BashTest/TestFile.txt


Note: It will not change the original input file.
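The field selection described above can be tried on a small sample. The three-column, pipe-delimited content below is made up for illustration and is not the TestFile.txt used in this post.

```shell
# Hypothetical demo: extract columns from a pipe-delimited file.
workdir=$(mktemp -d)
printf '1|line number 1|alpha\n2|line number 2|beta\n3|line number 3|gamma\n' > "$workdir/sample.txt"

# First column only.
first_col=$(cut -d '|' -f1 "$workdir/sample.txt")
echo "$first_col"

# First and third columns; the delimiter is kept between selected fields.
two_cols=$(cut -d '|' -f1,3 "$workdir/sample.txt")
echo "$two_cols"

rm -r "$workdir"
```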

grep command

grep command is used to extract each line of the input files that matches the given regular expression pattern and to write it to the standard output. The name is an abbreviation for “global regular expression print”. So, if we need to extract all the lines which contain ‘line number 1000’, we can use the command below:

grep 'line number 1000' /d/BashTest/TestFile.txt


Note: It will not change the original input file.
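A few commonly used grep variations are sketched below on made-up sample data: counting matches with -c, ignoring case with -i, and anchoring the pattern to the end of the line with $.

```shell
# Hypothetical demo: counting, case-insensitive and anchored matches.
workdir=$(mktemp -d)
printf '1|line number 1\n2|Line Number 2\n3|line number 3\n10|line number 10\n' > "$workdir/sample.txt"

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c 'line number' "$workdir/sample.txt")

# -i ignores case, so "Line Number 2" matches as well.
count_ci=$(grep -i -c 'line number' "$workdir/sample.txt")

# $ anchors the pattern to the end of the line, which excludes "line number 10".
exact=$(grep 'line number 1$' "$workdir/sample.txt")

echo "case-sensitive: $count, case-insensitive: $count_ci, exact: $exact"

rm -r "$workdir"
```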

head command

head command is used to write the first lines of a text file to the standard output. By default, it outputs the first 10 lines of the input file. The syntax is:

head -n N file

where N is the required number of lines.

head -n 10 /d/BashTest/TestFile.txt

tail command

tail command is used to write the last lines of a text file to the standard output. By default, it outputs the last 10 lines of the input file. The syntax is:

tail -n N file

where N is the required number of lines.

tail -n 10 /d/BashTest/TestFile.txt
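head and tail can also be chained to pull a range of lines out of the middle of a file. The sketch below assumes a hypothetical 20-line file with one number per line.

```shell
# Hypothetical demo: first lines, last lines, and a range from the middle.
workdir=$(mktemp -d)
printf '%s\n' {1..20} > "$workdir/numbers.txt"   # one number per line, 1 to 20

first3=$(head -n 3 "$workdir/numbers.txt")
last2=$(tail -n 2 "$workdir/numbers.txt")

# Lines 5 to 7: take the first 7 lines, then keep the last 3 of those.
middle=$(head -n 7 "$workdir/numbers.txt" | tail -n 3)

echo "$first3"; echo "$last2"; echo "$middle"

rm -r "$workdir"
```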

touch command

touch command is used to update the last access and/or modification time of a file or directory. We can also use it to create an empty file.

touch /d/BashTest/TestFile.txt

If we want to create a new empty file, we can pass a file name that does not exist yet.

touch /d/BashTest/NewTestFile.txt

tr command

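tr command is used to translate (replace) or delete specific characters read from the standard input and to write the result to the standard output. Note that tr maps individual characters, not whole strings: each character in the first set is replaced by the character at the same position in the second set. To replace a whole string such as 'line number', a stream editor such as sed is the usual tool. So, if we want to replace every 'l' with 'L' and every 'n' with 'N' in the output of the cat command, we can use the command below:

cat /d/BashTest/TestFile.txt | tr 'ln' 'LN'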


-d option is used to delete characters instead of translating them. If we want to delete all the spaces in the output, we can chain the output of the cat command with the tr command as below:

cat /d/BashTest/TestFile.txt | tr -d ' '

Note: This command will not change the original input file.
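Because tr works on single characters rather than whole strings, it helps to compare it with a string replacement. The sketch below uses a made-up sample line, and the sed command is shown only as an assumption about what one would typically reach for when a whole string must be replaced; sed is not otherwise covered in this post.

```shell
# Hypothetical demo: character translation with tr versus string replacement with sed.
sample='line number 1'

# Character classes: convert every lowercase letter to uppercase.
upper=$(printf '%s\n' "$sample" | tr '[:lower:]' '[:upper:]')

# One-to-one mapping: every l becomes L and every n becomes N, wherever they occur.
mapped=$(printf '%s\n' "$sample" | tr 'ln' 'LN')

# -d deletes the listed characters, here all spaces.
nospace=$(printf '%s\n' "$sample" | tr -d ' ')

# sed replaces the whole string 'line number' instead of single characters.
replaced=$(printf '%s\n' "$sample" | sed 's/line number/Line Number/')

printf '%s\n' "$upper" "$mapped" "$nospace" "$replaced"
```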

wc command

wc command is used to print number of lines, words and bytes for each input file. It is an abbreviation for “word count”.

wc /d/BashTest/TestFile.txt


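-c option is used to print only the number of bytes. (Use the -m option for the number of characters, which can differ from the byte count for multi-byte text.)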

wc -c /d/BashTest/TestFile.txt

-l option is used to print only the number of lines.

wc -l /d/BashTest/TestFile.txt
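The counts can be checked on a tiny file whose contents are known. The sample below is hypothetical; reading the file through standard input makes wc print only the number, and tr -d ' ' strips the padding that some wc implementations add.

```shell
# Hypothetical demo: count lines, words and bytes of a small file.
workdir=$(mktemp -d)
printf 'one two\nthree\n' > "$workdir/sample.txt"   # 2 lines, 3 words, 14 bytes

lines=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$workdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample.txt" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"

rm -r "$workdir"
```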

sort command

sort command is used to sort the lines of a text file and to write the result to the standard output.

sort /d/BashTest/TestFile.txt

-r option is used to sort the output in the reverse order.

sort -r /d/BashTest/TestFile.txt

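-k option is used to sort the content by a key, given as a field (column) number. Fields are separated by whitespace by default, so for a pipe-delimited file we also pass the delimiter with the -t option. Here, we are sorting the file based on the first column only.

sort -t '|' -k 1,1 /d/BashTest/TestFile.txt

-n option is used to compare values numerically instead of as strings. Below, we are sorting the content based on the first column's numerical value and in reverse order.

sort -t '|' -k 1,1nr /d/BashTest/TestFile.txt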

Note: This command will not change the original input file.
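Numeric sorting on a single column can be sketched on a small pipe-delimited sample. The data below is made up, and the -t '|' delimiter is an assumption that matches a pipe-delimited file.

```shell
# Hypothetical demo: numeric sort on the first column of a pipe-delimited file.
workdir=$(mktemp -d)
printf '10|line number 10\n2|line number 2\n1|line number 1\n' > "$workdir/sample.txt"

# -t sets the field delimiter, -k 1,1 limits the key to the first field, n compares numerically.
ascending=$(sort -t '|' -k 1,1n "$workdir/sample.txt" | cut -d '|' -f1)

# Adding r to the key reverses the order.
descending=$(sort -t '|' -k 1,1nr "$workdir/sample.txt" | cut -d '|' -f1)

echo "$ascending"; echo "$descending"

rm -r "$workdir"
```

Without the n modifier the keys would be compared as strings, so 10 would be placed before 2.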

vim command

vim is a text editor whose name stands for “vi improved”. It can be used to create and edit text files directly in the terminal.

du command

du command is used to display the disk space used by particular files or directories on a file system.

du /d/BashTest/

-h option is used to get the file size in human readable format:

du -h /d/BashTest/


df command

df command is used to display the amount of disk space used and available on the mounted file systems.

df

-h option is used to display the sizes in human readable format:

df -h


man command

man command is used to display the manual pages of commands, for example: man ls. On a Windows machine, we can use the --help option to get the command documentation. For example: cd --help.

more command

more command is used to display the contents of a text file one screen at a time. We can chain more command to the output of other commands in order to display the results one screen at a time.

less command

less command is similar to more, but it has some extended capabilities, such as both forward and backward scrolling through the file.

ps command

ps command is used to get the information about the currently running processes with their process identification numbers.


top command

top command is used to display an ordered list of the running processes, sorted by user-specified criteria (CPU usage by default). The list is refreshed periodically.

kill command

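kill command is used to send a signal to a process, identified by its process ID. By default, it sends the TERM signal, which asks the process to terminate.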
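As a minimal sketch, the example below starts a throwaway sleep process in the background and terminates it with kill. The use of sleep as the target process is an assumption for illustration; in practice the process ID would usually come from ps or top.

```shell
# Hypothetical demo: start a long-running process and terminate it.
sleep 60 &          # run in the background
pid=$!              # $! holds the process ID of the last background command
echo "Started sleep with PID $pid"

# kill sends the TERM signal by default.
kill "$pid"

# wait returns the exit status of the process: 128 + 15 (TERM) = 143.
status=0
wait "$pid" || status=$?
echo "Exit status: $status"
```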

Bash special symbolic prompt operators for Big Data Analysis

Bash also provides some special symbolic prompt operators which can be very handy in data analysis.

pipe (|) prompt operator

| operator is used to pass the output of the first command as the input of the second command. It is very useful for command chaining.

cat /d/BashTest/TestFile.txt | tr 'ln' 'LN'
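Several of the commands covered in this post can be chained into one pipeline. The sketch below works on made-up data with a hypothetical status column: it keeps the rows containing 'error', extracts the first column, sorts it numerically and prints the two smallest values.

```shell
# Hypothetical demo: filter, extract, sort and limit in one pipeline.
workdir=$(mktemp -d)
printf '3|line number 3|ok\n1|line number 1|error\n2|line number 2|ok\n5|line number 5|error\n4|line number 4|error\n' > "$workdir/sample.txt"

# grep filters rows, cut extracts the first column, sort orders it, head limits the output.
smallest_errors=$(grep 'error' "$workdir/sample.txt" | cut -d '|' -f1 | sort -n | head -n 2)
echo "$smallest_errors"

rm -r "$workdir"
```

Each command reads the previous command's output, so no intermediate files are needed.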

double pipe (||) prompt operator

|| is used when we want to execute the second command only if the first command fails (exits with a non-zero status). The second command is never executed if the first command succeeds.

> prompt operator

> is used to redirect the standard output to a file. It overwrites the file if it already exists, or creates a new one otherwise. So, if we want to write the result of the ls command to a new text file, we can use:

ls /d/BashTest/ > /d/BashTest/NewTextFile.txt

>> prompt operator

>> is used to append the standard output to a file if it already exists, or to create a new one otherwise. So, if we want to append the output of the ls command to an existing text file named “NewTextFile.txt”, we can use:

ls /d/BashTest/ >> /d/BashTest/NewTextFile.txt
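The difference between overwriting and appending can be seen in a short sketch. The file name log.txt and the temporary directory are hypothetical.

```shell
# Hypothetical demo: > overwrites a file, >> appends to it.
workdir=$(mktemp -d)

echo "first" > "$workdir/log.txt"    # creates the file
echo "second" > "$workdir/log.txt"   # overwrites it, so "first" is lost
echo "third" >> "$workdir/log.txt"   # appends a new line at the end

content=$(cat "$workdir/log.txt")
echo "$content"

rm -r "$workdir"
```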

& prompt operator

& is used to run a process in the background.
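A minimal sketch of a background job follows. The grouped sleep and echo commands stand in for any long-running task, and the result file is hypothetical; wait is used here to block until the background job has finished.

```shell
# Hypothetical demo: run a task in the background and wait for it.
workdir=$(mktemp -d)

# The parentheses group the commands; & sends the whole group to the background.
( sleep 1; echo "background done" > "$workdir/result.txt" ) &
bg_pid=$!

echo "The foreground continues while PID $bg_pid is running"

# Block until the background job has finished, then read its result.
wait "$bg_pid"
result=$(cat "$workdir/result.txt")
echo "$result"

rm -r "$workdir"
```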

&& prompt operator

&& is used to execute the second command only if the execution of the first command succeeded.
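The && and || operators can be sketched with mkdir, which succeeds the first time and fails when the directory already exists. The directory name is hypothetical, and 2>/dev/null only hides the expected error message.

```shell
# Hypothetical demo: conditional execution with && and ||.
workdir=$(mktemp -d)
and_result="not run"
or_result="not run"

# The first mkdir succeeds, so the command after && runs.
mkdir "$workdir/out" && and_result="directory created"

# The second mkdir fails because the directory exists, so the command after || runs.
mkdir "$workdir/out" 2>/dev/null || or_result="fallback ran"

echo "$and_result / $or_result"

rm -r "$workdir"
```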

You can also refer to these links if you want to understand these commands in more detail.

https://ss64.com/bash/

https://www.gnu.org/software/bash/manual/bash.html

Thanks for reading. Please share your inputs in the comment section.
