Hadoop FS commands are used for interacting with the Hadoop Distributed File System (HDFS) and performing various file and directory operations within a Hadoop ecosystem. These commands are essential for managing data stored in HDFS, enabling users to perform tasks such as listing, copying, moving, and deleting files and directories.
Key Features:
- File Management: Perform basic file operations such as uploading, downloading, and deleting files.
- Directory Operations: Create, delete, and list directories within HDFS.
- File Access: Check file permissions, view file contents, and monitor file status.
- Data Transfer: Move or copy files between HDFS and local file systems or between different HDFS locations.
- Utilities: Utilize commands for file size, storage statistics, and more.
These commands are crucial for managing and navigating data within the Hadoop ecosystem, providing the necessary tools for efficient data handling and processing in big data environments.
Hadoop HDFS Commands with Examples
Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers using simple programming models. It provides a cost-effective solution for organizations to store, manage, and analyze massive volumes of structured and unstructured data. Hadoop consists of two main components: the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing.
Hadoop commands are used to interact with the Hadoop ecosystem, primarily through its command-line interface. These commands facilitate various tasks such as managing files and directories in HDFS, transferring data between the local file system and HDFS, setting permissions, and executing MapReduce jobs. Some commonly used Hadoop commands include hadoop fs -ls for listing files in HDFS, hadoop fs -put for copying files to HDFS, hadoop fs -cat for displaying file contents, and hadoop jar for running MapReduce jobs.
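As a quick illustration of how these pieces fit together, here is a small sketch (paths are placeholders and wordcount.jar is a hypothetical job jar) that stages a local file in HDFS, inspects it, and runs a job on it; every command shown is covered in detail below.
hadoop fs -mkdir -p /user/<username>/input   # create a working directory in HDFS (-p creates parents)
hadoop fs -put sales.csv /user/<username>/input   # upload a local file
hadoop fs -ls /user/<username>/input   # confirm the upload
hadoop jar wordcount.jar WordCount /user/<username>/input /user/<username>/output   # run a MapReduce job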
Top 40 Hadoop FS Commands
1. Hadoop dfs -mkdir command
Purpose: Creates a directory in HDFS.
Example: Create a directory named data in the current user’s home directory:
hadoop fs -mkdir /user/<username>/data
Real-time Scenario: When starting a new data analysis project, use this command to establish a dedicated directory for intermediate or output data.
2. Hadoop dfs -put
Purpose: Uploads a local file to HDFS.
Example: Upload a file named sales.csv to the /user/<username>/data directory:
hadoop fs -put sales.csv /user/<username>/data/sales.csv
Real-time Scenario: Upload large datasets from your local machine to HDFS for distributed processing using MapReduce or Spark.
3. dfs -get
Purpose: Downloads a file from HDFS to the local file system.
Example: Download the file output.txt from the /user/<username>/results directory to your local machine:
hadoop fs -get /user/<username>/results/output.txt output.txt
Real-time Scenario: After processing data in Hadoop, use this command to retrieve results (e.g., visualizations or reports) to your local machine for further analysis or sharing.
4. dfs -ls
Purpose: Lists the contents of a directory in HDFS.
Example: List the files and subdirectories in the /user/<username>/data directory:
hadoop fs -ls /user/<username>/data
Real-time Scenario: When working with multiple files/directories, use this command to navigate the HDFS filesystem and ensure your data is in the expected locations.
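If you need to drill into nested directories or prefer readable sizes, -ls accepts additional flags; a small sketch with placeholder paths:
hadoop fs -ls -R /user/<username>/data   # recurse into subdirectories
hadoop fs -ls -h /user/<username>/data   # show file sizes in KB/MB/GB instead of raw bytes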
5. dfs -rm
Purpose: Deletes a file or directory from HDFS.
Caution: Unless the HDFS trash feature is enabled, deleted data cannot be recovered.
Example: Delete the file temp.txt from the /user/<username>/temp directory:
hadoop fs -rm /user/<username>/temp/temp.txt
Real-time Scenario: After processing or using temporary files, clean up HDFS storage by deleting unnecessary data to optimize resource utilization.
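Two commonly needed variants, shown as a brief sketch with placeholder paths: -r removes a directory and everything under it, and -skipTrash deletes immediately so the space is reclaimed right away.
hadoop fs -rm -r /user/<username>/temp   # delete a directory recursively
hadoop fs -rm -skipTrash /user/<username>/temp/temp.txt   # delete immediately, bypassing the trash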
6. fsck
Purpose: Checks the health of the HDFS file system for inconsistencies.
Example: Run a complete filesystem check from the root of HDFS (note that fsck is an hdfs command, not an fs subcommand):
hdfs fsck /
Real-time Scenario: Regularly schedule fsck checks to proactively detect and address data integrity issues that might arise in distributed environments.
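For a more detailed report, fsck accepts options that list the files, their blocks, and where those blocks live; a sketch against an assumed project directory:
hdfs fsck /user/<username>/data -files -blocks -locations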
7. cat
Purpose: Displays the contents of a file in HDFS.
Example: View the contents of the file data.txt located at /user/<username>/input:
hadoop fs -cat /user/<username>/input/data.txt
Real-time Scenario: During data exploration or debugging, use cat to examine file contents and verify data quality or formatting.
8. text
Purpose: Similar to cat, but decodes the file before displaying it, so compressed files and SequenceFiles appear as readable text.
Example: Use text to view a text file more legibly, avoiding unexpected formatting issues:
hadoop fs -text /user/<username>/data/log.txt
Real-time Scenario: When working with log files or human-readable data, text provides a clearer view of the contents compared to cat.
9. tail
Purpose: Displays the last kilobyte of a file in HDFS.
Example: View the end of the file access_log.txt (to read an exact number of lines, pipe hadoop fs -cat through the local tail utility instead):
hadoop fs -tail /user/<username>/logs/access_log.txt
Real-time Scenario: Check recent log entries or the end of a file to identify errors, monitor activity, or gather specific information.
10. head
Purpose: Displays the first kilobyte of a file in HDFS.
Syntax:
hadoop fs -head <file_path>
Note: there is no line-count option; -head always prints the first kilobyte of the file.
Real-time Example: Imagine you’re analyzing a large log file named access_log.txt stored in HDFS at /user/<username>/logs. You want to quickly check the initial entries to identify any potential errors or unusual activity.
Here’s how you can use the head command:
hadoop fs -head /user/<username>/logs/access_log.txt
Additional Notes:
- head always prints the first kilobyte; to see a specific number of lines, pipe the output of cat through the local head utility (see the sketch after this list).
- Use tail to view the last part of a file instead.
- Combine these commands with grep to filter specific information from the file.
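A minimal sketch of the piping approach mentioned above, reusing the same placeholder log path: the local head and tail utilities give the precise line counts that the HDFS -head and -tail commands do not.
hadoop fs -cat /user/<username>/logs/access_log.txt | head -n 20   # first 20 lines
hadoop fs -cat /user/<username>/logs/access_log.txt | tail -n 10   # last 10 lines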
11. df
Purpose: Displays the capacity, used space, and free space of the HDFS file system.
Example: Show the overall disk usage:
hadoop fs -df /
Real-time Scenario: Monitor HDFS capacity and optimize resource allocation by understanding storage consumption.
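The -h flag prints sizes in a human-readable form, which is usually easier to scan than raw byte counts; for example:
hadoop fs -df -h /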
12. count
Purpose: Counts the number of files, directories, and bytes in a directory.
Example: Count the elements in the /user/<username>/data directory:
hadoop fs -count /user/<username>/data
Real-time Scenario: Manage resources and data inventory by knowing the quantity of files and their sizes.
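When quotas are configured, -count can also report them, and -h again makes the sizes readable; a sketch with the same placeholder path:
hadoop fs -count -q -h /user/<username>/data   # adds quota and remaining-quota columns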
13. chmod
Purpose: Changes the permissions of a file or directory.
Example: Grant read, write, and execute permissions to everyone for the file config.xml:
hadoop fs -chmod 777 /user/<username>/config/config.xml
Real-time Scenario: Control access to data based on security requirements and user roles.
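chmod also accepts a recursive flag and symbolic modes, which are often safer than granting 777; a sketch with placeholder paths:
hadoop fs -chmod -R 750 /user/<username>/config   # recursively restrict access to owner and group
hadoop fs -chmod g+r /user/<username>/config/config.xml   # add group read permission only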
14. chown
Purpose: Changes the owner of a file or directory.
Example: Change the owner of the file analysis.py to the user data_scientist:
hadoop fs -chown data_scientist /user/<username>/analysis/analysis.py
Real-time Scenario: Assign ownership of data and resources to appropriate users or groups.
15. chgrp
Purpose: Changes the group of a file or directory.
Example: Change the group of the directory /user/<username>/shared to data_team:
hadoop fs -chgrp data_team /user/<username>/shared
Real-time Scenario: Manage group access and collaboration on data within your Hadoop cluster.
16. mv
Purpose: Moves or renames a file or directory.
Example: Move the file report.pdf to /user/<username>/reports:
hadoop fs -mv /user/<username>/temp/report.pdf /user/<username>/reports/report.pdf
Real-time Scenario: Organize data efficiently by moving or renaming files to dedicated locations.
17. cp
Purpose: Copies a file or directory within HDFS.
Example: Create a copy of the directory /user/<username>/data:
hadoop fs -cp /user/<username>/data /user/<username>/data_backup
Real-time Scenario: Back up data or create working copies for analysis or processing.
18. stat
Purpose: Displays the status information of a file or directory.
Example: View detailed attributes of the file /user/<username>/metadata.json:
hadoop fs -stat /user/<username>/metadata.json
Real-time Scenario: Gain insights into file properties like block size, replication factor, and timestamps for effective data management.
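By default -stat prints only the modification time; a format string exposes the other attributes mentioned above. A sketch using format specifiers supported by the FS shell, with the same placeholder path:
hadoop fs -stat "%n %o %r %u %y" /user/<username>/metadata.json   # name, block size, replication, owner, modification time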
19. touchz
Purpose: Creates an empty file in HDFS.
Example: Create a zero-byte file named marker.txt:
hadoop fs -touchz /user/<username>/markers/marker.txt
Real-time Scenario: Use this for placeholders or markers in workflows where empty files might be needed.
20. text (structured data)
Purpose: Displays the contents of a file in HDFS in a human-readable format. There is no separate -textformat command; the -text command from command 8 is also the tool for this job.
Example: View the structured data file customer_data.json in a more presentable way:
hadoop fs -text /user/<username>/customers/customer_data.json
Real-time Scenario: Make structured data formats like JSON or XML easier to interpret visually.
21. seq (generate numeric test data)
Purpose: Generates a sequence of numbers as input data. The FS shell has no built-in -seq command, so generate the file locally with the standard seq utility and upload it with -put.
Example: Create a file named numbers.txt containing numbers from 1 to 100 and upload it:
seq 1 100 > numbers.txt && hadoop fs -put numbers.txt /user/<username>/data/numbers.txt
Real-time Scenario: This is useful for creating test data, generating numerical inputs for MapReduce jobs, or creating sequence files for specific algorithms.
22. copyFromLocal
Purpose: Similar to -put, transfers a local file to HDFS; the source must reside on the local file system.
Example: Upload a file named data.csv to HDFS:
hadoop fs -copyFromLocal data.csv /user/<username>/data/data.csv
Real-time Scenario: This provides an alternative command name that makes the direction of the transfer explicit (note that the leading dash is required; hadoop fs put without it is not valid syntax).
23. copyToLocal
Purpose: Similar to -get, retrieves a file from HDFS to the local file system.
Example: Download the file results.txt from HDFS:
hadoop fs -copyToLocal /user/<username>/output/results.txt results.txt
Real-time Scenario: This offers an explicit way to download processed data or results back to your local machine for further analysis.
24. getmerge
Purpose: Merges the contents of a directory in HDFS into a single file in the local file system.
Example: Combine all files in /user/<username>/logs into a single local file merged_logs.txt:
hadoop fs -getmerge /user/<username>/logs merged_logs.txt
Real-time Scenario: This is helpful for consolidating log files or other datasets spread across multiple files into a single unit for easier processing.
25. checksum
Purpose: Calculates the checksum of a file in HDFS for data integrity verification.
Example: Get the checksum of the file /user/<username>/input/data.txt:
hadoop fs -checksum /user/<username>/input/data.txt
Real-time Scenario: This helps ensure data hasn’t been corrupted during storage or transfer, especially in large, distributed datasets.
26. expunge
Purpose: Permanently removes files that are sitting in the HDFS trash, reclaiming the space they occupy.
Caution: Use with caution, as deleted files are unrecoverable.
Example: Empty the trash for the current user:
hadoop fs -expunge
To delete a specific file such as /user/<username>/temp/old_data.txt immediately, bypassing the trash entirely, use -rm -skipTrash instead:
hadoop fs -rm -skipTrash /user/<username>/temp/old_data.txt
Real-time Scenario: This is useful for securely deleting sensitive data or reclaiming storage space when temporary files are no longer needed.
27. setrep
Purpose: Changes the replication factor of a file in HDFS, affecting how many copies are stored.
Example: Set the replication factor of the file /user/<username>/critical_data.txt to 3:
hadoop fs -setrep 3 /user/<username>/critical_data.txt
Real-time Scenario: This ensures redundancy and availability of crucial data by increasing its replication across the cluster.
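Two useful variations: -w waits until the re-replication actually completes, and pointing the command at a directory applies the change to every file under it. A sketch with placeholder paths:
hadoop fs -setrep -w 3 /user/<username>/critical_data.txt   # block until 3 replicas exist
hadoop fs -setrep 2 /user/<username>/archive   # apply to the whole directory tree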
28. distcp
Purpose: Efficiently copies data between different HDFS clusters.
Example: Copy the directory /user/<username>/data from cluster A to cluster B (note that distcp is a top-level hadoop command, not an fs subcommand):
hadoop distcp hdfs://clusterA/user/<username>/data hdfs://clusterB/user/<username>/data
Real-time Scenario: This is essential for data movement between Hadoop environments, distributing data for analysis, or sharing between clusters.
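distcp also supports incremental synchronization: -update copies only files that are missing or have changed at the target, while -overwrite forces a full rewrite. A sketch with the same placeholder cluster names:
hadoop distcp -update hdfs://clusterA/user/<username>/data hdfs://clusterB/user/<username>/data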
29. jar
Purpose: Runs Java applications on the Hadoop cluster.
Example: Execute the MyMainClass from the jar my_jar.jar:
hadoop jar my_jar.jar MyMainClass
Real-time Scenario: This enables utilizing custom Java code for data processing tasks within the Hadoop ecosystem.
30. job
Purpose: Submits and manages MapReduce jobs for parallel processing.
Example: Run a MapReduce job from the jar my_jar.jar with input /input and output /output:
hadoop jar my_jar.jar MyJob /input /output
Real-time Scenario: This forms the core of parallel data processing using Hadoop, orchestrating distributed tasks on large datasets.
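For the "manages" part of this command, running jobs can also be listed, inspected, and killed from the command line; a sketch with a placeholder job ID:
mapred job -list   # show running jobs
mapred job -status job_1700000000000_0001   # details for one job
mapred job -kill job_1700000000000_0001   # terminate a job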
31. dfs -mv
Purpose: Equivalent to mv, moves or renames a file or directory within HDFS.
Example: Move the file /user/<username>/old_file.txt to /user/<username>/new_file.txt:
hadoop fs -mv /user/<username>/old_file.txt /user/<username>/new_file.txt
Real-time Scenario: Similar to mv, organize data effectively by renaming or moving files/directories to their designated locations.
32. dfs -cp
Purpose: Equivalent to cp, copies a file or directory within HDFS.
Example: Create a duplicate of the /user/<username>/data directory:
hadoop fs -cp /user/<username>/data /user/<username>/data_backup
Real-time Scenario: Similar to cp, create backups or working copies of data for analysis or processing within the HDFS environment.
33. dfs -du
Purpose: Calculates the disk usage of a specific file or directory in HDFS.
Example: Check the disk space occupied by the directory /user/<username>/logs:
hadoop fs -du /user/<username>/logs
Real-time Scenario: Monitor storage consumption of individual files/directories to optimize resource allocation and identify potential space constraints.
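Two flags worth knowing: -s reports a single summarized total instead of per-entry sizes, and -h prints human-readable units. A sketch with the same placeholder path:
hadoop fs -du -s -h /user/<username>/logs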
34. dfs -stat
Purpose: Similar to stat, displays detailed information about a file or directory.
Example: View the owner, permissions, block size, and other attributes of the file /user/<username>/config/config.xml:
hadoop fs -stat /user/<username>/config/config.xml
Real-time Scenario: Gain comprehensive insights into file properties to understand data lineage, access control, and resource utilization.
35. appendToFile (merge local files)
Purpose: Combines multiple local files into a single file in HDFS. There is no -putmerge command in the FS shell; -appendToFile is the supported way to achieve this.
Example: Merge files part1.txt and part2.txt into merged_file.txt in HDFS:
hadoop fs -appendToFile part1.txt part2.txt /user/<username>/merged_file.txt
Real-time Scenario: This is an efficient way to upload multiple related files as a single unit, especially for processing or further analysis.
36. appendToFile (append to an existing file)
Purpose: Appends content to an existing file in HDFS (the FS shell spells this -appendToFile, not -append).
Example: Add the contents of new_data.txt to the end of existing_data.txt:
hadoop fs -appendToFile new_data.txt /user/<username>/existing_data.txt
Real-time Scenario: This is useful for updating or incrementally adding data to existing files without overwriting the original content.
37. dfs -tail -f
Purpose: Continuously follows the growth of a file and displays its tail in real-time.
Example: Monitor the access_log.txt file for new log entries as they appear:
hadoop fs -tail -f /user/<username>/logs/access_log.txt
Real-time Scenario: This is valuable for live troubleshooting, monitoring log files for errors, or observing updates to data streams.
38. dfs -put -p
Purpose: Uploads a local file to HDFS while preserving its timestamps, ownership, and permissions (the -p flag; -l is a different option that requests lazy persistence on the DataNode).
Example: Upload sensitive_data.txt to HDFS while maintaining its access control settings:
hadoop fs -put -p sensitive_data.txt /user/<username>/restricted/sensitive_data.txt
Real-time Scenario: This ensures consistent security settings when transferring data to HDFS, adhering to data governance and privacy policies.
39. dfs -lsr
Purpose: Lists the contents of a directory recursively, including subdirectories and their contents.
Example: Get a complete listing of all files and subdirectories within /user/<username>/projects (note that -lsr is deprecated; -ls -R is the current form):
hadoop fs -ls -R /user/<username>/projects
40. dfs -cp -f
Purpose: The dfs -cp -f command copies a file or directory between Hadoop-supported file systems and, because of the -f flag, overwrites the destination even if it already exists.
Syntax:
hadoop fs -cp -f <source_path> <destination_path>
-f: Flag indicating to force the copy even if the destination exists.
Real-time Example: Suppose you have a directory named local_data on your local machine containing updated reports. You want to overwrite the existing /user/<username>/reports directory in HDFS with this newer data, even if it currently contains different reports.
Because -cp works on file system URIs, reference the local source with a file:// path (for purely local sources, hadoop fs -put -f achieves the same result). Here’s how you can use the dfs -cp -f command:
hadoop fs -cp -f file:///path/to/local_data /user/<username>/reports
This command will:
- Copy the contents of the local_data directory recursively to the /user/<username>/reports directory in HDFS.
- Overwrite any existing files or directories within /user/<username>/reports with the files and directories from local_data.
Important Considerations:
- Use with caution: This command forcefully overwrites existing data. Double-check your source and destination paths to avoid accidental data loss.
- Alternatives: If you want to avoid overwriting, consider using dfs -cp without the -f flag, which will only copy if the destination doesn’t exist.
- Partial overwrites: If the source and destination have overlapping directory structures, only the conflicting files or directories will be overwritten.
Additional Notes:
- This command can be useful for updating existing datasets in HDFS with newer versions.
- It can also be used to create backups of important directories, knowing that the backup will overwrite any previous version.
- Use this command responsibly and always have a clear understanding of the source and destination paths.