
40 Must-Know Hadoop Commands and How to Use Them

Hadoop HDFS Commands with Examples

Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers using simple programming models. It provides a cost-effective solution for organizations to store, manage, and analyze massive volumes of structured and unstructured data. Hadoop consists of two main components: the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing.

Hadoop commands are used to interact with the Hadoop ecosystem, primarily through its command-line interface. These commands facilitate various tasks such as managing files and directories in HDFS, transferring data between the local file system and HDFS, setting permissions, and executing MapReduce jobs. Some commonly used Hadoop commands include hadoop fs -ls for listing files in HDFS, hadoop fs -put for copying files to HDFS, hadoop fs -cat for displaying file contents, and hadoop jar for running MapReduce jobs.

1. dfs -mkdir

Purpose: Creates a directory in HDFS.
Example: Create a directory named data in the current user’s home directory:

hadoop fs -mkdir /user/<username>/data

Real-time Scenario: When starting a new data analysis project, use this command to establish a dedicated directory for intermediate or output data.

2. dfs -put

Purpose: Uploads a local file to HDFS.
Example: Upload a file named sales.csv to the /user/<username>/data directory:

hadoop fs -put sales.csv /user/<username>/data/sales.csv

Real-time Scenario: Upload large datasets from your local machine to HDFS for distributed processing using MapReduce or Spark.

3. dfs -get

Purpose: Downloads a file from HDFS to the local file system.
Example: Download the file output.txt from the /user/<username>/results directory to your local machine:

hadoop fs -get /user/<username>/results/output.txt output.txt

Real-time Scenario: After processing data in Hadoop, use this command to retrieve results (e.g., visualizations or reports) to your local machine for further analysis or sharing.

4. dfs -ls

Purpose: Lists the contents of a directory in HDFS.
Example: List the files and subdirectories in the /user/<username>/data directory:

hadoop fs -ls /user/<username>/data

Real-time Scenario: When working with multiple files/directories, use this command to navigate the HDFS filesystem and ensure your data is in the expected locations.

5. dfs -rm

Purpose: Deletes a file from HDFS (add -r to delete a directory and its contents).
Caution: Use with caution; unless the HDFS trash feature is enabled, deleted data cannot be recovered.
Example: Delete the file temp.txt from the /user/<username>/temp directory:

hadoop fs -rm /user/<username>/temp/temp.txt

Real-time Scenario: After processing or using temporary files, clean up HDFS storage by deleting unnecessary data to optimize resource utilization.

6. fsck

Purpose: Checks the health of the HDFS file system, reporting missing, corrupt, or under-replicated blocks.
Example: Run a complete filesystem check from the root (note that fsck is an hdfs command, not an fs subcommand):

hdfs fsck /

Real-time Scenario: Regularly schedule fsck checks to proactively detect and address data integrity issues that might arise in distributed environments.
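As one way to implement such a schedule (the timing, binary path, and log location are assumptions, not prescriptions), a crontab entry on a node with the hdfs client installed could look like:

```
# Hypothetical crontab entry: run a full HDFS health check at 02:00
# daily and append the report to a log file for later review.
0 2 * * * /usr/bin/hdfs fsck / >> /var/log/hdfs-fsck-report.log 2>&1
```

Comparing successive reports makes it easy to pin down when corrupt or under-replicated blocks first appeared.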

7. cat

Purpose: Displays the contents of a file in HDFS.
Example: View the contents of the file data.txt located at /user/<username>/input:

hadoop fs -cat /user/<username>/input/data.txt

Real-time Scenario: During data exploration or debugging, use cat to examine file contents and verify data quality or formatting.

8. text

Purpose: Like cat, prints a file’s contents, but first decodes it: text renders compressed files and SequenceFiles as readable text.
Example: Use text to view a text file more legibly, avoiding unexpected formatting issues:

hadoop fs -text /user/<username>/data/log.txt

Real-time Scenario: When working with compressed logs or SequenceFile data, text provides a readable view of contents that cat would display as raw bytes.

9. tail

Purpose: Displays the last kilobyte of a file in HDFS. Unlike the Unix tail, it takes no line-count option.
Example: View the end of the file access_log.txt:

hadoop fs -tail /user/<username>/logs/access_log.txt

Real-time Scenario: Check recent log entries or the end of a file to identify errors, monitor activity, or gather specific information.

10. head

Purpose: Displays the first kilobyte of a file in HDFS. Like tail, it takes no line-count option, so to see an exact number of lines, pipe cat through the local head utility.
Example: View the beginning of the file sales_report.csv:

hadoop fs -head /user/<username>/data/sales_report.csv

Real-time Example: Imagine you’re analyzing a large log file named access_log.txt stored in HDFS at /user/<username>/logs. You want to quickly check the initial entries to identify any potential errors or unusual activity. To view exactly 10 lines, combine cat with the local head:

hadoop fs -cat /user/<username>/logs/access_log.txt | head -n 10

Additional Notes:

  • The -head subcommand is only available in recent Hadoop releases; on older clusters, use the cat pipeline shown above.
  • Use tail to view the end of a file instead.
  • Combine cat with filters like grep to extract specific information from the file.
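The cat-plus-pipe pattern extends to any local filter. A purely local sketch of the idea (the sample file is hypothetical; against HDFS you would replace the plain file reads with hadoop fs -cat):

```shell
# Build a small sample log locally, then slice it the way you would
# an HDFS file.
seq 1 50 | sed 's/^/line /' > sample_log.txt

first_three=$(head -n 3 sample_log.txt)   # first N lines, cf. fs -head
last_one=$(tail -n 1 sample_log.txt)      # last line, cf. fs -tail
echo "$first_three"
echo "$last_one"                          # prints "line 50"
```

The same pipeline works unchanged for grep, awk, or any other filter, which is why cat is the workhorse for ad-hoc inspection of HDFS files.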

11. df

Purpose: Displays the disk usage of the HDFS file system.
Example: Show the overall capacity, used, and available space (the -h flag prints human-readable sizes):

hadoop fs -df -h /

Real-time Scenario: Monitor HDFS capacity and optimize resource allocation by understanding storage consumption.

12. count

Purpose: Counts the number of files, directories, and bytes in a directory.
Example: Count the elements in the /user/<username>/data directory:

hadoop fs -count /user/<username>/data

Real-time Scenario: Manage resources and data inventory by knowing the quantity of files and their sizes.

13. chmod

Purpose: Changes the permissions of a file or directory.
Example: Grant read, write, and execute permissions to everyone for the file config.xml:

hadoop fs -chmod 777 /user/<username>/config/config.xml

Real-time Scenario: Control access to data based on security requirements and user roles.
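The octal modes passed to chmod encode owner, group, and other permissions as sums of read=4, write=2, and execute=1. A minimal shell sketch of the decoding (the mode value is just an example):

```shell
# Decode an octal mode such as 640: each digit is a sum of
# read=4, write=2, execute=1 for owner, group, and others.
mode=640
owner=$(( mode / 100 ))        # 6 = 4(r) + 2(w): owner may read and write
group=$(( mode / 10 % 10 ))    # 4 = 4(r):        group may only read
other=$(( mode % 10 ))         # 0:               others have no access
echo "owner=$owner group=$group other=$other"
```

So 777 grants full access to everyone; for most shared datasets a tighter mode such as 640 or 750 is the safer default.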

14. chown

Purpose: Changes the owner of a file or directory (ownership changes require HDFS superuser privileges).
Example: Change the owner of the file analysis.py to user data_scientist:

hadoop fs -chown data_scientist /user/<username>/analysis/analysis.py

Real-time Scenario: Assign ownership of data and resources to appropriate users or groups.

15. chgrp

Purpose: Changes the group of a file or directory.
Example: Change the group of the directory /user/<username>/shared to data_team:

hadoop fs -chgrp data_team /user/<username>/shared

Real-time Scenario: Manage group access and collaboration on data within your Hadoop cluster.

16. mv

Purpose: Moves or renames a file or directory.
Example: Move the file report.pdf to /user/<username>/reports:

hadoop fs -mv /user/<username>/temp/report.pdf /user/<username>/reports/report.pdf

Real-time Scenario: Organize data efficiently by moving or renaming files to dedicated locations.

17. cp

Purpose: Copies a file or directory within HDFS.
Example: Create a copy of the directory /user/<username>/data:

hadoop fs -cp /user/<username>/data /user/<username>/data_backup

Real-time Scenario: Back up data or create working copies for analysis or processing.

18. stat

Purpose: Displays the status information of a file or directory.
Example: View the replication factor, block size, and modification time of the file /user/<username>/metadata.json using a format string:

hadoop fs -stat "%r %o %y" /user/<username>/metadata.json

Real-time Scenario: Gain insights into file properties like replication factor (%r), block size (%o), and timestamps (%y) for effective data management. Without a format string, stat prints only the modification time.

19. touchz

Purpose: Creates an empty file in HDFS.
Example: Create a zero-byte file named marker.txt:

hadoop fs -touchz /user/<username>/markers/marker.txt

Real-time Scenario: Use this for placeholders or markers in workflows where empty files might be needed.

20. cat with a formatter

Purpose: Displays the contents of a structured file in HDFS in a human-readable format. There is no -textformat subcommand; instead, pipe cat through a local pretty-printer.
Example: Pretty-print the file customer_data.json:

hadoop fs -cat /user/<username>/customers/customer_data.json | python3 -m json.tool

Real-time Scenario: Make structured data formats like JSON easier to interpret visually; for XML, substitute a formatter such as xmllint --format -.

21. seq (local) with put

Purpose: Generates a sequence of numbers. HDFS has no -seq subcommand, so generate the data locally with the standard seq utility, then upload it.
Example: Create a file named numbers.txt containing numbers from 1 to 100 and store it in HDFS:

seq 1 100 > numbers.txt
hadoop fs -put numbers.txt /user/<username>/data/numbers.txt

Real-time Scenario: This is useful for creating test data or generating numerical inputs for MapReduce jobs.
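Since put accepts - as its source to read from standard input, the intermediate local file can be skipped entirely. A guarded sketch (the HDFS path is hypothetical, and the upload runs only when a hadoop client is on the PATH):

```shell
# Keep a local copy for inspection...
seq 1 100 > numbers.txt

# ...and stream the same sequence straight into HDFS without a
# temporary file, but only if a hadoop client is installed here.
if command -v hadoop >/dev/null 2>&1; then
  seq 1 100 | hadoop fs -put - /user/alice/data/numbers.txt
fi
```

Streaming through -put - avoids staging large generated datasets on local disk before they reach the cluster.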

22. put

Purpose: Identical to dfs -put; hadoop fs is the generic entry point, while hdfs dfs targets HDFS specifically (the older hadoop dfs spelling is deprecated).
Example: Upload a file named data.csv to HDFS:

hadoop fs -put data.csv /user/<username>/data/data.csv

Real-time Scenario: The two entry points are interchangeable for HDFS paths; pick one and use it consistently in your scripts.

23. get

Purpose: Identical to dfs -get; retrieves a file from HDFS to the local file system.
Example: Download the file results.txt from HDFS:

hadoop fs -get /user/<username>/output/results.txt results.txt

Real-time Scenario: Download processed data or results back to your local machine for further analysis.

24. getmerge

Purpose: Merges the contents of a directory in HDFS into a single file in the local file system.
Example: Combine all files in /user/<username>/logs into a single file merged_logs.txt:

hadoop fs -getmerge /user/<username>/logs merged_logs.txt

Real-time Scenario: This is helpful for consolidating log files or other datasets spread across multiple files into a single unit for easier processing.
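Under the hood, getmerge concatenates the source files in name order into one local file; a purely local sketch of the same effect (file names hypothetical):

```shell
# Simulate what getmerge produces: part files concatenated in name order.
mkdir -p logs
printf 'first\n'  > logs/part-0000
printf 'second\n' > logs/part-0001
cat logs/part-* > merged_logs.txt
cat merged_logs.txt   # prints "first" then "second"
```

Because the merge order follows file names, MapReduce output files (part-00000, part-00001, ...) come out in task order.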

25. checksum

Purpose: Prints the checksum of a file in HDFS for data integrity verification.
Example: Get the checksum of the file /user/<username>/input/data.txt:

hadoop fs -checksum /user/<username>/input/data.txt

Real-time Scenario: This helps verify that two HDFS files hold identical data without reading them in full. Note that the reported value is a composite block checksum (by default MD5-of-MD5-of-CRC32C), so it is not directly comparable to a local md5sum of the same bytes.

26. expunge

Purpose: Permanently removes files already moved to the HDFS trash; it empties trash checkpoints and takes no file argument.
Caution: Use with caution, as expunged files are unrecoverable.
Example: Empty the current user’s trash:

hadoop fs -expunge

To delete a specific file immediately, bypassing the trash, combine -rm with -skipTrash:

hadoop fs -rm -skipTrash /user/<username>/temp/old_data.txt

Real-time Scenario: This is useful for securely deleting sensitive data or reclaiming storage space when temporary files are no longer needed.

27. setrep

Purpose: Changes the replication factor of a file in HDFS, affecting how many copies are stored.
Example: Set the replication factor of the file /user/<username>/critical_data.txt to 3:

hadoop fs -setrep 3 /user/<username>/critical_data.txt

Real-time Scenario: This ensures redundancy and availability of crucial data by increasing its replication across the cluster.

28. distcp

Purpose: Efficiently copies large datasets within or between HDFS clusters by running the transfer as a parallel MapReduce job. Note that distcp is a top-level hadoop command, not an fs subcommand.
Example: Copy the directory /user/<username>/data from cluster A to cluster B:

hadoop distcp hdfs://clusterA/user/<username>/data hdfs://clusterB/user/<username>/data

Real-time Scenario: This is essential for data movement between Hadoop environments, distributing data for analysis, or sharing between clusters.

29. jar

Purpose: Runs Java applications on the Hadoop cluster.
Example: Execute the MyMainClass from the jar my_jar.jar:

hadoop jar my_jar.jar MyMainClass

Real-time Scenario: This enables utilizing custom Java code for data processing tasks within the Hadoop ecosystem.

30. job

Purpose: Manages running MapReduce jobs: listing them, checking their status, and killing them. (The older hadoop job command is deprecated in favor of mapred job.)
Example: List active jobs, then kill one by its ID:

mapred job -list
mapred job -kill <job_id>

Real-time Scenario: This is how you monitor and control the jobs submitted with hadoop jar, for example to stop a runaway job that is monopolizing cluster resources.

31. dfs -mv

Purpose: Equivalent to mv, moves or renames a file or directory within HDFS.
Example: Move the file /user/<username>/old_file.txt to /user/<username>/new_file.txt:

hadoop fs -mv /user/<username>/old_file.txt /user/<username>/new_file.txt

Real-time Scenario: Similar to mv, organize data effectively by renaming or moving files/directories to their designated locations.

32. dfs -cp

Purpose: Equivalent to cp, copies a file or directory within HDFS.
Example: Create a duplicate of the /user/<username>/data directory:

hadoop fs -cp /user/<username>/data /user/<username>/data_backup

Real-time Scenario: Similar to cp, create backups or working copies of data for analysis or processing within the HDFS environment.

33. dfs -du

Purpose: Calculates the disk usage of a specific file or directory in HDFS.
Example: Check the disk space occupied by the directory /user/<username>/logs (add -h for human-readable sizes or -s for a single summed total):

hadoop fs -du -h /user/<username>/logs

Real-time Scenario: Monitor storage consumption of individual files/directories to optimize resource allocation and identify potential space constraints.

34. dfs -stat

Purpose: Similar to stat, displays detailed information about a file or directory.
Example: View the owner, group, block size, and modification time of the file config.xml using a format string:

hadoop fs -stat "%u %g %o %y" /user/<username>/config/config.xml

Real-time Scenario: Gain comprehensive insights into file properties to understand data lineage, access control, and resource utilization.

35. put from stdin

Purpose: Combines multiple local files into a single file in HDFS. There is no -putmerge subcommand (getmerge works only in the HDFS-to-local direction), but put reads from standard input when the source is -, so a local cat can do the merging.
Example: Merge files part1.txt and part2.txt into merged_file.txt in HDFS:

cat part1.txt part2.txt | hadoop fs -put - /user/<username>/merged_file.txt

Real-time Scenario: This is an efficient way to upload multiple related files as a single unit, especially for processing or further analysis.

36. dfs -appendToFile

Purpose: Appends local file content to the end of an existing file in HDFS (the subcommand is -appendToFile; there is no -append).
Example: Add the contents of new_data.txt to the end of existing_data.txt:

hadoop fs -appendToFile new_data.txt /user/<username>/existing_data.txt

Real-time Scenario: This is useful for updating or incrementally adding data to existing files without overwriting the original content.

37. dfs -tail -f

Purpose: Continuously follows the growth of a file and displays its tail in real-time.
Example: Monitor the access_log.txt file for new log entries as they appear:

hadoop fs -tail -f /user/<username>/logs/access_log.txt

Real-time Scenario: This is valuable for live troubleshooting, monitoring log files for errors, or observing updates to data streams.

38. dfs -put -p

Purpose: Uploads a local file to HDFS while preserving its access and modification times, ownership, and permissions. (The -l flag does something entirely different: it allows lazy persistence and forces a replication factor of 1.)
Example: Upload sensitive_data.txt to HDFS while maintaining its access control settings:

hadoop fs -put -p sensitive_data.txt /user/<username>/restricted/sensitive_data.txt

Real-time Scenario: This helps keep security settings consistent when transferring data to HDFS, in line with data governance and privacy policies.

39. dfs -ls -R

Purpose: Lists the contents of a directory recursively, including subdirectories and their contents. (The older -lsr spelling is deprecated in favor of -ls -R.)
Example: Get a complete listing of all files and subdirectories within /user/<username>/projects:

hadoop fs -ls -R /user/<username>/projects

40. dfs -cp -f

Purpose: The dfs -cp -f command forces the copy of a file or directory within HDFS, even if the destination already exists, overwriting any existing data at the destination path. (For a forced upload from the local file system, use -put -f instead; -cp operates on paths within Hadoop-accessible filesystems.)
Syntax:

hadoop fs -cp -f <source_path> <destination_path>

-f: Flag indicating to force the copy even if the destination exists.

Real-time Example: Suppose an HDFS staging directory /user/<username>/staging/reports contains updated reports. You want to overwrite the existing /user/<username>/reports directory with this newer data, even if it currently contains different reports.

Here’s how you can use the dfs -cp -f command:

hadoop fs -cp -f /user/<username>/staging/reports /user/<username>/reports

This command will:

  1. Copy the staging reports directory recursively into /user/<username>/reports.
  2. Overwrite any conflicting files within /user/<username>/reports with the files from the staging copy.

Important Considerations:

  • Use with caution: This command forcefully overwrites existing data. Double-check your source and destination paths to avoid accidental data loss.
  • Alternatives: If you want to avoid overwriting, consider using dfs -cp without the -f flag, which will only copy if the destination doesn’t exist.
  • Partial overwrites: If the source and destination have overlapping directory structures, only the conflicting files or directories will be overwritten.

Additional Notes:

  • This command can be useful for updating existing datasets in HDFS with newer versions.
  • It can also be used to create backups of important directories, knowing that the backup will overwrite any previous version.
  • Use this command responsibly and always have a clear understanding of the source and destination paths.
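One way to honor the caution above is to probe the destination before forcing the copy; fs -test -e exits 0 when a path exists. A guarded sketch (the paths and the FORCE convention are hypothetical, and the HDFS calls run only when a hadoop client is on the PATH):

```shell
# Hypothetical safety wrapper: refuse to force-copy over an existing
# destination unless the caller explicitly sets FORCE=1.
src=/user/alice/staging/reports
dst=/user/alice/reports

if command -v hadoop >/dev/null 2>&1; then
  if hadoop fs -test -e "$dst" && [ "${FORCE:-0}" != "1" ]; then
    echo "destination $dst exists; set FORCE=1 to overwrite" >&2
  else
    hadoop fs -cp -f "$src" "$dst"
  fi
fi
```

Wrapping the overwrite behind an explicit flag turns a silent data-loss risk into a deliberate, logged decision.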
