Hadoop Frequently Asked Questions (FAQ) (Doc ID 1530797.1)

Last updated on MAY 24, 2021

Applies to:

Big Data Appliance Integrated Software - Version 2.0.1 and later
Linux x86-64


This document provides answers to frequently asked questions about Hadoop distributed by Cloudera for use on the Oracle Big Data Appliance (BDA).

Questions and Answers


 Is the environment variable $HADOOP_HOME used in CDH 4.1.2?
 In lieu of the environment variable $HADOOP_HOME, what should be used in CDH 4.1.2?
 Should OS disks (/dev/sda, /dev/sdb) be used to store local data? HDFS data?
 How can data on the OS disks be cleaned up, since storing it there is not recommended?
 Does the Cloudera CDH Client have to be installed on all Exadata DB nodes?
 If a disk goes bad and is replaced, can you verify the disk is functional with regard to HDFS?
 If one of the services managed by Cloudera Manager (CM) goes into "BAD" health, is there a recommended order for checking the status of services?
 If the nodes of the BDA cluster have been up for close to 200 days is a reboot recommended?
 Can you decommission non-critical nodes from the BDA HDFS cluster in order to install NoSQL?
 For HA testing is it possible to relocate Hive services to a different node after a Hive node failure?
 What options are available for migrating service roles on the BDA?
 What are the options for destroying, i.e. performing a non-recoverable delete of, all the data stored on the DataNodes in HDFS?
 When destroying HDFS data, is there an option for replacing the data blocks on all DataNodes with some random pattern of bytes (0s/1s or something else)? In other words, is there a way to securely delete sensitive data from HDFS by overwriting the physical disk locations with randomly generated output?
 Running a very long reducer seems to be filling one DataNode. Why would that be?
 Why are zookeeper, hdfs, mapred, yarn, hive, and sqoop users in /etc/passwd?
 Is it possible to limit the memory and CPU consumption of different BDA processes so that they do not exceed a specific threshold?
 Are HDFS Encryption and Navigator Key Trustee of the Cloudera stack supported on BDA 4.1 with CDH 5.3.0?
 NameNode data directories are configured on the root ("/") mount point. We noticed that the "NameNode Data Directories" setting is "/opt/hadoop/dfs/nn", which is on the root ("/") partition. The problem with this setting is that the root partition could be filled up by things like "/var/log", "/tmp", "/home", etc. Is there a particular reason for configuring the NameNode data directory on the root partition rather than on its own dedicated disk/mount point?
 The Zookeeper data directory is configured on the root ("/") mount point. We noticed that the Zookeeper data directory ("dataDir") is also configured to write to the root mount point ("/var/lib/zookeeper"). This has the same problem as the NameNode question above. On top of that, our understanding is that Zookeeper is very sensitive to disk latency, and to ensure Zookeeper does not face disk latencies one should configure "dataDir" on a dedicated disk. Is there any particular reason why Zookeeper is not configured with dedicated disks/mount points for "dataDir"?
 The JournalNode edits directory ("dfs.journalnode.edits.dir") is configured on the root ("/") mount point. Is there any particular reason why the JournalNodes are not configured with dedicated disks/mount points for "dfs.journalnode.edits.dir"?
 To solve the above-mentioned issues (NameNode, Zookeeper, and JournalNode directories), we are planning to dedicate some of the data disks (/u0x) on node01, node02, and node03 to these purposes. Are there any supportability issues if we go down this path? Any suggestions on how to deal with this situation without affecting the supportability of the cluster from the Oracle side are much appreciated!
 If implementing a script to replicate HDFS and Hive data using the Cloudera API, is it possible to use the current time zone for the replication schedule?
 Is there a property or method to copy data from a local file system to HDFS in parallel, to speed up data copies?
 Where can installation information be found for Cloudera Data Science Workbench?
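A note on the disk-replacement question above: independent of the BDA-specific answer behind the login, stock Apache Hadoop provides CLI checks for confirming that HDFS sees a replaced disk as healthy. A minimal sketch using standard commands (run as a user with HDFS superuser privileges, e.g. the hdfs user):

```shell
#!/bin/sh
# Generic HDFS health checks after a disk replacement (illustrative,
# not the document's official procedure).

# Summarize cluster capacity and per-DataNode status; a replaced disk
# should show up as restored capacity with no failed volumes reported.
hdfs dfsadmin -report

# Walk the namespace and report missing, corrupt, or under-replicated
# blocks; a healthy cluster ends with status HEALTHY.
hdfs fsck / -blocks -locations
```

Cloudera Manager surfaces the same information in its health checks, so the CLI output is mainly useful for confirmation from a terminal.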
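On the parallel-copy question above: a plain "hdfs dfs -put" streams data through a single client process, whereas the standard Hadoop tool for parallel copies is "hadoop distcp", which runs the copy as a MapReduce job so multiple mappers copy files concurrently. A hedged sketch with illustrative paths (not the document's official answer):

```shell
#!/bin/sh
# Parallel copy into HDFS with distcp; -m caps the number of
# simultaneous map tasks. Paths below are hypothetical examples.
# Note: with a file:// source, the path must be readable at the same
# location on every node that may run a mapper (e.g. a shared NFS mount).
hadoop distcp -m 20 file:///shared/staging hdfs:///user/etl/staging
```

For a purely local source directory that exists on only one host, running several "hdfs dfs -put" commands in parallel over disjoint subdirectories is the usual workaround.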
