How to disable Copy, Paste and Text Selection using CSS

Disable Copy/Paste and Text Selection using CSS and JavaScript.

CSS Code to disable text selection and highlighting.

This CSS code disables text highlighting and selection on your website or blog post. It works in Internet Explorer, Mozilla Firefox and Google Chrome. In Blogger, you can place the code inside the .post-body rule.
-webkit-user-select: none;
-khtml-user-drag: none;
-khtml-user-select: none;
-moz-user-select: none;
-moz-user-select: -moz-none;
-ms-user-select: none;
-o-user-select: none;
user-select: none;
One thing to note about the above CSS code: in Firefox it also disables copying of your post content. In Chrome, however, copying is still possible.

CSS Code to enable text selection in Blockquote.

There may be cases where a certain portion of your post still needs highlighting and selection, say the blockquote area where you post sample code. In that case, you just need to add:
.post blockquote {
  -webkit-touch-callout: text;
  -webkit-user-select: text;
  -khtml-user-select: text;
  -moz-user-select: text;
  -ms-user-select: text;
  -o-user-select: text;
  user-select: text;
}

Disable Copy/Paste of your content.

To disable copy/paste in Blogger and WordPress posts, copy the code below into your theme template. This will make copying text a little harder for content scrapers.
<script type='text/javascript'>
// Returning false from these handlers blocks the default selection behaviour.
function killCopy(e){
  return false;
}
function reEnable(){
  return true;
}
// Blocks text selection in IE and Chrome.
document.onselectstart = function(){ return false; };
// Older Firefox (detected via window.sidebar) ignores onselectstart,
// so block selection on mousedown there and re-enable normal clicks.
if (window.sidebar){
  document.onmousedown = killCopy;
  document.onclick = reEnable;
}
</script>

Online HTML to XML Parser

HTML to XML Converter/Parser Tool. Convert your Blogger code from HTML to XML.

Convert your Adsense, Chitika or any other HTML code to XML and make it compatible with blogger templates.

Since Blogger templates don't accept raw HTML, this tool will safely escape all the JavaScript and HTML code into XML for use in Blogger.

HTML Table Generator


Create a CSS-styled HTML table of your choice in no time.

This online tool will let you build and design your HTML table from a variety of styles. You just need to copy the HTML code generated into your website.

Please use the options provided below to create your custom HTML table.


How the HTML table will look in your website or blog:

Header 1     | Header 2     | Header 3     | Header 4     | Header 5
Row:1 Cell:1 | Row:1 Cell:2 | Row:1 Cell:3 | Row:1 Cell:4 | Row:1 Cell:5
Row:2 Cell:1 | Row:2 Cell:2 | Row:2 Cell:3 | Row:2 Cell:4 | Row:2 Cell:5
Row:3 Cell:1 | Row:3 Cell:2 | Row:3 Cell:3 | Row:3 Cell:4 | Row:3 Cell:5
Row:4 Cell:1 | Row:4 Cell:2 | Row:4 Cell:3 | Row:4 Cell:4 | Row:4 Cell:5
Row:5 Cell:1 | Row:5 Cell:2 | Row:5 Cell:3 | Row:5 Cell:4 | Row:5 Cell:5
Row:6 Cell:1 | Row:6 Cell:2 | Row:6 Cell:3 | Row:6 Cell:4 | Row:6 Cell:5

CSS Code to make the footer stay at the bottom of a page

Make the HTML page footer stay at the bottom of the page.

CSS code for Footer at the bottom of a page.

The main body is stretched to 100% of the page height, and the footer is then given the CSS rules below:
#footer {
clear: both;
position: relative;
z-index: 10;
text-align: center;
height:30px;
}

CSS code for Fixed Footer at the bottom.

There is another approach that works in all browsers: the footer is fixed to the bottom of the window, so it stays there even when you scroll. Note, however, that the footer overlaps the page content instead of being pushed below it.
#footer {
position:fixed;
bottom:0;
left:0;
right:0;
width:100%;
z-index: -999;
overflow: hidden;
}

Hadoop Cluster Interview Questions Answers

Apache Hadoop Cluster Interview Questions.

Explain about the Hadoop-core configuration files?
Hadoop core is specified by two resources. It is configured by two XML files which are loaded from the classpath:
1. hadoop-default.xml  -  Read-only defaults for Hadoop, suitable for a single machine instance
2. hadoop-site.xml - Site-specific configuration for a Hadoop installation. This is where the Hadoop administrator provides the cluster-specific information.

Explain in brief the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode - No Hadoop daemons running, everything runs on a single Java Virtual machine.
2. Pseudo-distributed mode - Daemons run on the local machine, thereby simulating a cluster on a smaller scale.
3. Fully distributed mode - Runs on a cluster of machines.

Explain what are the features of Stand alone (local) mode?

In stand-alone or local mode there are no Hadoop daemons running, and everything runs in a single Java process. Hence we don't get the benefit of distributing the code across a cluster of machines. Since it has no DFS, it utilizes the local file system. This mode is suitable only for running MapReduce programs by developers during various stages of development. It is the best environment for learning and good for debugging purposes.

What are the main features of Pseudo mode?

In Pseudo-distributed mode, each Hadoop daemon runs in a separate Java process; as such it simulates a cluster, though on a small scale. This mode is used for both development and QA environments. Here, we need to make the required configuration changes.

What are the features of Fully Distributed mode?

In Fully Distributed mode, clusters range from a few nodes to 'n' number of nodes. It is used in the production environment, where we have thousands of machines in the Hadoop cluster. The daemons of Hadoop run on these clusters. We have to configure separate masters and separate slaves in this distribution, the implementation of which is quite complex. In this configuration, Namenode and Datanode run on different hosts and there are nodes on which the task tracker runs. The root of the distribution is referred to as HADOOP_HOME.

What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

2. conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

3. conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a ‘conf‘ directory, as in the case of UNIX.

In which directory is Hadoop installed?

Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.

What are the port numbers of Namenode, job tracker and task tracker?

The default web UI port number for the Namenode is 50070, for the job tracker it is 50030 and for the task tracker it is 50060.

What is a spill factor with respect to the RAM?
The spill factor is the threshold, expressed as a fraction of the in-memory buffer, after which map output is spilled to temporary files on the local disk. The Hadoop temp directory (hadoop.tmp.dir) is used for this.

Is fs.mapr.working.dir a single directory?

Yes, fs.mapr.working.dir is just one directory.

Which are the three main hdfs-site.xml properties?

The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives the location where the Namenode metadata will be stored, i.e. where DFS keeps it – on local disk or on a remote location.
2. dfs.data.dir, which gives the location where the actual data is going to be stored.
3. fs.checkpoint.dir, which is the directory used by the secondary Namenode.

How do you come out of insert mode?

To come out of insert mode, press ESC, then type :q (if you have not written anything) or :wq (if you have written anything to the file), and then press ENTER.

What is Cloudera and why is it used?

Cloudera provides a distribution of Hadoop (built on Apache Hadoop) that is used for data processing. On the Cloudera VM, ‘cloudera’ is also the default user.

What happens if you get a ‘connection refused java exception’ when you type hadoop fsck /?

It could mean that the Namenode is not working on your VM.

We are using the Ubuntu operating system with Cloudera, but from where can we download Hadoop, or does it come by default with Ubuntu?

This is a default configuration of Hadoop that you have to download from Cloudera or from Edureka’s dropbox and then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps present at the Cloudera location or in Edureka’s dropbox. You can go either way.

What does the ‘jps’ command do?

This command checks whether the Namenode, datanode, task tracker, job tracker, etc. are running or not.

How can I restart the Namenode?

1. Run stop-all.sh and then start-all.sh, OR
2. Run sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).

What is the full form of fsck?

Full form of fsck is File System Check.

How can we check whether Namenode is working or not?

To check whether the Namenode is working or not, use the command /etc/init.d/hadoop-0.20-namenode status, or simply run jps.

What does the mapred.job.tracker property do?

mapred.job.tracker is a configuration property rather than a command; it specifies which of your nodes acts as the job tracker (its host and port).

What does /etc/init.d do?

/etc/init.d is where daemon (service) startup scripts are placed, and it can be used to see the status of these daemons. It is very Linux-specific and has nothing to do with Hadoop.

How can we look for the Namenode in the browser?

If you have to look for the Namenode in the browser, you don’t have to give localhost:8021; the port number to look for the Namenode in the browser is 50070.

How to change from SU to Cloudera?

To change from SU to Cloudera just type exit.

Which files are used by the startup and shutdown commands?

Slaves and Masters are used by the startup and the shutdown commands.

What do slaves consist of?

Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers.

What do masters consist of?

Masters contain a list of hosts, one per line, that are to host secondary namenode servers.

What does hadoop-env.sh do?

hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.

Can we have multiple entries in the master files?

Yes, we can have multiple entries in the Master files.

Where is hadoop-env.sh file present?

hadoop-env.sh file is present in the conf location.

In Hadoop_PID_DIR, what does PID stand for?

PID stands for ‘Process ID’.

What does /var/hadoop/pids do?

It stores the PID.

What does hadoop-metrics.properties file do?

hadoop-metrics.properties is used for ‘Reporting‘ purposes. It controls the reporting for
Hadoop. The default status is ‘not to report‘.

What are the network requirements for Hadoop?

The Hadoop core uses SSH (Secure Shell) to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slaves and the secondary machines.

Why do we need a password-less SSH in Fully Distributed environment?

We need a password-less SSH in a Fully Distributed environment because when the cluster is LIVE and running in a Fully Distributed environment, the communication is very frequent. The job tracker should be able to send a task to the task tracker quickly.

Does this lead to security issues?

No, not at all. A Hadoop cluster is an isolated cluster, and generally it has nothing to do with the internet. It has a different kind of configuration, so we needn’t worry about that kind of security breach, for instance, someone hacking through the internet, and so on. Hadoop has a very secure way to connect to other machines to fetch and process data.

On which port does SSH work?

SSH works on Port No. 22, though it can be configured. 22 is the default Port number.

Can you tell us more about SSH?

SSH is nothing but secure shell communication; it is a protocol that works on Port No. 22 by default, and when you do an SSH, what you really require is a password.

Why is a password needed in SSH localhost?

A password is required in SSH for security, and in situations where password-less communication is not set up.

Do we need to give a password, even if the key is added in SSH?

Yes, password is still required even if the key is added in SSH.

What if a Namenode has no data?

If a Namenode has no data it is not a Namenode. Practically, Namenode will have some
data.

What happens to job tracker when Namenode is down?

When the Namenode is down, your cluster is OFF because the Namenode is the single point of failure in HDFS.

What happens to a Namenode, when job tracker is down?

When the job tracker is down, it will not be functional, but the Namenode will still be present. So the cluster is accessible as long as the Namenode is working, even if the job tracker is not.

Can you give us some more details about SSH communication between the Masters and the Slaves?

SSH is a password-less secure communication where data packets are sent across to the slaves. It has a format into which the data is sent. SSH is not only between masters and slaves but also between two hosts.

What is formatting of the DFS?

Just like we do for Windows, DFS is formatted for proper structuring. It is not usually done
as it formats the Namenode too.

Does the HDFS client decide the input split or Namenode?

No, the Client does not decide. It is already specified in one of the configurations through
which input split is already configured.

In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu, can we do it?

Yes, you can go ahead with this! There are installation steps for creating a new cluster. You
can uninstall your present cluster and install the new cluster.

Can we create a Hadoop cluster from scratch?

Yes we can do that also once we are familiar with the Hadoop environment.

Can we use Windows for Hadoop?

Actually, Red Hat Linux or Ubuntu are the best Operating Systems for Hadoop. Windows is
not used frequently for installing Hadoop as there are many support problems attached
with Windows. Thus, Windows is not a preferred environment for Hadoop.

Big Data

Big Data Interview Questions, Downloads, Tutorials.

Apache Hadoop Interview Questions Answers

MapReduce Interview Questions

Apache Pig Interview Questions Answers

Big Data Downloads 

 

Siebel Server Down Troubleshooting

Siebel Server is not coming up in Linux/Unix and Windows.

Server Busy Error for Siebel Server. Steps for Troubleshooting.

1. Ensure that you validate that the .srf file is not corrupt by placing the SRF on a dedicated environment. If the dedicated client is facing issues, there could be two possible reasons, as cited below.
  • The SRF file in the server got corrupt.
  • The Oracle Database Server itself is down
2. Check that no Siebel processes for the enterprise are still running.
Windows: Check Task Manager for any Siebel process for the enterprise still running.

3. Stop the Siebel servers after executing ./siebenv.sh and the command stop_server all.

Solaris: Execute ps -ef | grep [directory path] (eg. ps -ef | grep /app/siebel/siebsrvr).
ps -ef | grep sieb
Ensure that all processes for that enterprise are killed.
use kill -9 pid

4. Delete any file that exists in directory %SIEBEL_ROOT%\sys with name like:
osdf.[SiebelEnterprise].[SiebelServer]
Where
[SiebelEnterprise] = The Siebel Enterprise name
[SiebelServer] = The Siebel Server name.

5. Delete any file that exists in directory %SIEBEL_ROOT%\admin with a name like: *.shm
.shm files are shared memory files. They should be deleted automatically when the Siebel server is
shut down; if one still exists when the Siebel server is down, then it has been corrupted and not correctly removed.

6. Delete fdr and core files, as these files eat up large amounts of disk space.

7. Clean up unwanted logarchive and log files so that fresh logs can be monitored and space can be freed up.

8. Try to restart the server after executing ./siebenv.sh and the command start_server all.
Note- At this point if the server still does not restart, you need to check the enterprise log for the reason.
The enterprise log is located in:
%SIEBEL_ROOT%\enterprises\[SiebelEnterprise]\[SiebelServer]\log
The enterprise log has name with format:
[SiebelEnterprise].[SiebelServer].log

9. If no enterprise logs are getting created, there are connectivity issues with the database,
e.g. a change of the DB password for the SADMIN user.
Run odbcsql from siebsrvr/bin to check connectivity issues:
odbcsql /u SADMIN /p SADMIN /s DSN Name

10. Any changes which lead to the corruption of siebns.dat will also result in the servers not coming up.
Usually the NameSrvr logs give connectivity-related information and errors like "key not found". Try reverting
to an old working siebns.dat file.

11. If the environment is LDAP authenticated, any changes in the LDAP trees can also affect the environment.
Please verify the same.

12. Check the SCBroker and SRBroker logs; you would get a hint.

13. Use netstat -an | grep 2320 to verify that the gateway service port is listening.

14. Use netstat -an | grep 2321 to verify that the SRBroker/SCBroker port is listening.

Big Data and Apache Hadoop Free Downloads

Big Data and Hadoop related Downloads. 

Download Hadoop for Windows:

Hadoop is a powerful framework for automatic parallelization of computing tasks. Although it is mostly available for Unix/Linux, this installer helps you develop applications and analyze big data stored in Apache Hadoop running on Microsoft Windows.

Download Hortonworks Hadoop Sandbox:

Learn Hadoop with Hortonworks Sandbox. A free download that comes with many interactive Hadoop tutorials.

Download Cloudera Hadoop :

Reveal insights from all your data, store everything forever without data loss or archiving, and make data an integral component of your enterprise.

Download MapR Hadoop :

MapR delivers on the promise of Hadoop with a proven, enterprise-grade Big Data platform that supports a broad set of mission-critical and real-time production applications.

Download Cloudera Impala :

Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop.

Download Cassandra :

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers.

Download VoltDB :

A high-performance, in-memory, scalable RDBMS for Big Data, high-velocity OLTP and real-time analytics.

MongoDB 3rd-Party Admin Tools:

Monitoring is a critical component of all database administration. A number of third-party monitoring tools have support for MongoDB. There are some very good tools available in the MongoDB package itself, but the list provided here will help you a lot with administration.

Download Oracle NoSQL Database:

Oracle NoSQL Database is a distributed, high-performance, highly available, scalable NoSQL key-value database from Oracle.

Download Couchbase Server:

Couchbase Server is a distributed, non-relational NoSQL database that can easily accommodate changing data management needs.

Download Neo4J :

Neo4j is an open-source graph database, implemented in Java. Developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables". Neo4j is the most popular graph database.

Continuity AppFabric :

Delivered as a cloud PaaS, the Continuity AppFabric is an application run-time and data platform, which sits on top of open source Hadoop.

Amazon Hadoop/Mapreduce:

Amazon Elastic MapReduce automatically spins up a Hadoop implementation of the MapReduce framework on Amazon EC2 instances. Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud.

Download Spring For Hadoop:

Spring for Apache Hadoop is a framework for application developers to take advantage of the features of both Hadoop and Spring.

MORTAR Hadoop Platform :

Mortar has a great platform for leveraging Hadoop, Pig and Python. Mortar is the fastest and easiest way to work with Pig.

I will update other Big Data and Apache Hadoop Free Downloads in my next post.

Hadoop Pig Interview Questions and Answers

Apache Pig Interview Questions and Answers. 

If you are planning to pursue a career in Hadoop, then you can expect some PIG interview Questions. 

Explain what is PIG in Big Data?
PIG is nothing but a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. PIG’s infrastructure layer consists of a compiler that produces sequences of MapReduce programs. Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Explain PIG's language layer and its properties?
Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensible. Users can create their own functions to do special-purpose processing.

What is bag in PIG?
Answer: A bag is a representation of one of the data models in Pig. A bag is a collection of tuples in an unordered form; this collection may hold duplicate tuples or records. Bags are used to store collections while grouping. The size of a bag is bounded by the size of the local disk, which means the size of a bag is limited. When the bag is full, Pig will spill the bag to the local disk and keep only some parts of the bag in memory. There is no necessity that the complete bag fit into memory. We represent bags with “{}”.

Explain what is the difference between logical and physical plans in PIG?
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

What is the difference between PIG and SQL?
The differences between Pig and SQL include Pig's usage of lazy evaluation, Pig's usage for ETL, Pig's ability to store data at any point during a pipeline, Pig's explicit declaration of execution plans, and Pig's support for pipeline splits.

  • It has been argued that DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.
  • Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead declarative. In SQL users can specify that data from two tables must be joined, but not what join implementation to use. Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways.
  • In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.
  • SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built in mechanism for splitting a data processing stream and applying different operators to each sub-stream. Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.
  • Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.

What do you mean by co-group and explain its function in Pig? 
Answer : The COGROUP command in Pig is a combination of some sort of both a GROUP and a JOIN.
COGROUP is very similar to GROUP in its function. Co-group does the joins of a data set by grouping one particular data set only. It does the grouping of the elements by their common field or key and after that it returns a set of records containing two separate bags. Now, the first bag consists of the records of the first data set with the common data set and the second bag, the second data set with the set of common data. Let us illustrate it by an example cited below.

Let us consider we have a dataset of employees and their offices:
$ cat > employees.csv
Steve,Microsoft
Joe,Google
John,Cisco
George,Microsoft
Matt,Google
We will COGROUP the companies using Pig as cited below:
employees= LOAD 'employees.csv'
USING PigStorage(',')
AS (employee:chararray,company:chararray);
grouped = COGROUP employees BY company;
DUMP grouped;
This returns the list of companies for the employees. For each company, Pig groups the matching rows into bags. The resulting table grouped is:

group     | employees
Google    | {(Joe,Google),(Matt,Google)}
Microsoft | {(Steve,Microsoft),(George,Microsoft)}
Cisco     | {(John,Cisco)}
Can you give us some examples of how Hadoop is used in a real-time environment?
Let us consider a scenario where we have an exam consisting of 10 multiple-choice questions and 20 students appear for that exam. Every student will attempt each question. For each question and each answer option, a key will be generated. So we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students have answered correctly. This isn’t an easy task. Here Hadoop comes into the picture! Hadoop helps you in solving these problems quickly and without much effort. You may also take the case of how many students have wrongly attempted a particular question.
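
A hedged sketch of the exam-scoring idea above, assuming a hypothetical input format of one line per answer: studentId,questionId,selectedOption,correctOption. The mapper emits (studentId, 1) for every correctly answered question; a simple sum reducer would then total each student's score.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: counts a question for a student only when the selected
// option matches the correct option.
public class CorrectAnswerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 4 && fields[2].equals(fields[3])) {
            context.write(new Text(fields[0]), ONE);
        }
    }
}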

What is BloomMapFile used for in PIG?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.

Does ‘ILLUSTRATE’ run MR job?
No, illustrate will not pull any MR, it will pull the internal data. On the console, illustrate will not do any job. It just shows output of each stage and not the final output.

Is the keyword ‘DEFINE’ like a function name?
Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered your jar, you have to define the function. Whatever logic you have written in your Java program is exported and registered as a jar. The compiler will check for the function in the built-in library; when the function is not present there, it looks into your registered jar.
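
For illustration, here is a minimal sketch of a Pig UDF written in Java (the class name and logic are hypothetical, not from the original answer). After packaging it into a jar, you would REGISTER the jar in your Pig script and optionally DEFINE a short alias for the fully qualified class name.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases the first field of the input tuple.
// Pig script usage (sketch): REGISTER myudfs.jar; DEFINE TO_UPPER UpperCase();
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}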

Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?
No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). While using UDF, we have to override some functions. Certainly you have to do your job with the help of these functions only. But the keyword ‘FUNCTIONAL’ is a built-in function i.e a pre-defined function, therefore it does not work as a UDF.  

Explain why do we need MapReduce during Pig programming?
Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into MapReduce jobs. As such, MapReduce acts as the execution engine.  

What does FOREACH do? 
FOREACH is used to apply transformations to the data and to generate new data items. The name itself is indicating that for each element of a data bag, the respective action will be performed.

Syntax : FOREACH bagname GENERATE expression1, expression2, ….. The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.

Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is a missing field or a type mismatch, and even if such a warning were logged, it would be difficult to find in the log file. If any mismatch is found, Pig assumes a null value.

Can we say cogroup is a group of more than 1 data set? 
Cogroup is a group of one data set. But in the case of more than one data sets, cogroup will group all the data sets and join them based on the common field. Hence, we can say that cogroup is a group of more than one data set and join of that data set as well.

Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?
Let us take a scenario where we want to count the population in two cities. I have a data set and sensor list of different cities. I want to count the population by using one mapreduce for two cities.

Let us assume that one is Bangalore and the other is Noida. So I need to consider key of Bangalore city similar to Noida through which I can bring the population data of these two cities to one reducer.

The idea behind this is that somehow I have to instruct the map-reduce program – whenever you find a city with the name ‘Bangalore‘ or a city with the name ‘Noida’, create an alias name which will be the common name for these two cities, so that you create a common key for both cities and it gets passed to the same reducer. For this, we have to write a custom partitioner. In MapReduce, when you create a ‘key’ for a city, you have to consider ’city’ as the key. So, whenever the framework comes across a different city, it considers it as a different key. Hence, we need to use a customized partitioner.

There is a provision in MapReduce only, where you can write your custom partitioner and specify that if the city is Bangalore or Noida then return the same partition (hash code). However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
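
A hedged sketch of the custom partitioner described above, assuming the map output key is the city name (Text) and the value is a count (IntWritable). Both 'Bangalore' and 'Noida' are forced into the same partition so that their records reach the same reducer; it would be wired in with job.setPartitionerClass(CityPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String city = key.toString();
        // Treat the two cities as one logical key so both land on the same reducer.
        if (city.equals("Bangalore") || city.equals("Noida")) {
            return 0;
        }
        return (city.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}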

Siebel srvrmgr Commands in Windows

Siebel Server Commands using srvrmgr.

Navigate in command prompt to the Siebel Server SRVRMGR directory.

C:\sba80\siebsrvr\BIN>

In the command line window type the following.

srvrmgr /g <siebel_gateway_host_address> /e <siebel_enterprise_name> /s  <siebel_server_name> /u SADMIN /p SADMIN
  • list servers
  • set server SBLAPP1
  • unset server SBLAPP1
  • unset server
  • list component groups
  • disable component group Workflow
  • enable component group Remote for server <siebel_server_name>
  • assign component group Remote to server <siebel_server_name>
  • shutdown appserver <siebel_server_name>
  • startup appserver <siebel_server_name>

Siebel Server Log Level for components

  • list evtloglvl for component eCommunicationsObjMgr_enu
  • change evtloglvl Error=5 for component eCommunicationsObjMgr_enu
  • change evtloglvl GenericLog=1, Performance=1, Trace=0, TaskConfig=0, SrmRouting=0, SQLParseAndExecute=1, SQLSlowQuery=4, Perf=1, ProcReq=1 for comp WorkMon
  • activate component definition GenNewDb for server <siebel_server_name>

Siebel Server List Commands for components.

  • list session for comp eComm% login SADMIN
  • list active sessions for comp eCommunica% login SADMIN for server %SBL%
  • list comp WfProcMgr% for server %
  • list comp WfProcMgr%
  • list task for comp eCommunicationsObjMgr_enu
  • list parameters for server %
  • list advanced parameters for server %
  • list tasks for process (siebel 8.1)

Siebel SRVRMGR List commands:

  •  list { component groups | compgrps }
  •  list { component definitions | comp defs }
  •  list { component types | comp types }
  •  list { parameter definitions | param defs }
  •  list { stateval definitions | sval defs }
  •  list { statistic definitions | stat defs }
  •  list [siebel] servers
  •  list all servers
  •  list tasks
  •  list session
  •  list { evtloglvl  | event loglevel }
  •  list { named subsystem }
  •  list { subsystems }
  •  list { components | comps }
  •  list { parameters | params }
  •  list { enterprise parameters | ent params }
  •  list { state values | statevals | svals }
  •  list { statistics | stats }

Siebel Server commands:

  •  startup appserver <server_name>
  •  shutdown appserver <server_name>

Siebel Component Group SRVRMGR commands:

  •  create component group
  •  enable component group
  •  disable component group
  •  remove component group
  •  delete component group
  •  assign component group

Siebel Component commands:

  •  startup component
  •  pause component
  •  resume component
  •  shutdown component
  •  auto start component
  •  manual start component
  •  shutdown fast component

Siebel Component Definition commands:

  •  create component definition
  •  activate component definition
  •  deactivate component definition
  •  delete component definition

Siebel Task management commands:

  •  run task
  •  start task
  •  pause task
  •  resume task
  •  stop task

Siebel Parameter management commands:

  •  change parameter
  •  delete parameter

Siebel Event logging level management commands:

  •  change { evtloglvl  | event loglevel }

Siebel Named Subsystem Commands:

  •  create named subsystem
  •  delete named subsystem

Siebel Local commands:

  •  show
  •  show <variable>
  •  set
  •  set <variable> <value>
  •  unset <variable>
  •  alias {<aliasName> <aliasValue>} (to add a new alias)
  •  alias                            (to list all the existing aliases)
  •  unalias <aliasName>              (to remove an existing alias)
  •  save preferences                 (to save the preferences)
  •  load preferences                 (to load the preferences)
  •  configure list <ValidListCmd> 
  •  [show [ all | <column_name>   [ (column_width[disp_fmt]) ] [ as column_display_name ]
  •  [, <column_name> ] [ (column_width[disp_fmt]) ] [ as column_display_name ] ... ] ]

Siebel Spool and Read commands:

  •  Turn Spool on:  spool <filename>
  •  Turn spool off:  spool off     
  •  read <filename>

Siebel essential commands:

  •  backup namesrvr [File name]
  •  refresh enterprise [server] | entsrvr | ent server
  •  sleep <second>
  •  flush fdr for { process | proc } <OS_process_id> [ [app] server <server_name> ]

Hadoop HDFS Interview Questions Answers

A complete list of Hadoop Interview Questions and Answers on HDFS. 

The below list of Big Data and Hadoop Interview Questions will be helpful in clearing a Big Data Interview.

Please give a detailed overview about the Big Data being generated by social networking website Facebook? 
As of January 31, 2013, there are 1.08 billion monthly active users on Facebook and 685 million mobile users. On average, 3.2 billion likes and comments are posted on Facebook every day. 72% of the web audience is on Facebook. There is so much activity on Facebook – wall posts, sharing images and videos, writing comments and liking posts. Facebook started using Hadoop in mid-2009 and was one of its initial users.

Explain what are the three characteristics of Big Data according to IBM ? 
The three characteristics of Big Data are: 

Velocity: Analyzing 2 million records each day to identify the reason for losses. 
Variety: text, images, audio, video, sensor data, log files, etc.
Volume: Twitter and Facebook generating 550+ terabytes of data per day. 


Explain why do we need Hadoop?

Every day we are witnessing a large amount of unstructured data getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in organizations – data which is present in different machines at different locations. This is where Hadoop arises and addresses the problem. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.
What are some of the characteristics of Hadoop framework?
Hadoop framework is written in core Java. It is designed to solve problems that involve analyzing large data sets (e.g. petabytes). The programming model is based on Google’s MapReduce and the infrastructure is inspired by Google’s distributed file system (GFS). Hadoop handles large files/data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can be easily added to it without any effect.

Can you give examples of some companies that are using Hadoop structure?

Almost all social networking companies use Hadoop. Companies using the Hadoop structure are Facebook, eBay, Twitter, Cloudera, EMC, MapR, Hortonworks, Amazon, Google and so on.

Explain the basic difference between Hadoop and a traditional RDBMS ?

Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amount of data in the distributed file system and process it. RDBMS will be useful when you want to seek one record from Big data, whereas, Hadoop will be useful when you want Big data in one shot and perform analysis on that later.

Explain what do you mean by structured and unstructured data?

Structured data is the data that is easily identifiable as it is organized in the form of a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of videos, images, documents, email, logs and random text. It is not in the form of rows and columns.

What are the core components of Hadoop?

Core components of Hadoop are HDFS and MapReduce. MapReduce is used to process such large data sets and HDFS is basically used to store large data sets.

What is HDFS and what are the key features of HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is highly fault-tolerant, offers high throughput, is suitable for applications with large data sets, provides streaming access to file system data and can be built out of commodity hardware.
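
As a small illustration (not part of the original answer), a client could read a file from HDFS with the Java FileSystem API roughly as follows; the fs.default.name value and the file path are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // hypothetical Namenode URI
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/sample.txt");       // hypothetical file
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // stream the file contents line by line
            }
        }
    }
}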


What is Fault Tolerance?

Consider a scenario - you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations and to retrieve the file back, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system as backup.

Since replication causes data redundancy, it should be discouraged, but it is pursued in HDFS. Explain.

HDFS works with hardware systems with average configurations which have high chances of getting crashed or damaged at any time. In order to make the entire system highly fault-tolerant, HDFS replicates and stores data in at least three different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, data can still be accessed from the third one. As such, there is no chance of losing the data. This replication factor helps us attain the feature of Hadoop called fault tolerance.

Data is replicated 3 times in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?

Since there are three nodes, when we send the MapReduce programs, calculations will be done only on the original data and not on the other two. The master node will only know which node exactly has that particular data. In case, if one of the nodes is not responding, it is considered to be failed. Only then, the required calculation will be done on the second replica.

What do you mean by throughput? How does HDFS get a good throughput?

Throughput is nothing but the amount of work done in a unit time. It is used to describe how fast the data is getting accessed from the system. It is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared  among different systems in the network. Hence all the systems will be executing the tasks assigned to them independently and in parallel. As such, the work will be completed in a very short period of time thereby ensuring the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data in a significant way.

Explain what is streaming access?

HDFS works on the principle of ‘Write Once, Read Many‘, and streaming access is a very important feature in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

What do you mean by commodity hardware? Does commodity hardware include RAM?

Commodity hardware is a not-so-expensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don’t need supercomputers or high-end hardware to work on Hadoop. Commodity hardware must include RAM because there will be some processes or services running in RAM on each of these systems.

Explain what is a Namenode? Is Namenode also a commodity?

Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS. 

Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. Namenode has to be a high-availability machine.


What do you mean by metadata?

Metadata is "data about data". It is the information about the data stored in datanodes such as location of the file, size of the file and so on.

What is a Datanode?

Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients in the network.

Why do we use HDFS for applications having large data sets and not when there are lot of small files?

HDFS is more suitable for large amount of data sets in a single file as compared to small amount of data spread across multiple files. The reason is Namenode is a very expensive high performance system, so it is not prudent to occupy the space in the Namenode by unnecessary amount of metadata that is generated for multiple small files. So, when there is a large amount of data in a single file, name node will occupy less space. Hence for getting optimized performance, HDFS supports large data sets instead of multiple small files.

What is a daemon process?

Daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is “services” and in DOS it is “TSR”.

Explain what is a job tracker?

Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.

What is a task tracker?

Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

Is Namenode machine same as datanode machine as in terms of hardware?

It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing environment, Namenode and datanodes are on different machines.

What is a heartbeat in HDFS?

A heartbeat is a signal indicating that it is alive. A datanode sends a heartbeat to the Namenode and the task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive the heartbeat, they will decide that there is some problem in the datanode or that the task tracker is unable to perform the assigned task.

Are Namenode and job tracker on the same host?

No, in a practical environment, Namenode is on a separate host and job tracker is on a separate host.

What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks.

If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?

No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.
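
A quick way to confirm this from a client (a hedged sketch; the path is hypothetical) is to compare the block size and the actual file length reported by the FileSystem API: a 50 MB file reports the 64 MB default block size but only about 50 MB of length.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/file-50mb.dat"));
        System.out.println("Block size (bytes) : " + status.getBlockSize());
        System.out.println("File length (bytes): " + status.getLen());
    }
}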

Explain what are the benefits of block transfer?

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what the actual amount of space required is, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

If a data Node is full how it’s identified?

When data is stored in datanode, then the metadata of that data will be stored in the Namenode. So Namenode will identify if the data node is full.

If datanodes increase, then do we need to upgrade Namenode?

While installing the Hadoop system, the Namenode is determined based on the size of the cluster. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, only the metadata, so such a requirement rarely arises.

Are job tracker and task trackers present in separate machines?

Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

When we send a data to a node, do we allow settling in time, before sending another data to that node?

Yes, we do.

Does hadoop always require digital data to process?

Yes. Hadoop always requires digital data to be processed.

On what basis Namenode will decide which datanode to write on?

As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

Doesn’t Google have its very own version of DFS?

Yes, Google owns a DFS known as “Google File System (GFS)”  developed by Google Inc. for its own use.

Who is a ‘user’ in HDFS? Is client the end user in HDFS?

A user is someone who has a query or needs some kind of data. A client is an application which runs on your machine and is used to interact with the Namenode (job tracker) or datanode (task tracker). Hence the answer is no.

What is the communication channel between client and namenode/datanode?

The mode of communication is SSH(Secure Shell).

What is a rack? On what basis data will be stored on a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.
When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. While placing the datanodes, the key rule followed is “for every block of data, two copies will exist in one rack, third copy in a different rack“. This rule is known as “Replica Placement Policy“.

Do we need to place 2nd and 3rd data in rack 2 only?

Yes we have to place it so as to avoid datanode failure.

What if rack 2 and datanode fails?

If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting the data from them. In order to avoid such situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.
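
A minimal sketch of both options, assuming an existing client configuration: the default replication factor for new files comes from dfs.replication, and an existing (hypothetical) file's replication can also be raised afterwards with setReplication().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IncreaseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "4");          // new files will get 4 replicas
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication of an existing (hypothetical) file to 4 copies.
        fs.setReplication(new Path("/user/hadoop/important.dat"), (short) 4);
        fs.close();
    }
}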

What is a Secondary Namenode? Is it a substitute to the Namenode?

The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?

In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.

What is MapReduce? Can you explain how do ‘map’ and ‘reduce’ work?

Map Reduce is the ‘heart‘ of Hadoop that consists of two parts – ‘map’ and ‘reduce’. Maps and reduces are programs for processing data. ‘Map’ processes the data first to give some intermediate output which is further processed by ‘Reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.
The Namenode takes the input, divides it into parts and assigns them to datanodes. These datanodes process the tasks assigned to them, produce key-value pairs and return the intermediate output to the Reducer. The reducer collects these key-value pairs from all the datanodes, combines them and generates the final output.
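
The classic WordCount example (a condensed sketch, not taken from the original answer) shows this flow: 'map' emits intermediate (word, 1) pairs and 'reduce' combines them into the final counts.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every word in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Combine the intermediate pairs for each word into a final count.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}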

What is ‘Key value pair’ in HDFS?

Key value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.

Is map like a pointer?

No, Map is not like a pointer.

Do we require two servers for the Namenode and the datanodes?

Yes, we need to have two different servers for the Namenode and the datanodes. The reason for that is because Namenode requires highly configurable system as it stores information about the location details of all the files stored in different datanodes and on the other hand, datanodes require low configuration system.

Why are the number of splits equal to the number of maps?

The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.

Is a job split into maps?

No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a map is needed.


Can Hadoop be compared to NOSQL database like Cassandra?


Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it’s a filesystem (HDFS) and a distributed programming framework (MapReduce).

Which are the two types of ‘writes’ in HDFS?

There are two types of writes in HDFS: posted and non-posted write. Posted Write is when we write it and forget about it, without worrying about the acknowledgement. It is very similar to our traditional Indian post. In a Non-posted Write, we wait for the acknowledgement. It is similar to the today’s courier services. Naturally, non-posted write is more expensive than the posted write. It is much more expensive, though both writes are asynchronous.

Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?

Reading is always done in parallel because by doing so we can access the data fast. But we never perform the write operation in parallel. The reason is that if we perform the write operation in parallel, then it might result in data inconsistency which is really not acceptable. For example, you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice-versa. As such, this makes it confusing which data to be stored and accessed.
 
