
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large data sets across clusters of computers using simple programming models. A Hadoop application runs in an environment that provides distributed storage and computation across clusters of machines. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.




  • Traditional systems vs. the Hadoop system: HDFS and YARN
  • Why Spark? Introduction to RDDs
  • Loading data into Spark using the textFile and collect APIs; item-wise count and the reduceByKey transformation
  • Spark architecture: DAG, stages, tasks, driver, executors
  • YARN client and YARN cluster modes; error handling; accumulators
  • Shuffle join and broadcast join
  • mapPartitions; hash partitioning and custom partitioners; file formats: TextInputFormat
  • Sequence files and Avro files
  • reduce, fold, foldLeft, aggregateByKey
  • Spark SQL introduction: data sources, DataFrames, loading a CSV file
  • Reading JSON and XML files; JSON input format; multi-line JSON input format
  • Simple queries, joins, nested queries
  • Simple queries and joins using the DataFrame API; broadcast join; custom processing using UDFs and the transform API; renaming individual columns and all columns
  • Window operations: moving average, cumulative sum, previous visit, rank, updated records
  • Spark integration with Hive: Hive architecture; read and write operations on Hive tables using Spark; sanitisation example
  • ORC and Parquet file formats
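Several of the topics above (item-wise count, the reduceByKey transformation) follow the classic word-count pattern: map each item to a (key, 1) pair, then combine the values of each key with an associative function. The sketch below illustrates those semantics in plain Python rather than Spark, so it runs without a cluster; the helper name reduce_by_key is our own, not part of any Spark API.

```python
def reduce_by_key(pairs, combine):
    """Combine the values of each key with `combine`,
    mimicking the semantics of Spark's RDD.reduceByKey."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: merge the new value into the running result.
            result[key] = combine(result[key], value)
        else:
            # First occurrence of this key.
            result[key] = value
    return sorted(result.items())

# Item-wise count: map each item to (item, 1), then sum per key.
items = ["apple", "banana", "apple", "cherry", "banana", "apple"]
pairs = [(item, 1) for item in items]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # [('apple', 3), ('banana', 2), ('cherry', 1)]
```

In Spark the same result comes from `sc.textFile(...).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)`; because the combine function is associative, Spark can apply it within each partition before the shuffle, which is what makes reduceByKey cheaper than grouping all values first.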

