• In-depth understanding of Hadoop and Spark architecture and RDD transformations.
• Proven experience developing solutions with Spark and PySpark for data engineering pipelines, transforming and aggregating data from a variety of sources into a data lake.
• At least 3 years of relevant experience developing PySpark programs using its APIs. Expertise in file formats such as Parquet and ORC.
• Experience troubleshooting and fine-tuning Spark and Python-based applications for scalability and performance.
• Experience designing Hive tables to handle high-velocity, high-variety, and high-volume data.
• Experience ingesting, processing, and analyzing data from disparate sources using Spark and SQL.
• Knowledge of spark-submit and the Spark UI. Experience creating and performing operations on Spark RDDs.
• Experience creating Spark DataFrames from RDDs, Hive tables, and Parquet files, and performing joins and aggregations on DataFrames (see the illustrative sketch after this list).
• Experience processing data with Python and other API modules.
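The following is a minimal, illustrative PySpark sketch of the kind of work described above: building a DataFrame from a Parquet file and from an RDD, then joining and aggregating. The paths, table names, and columns (e.g. /data/lake/orders.parquet, customer_id, amount) are hypothetical placeholders, not part of this posting.

```python
# Illustrative sketch only; paths, tables, and columns are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("pyspark-skills-sketch")
    .enableHiveSupport()  # allows reading Hive tables via spark.table(...)
    .getOrCreate()
)

# DataFrame from a Parquet file (hypothetical path)
orders = spark.read.parquet("/data/lake/orders.parquet")

# DataFrame from an RDD of tuples (hypothetical customer data)
customers_rdd = spark.sparkContext.parallelize([(1, "US"), (2, "IN"), (3, "DE")])
customers = customers_rdd.toDF(["customer_id", "country"])

# Reading a Hive table would look like (hypothetical database/table):
# hive_customers = spark.table("analytics.customers")

# Join and aggregate: total order amount per country
totals = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
)
totals.show()
```

A script like this would typically be launched with spark-submit (for example, `spark-submit --master yarn sketch.py`), and its stages and tasks inspected in the Spark UI.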