From the course: Big Data Analytics with Hadoop and Apache Spark
Parallel writes with bucketing
- [Instructor] As reviewed in the earlier videos, bucketing can be used to partition data when a column has a large number of unique values. In this case, we create buckets based on the product column. We will create three buckets. To bucket the data, we use the bucketBy method, specifying the number of buckets and the column to bucket by. We also want to save this data as a Hive table; calling saveAsTable with a table name saves the data to Hive. The Hive table is created under the spark-warehouse folder, as we are using a dummy HDFS system. We run an example query against the table to verify its contents. Let's execute this code now. We can see the list of databases and their locations using the spark.catalog.listDatabases method. This shows the folder where the actual files are stored. We can go to HDFS, under the spark-warehouse folder, to examine its contents. We can see the product bucket table stored here. It is stored as multiple part files, each…