
I have two dataframes:

  1. df1: the max of a date column from one dataframe :: one column, one row, column: maxdate

  2. df2: multiple records having a date column :: columns: col1, col2, col3, ..., colDate

I want to filter df2 based on df1.maxdate, i.e. df2.colDate > df1.maxdate.

If I specify a literal date like below, then it works:

df2.filter(col("colDate").gt(lit("2020-01-01")))

However, I'm not able to use df1.maxdate in the filter. I'm trying to achieve this solution in Java.

The data type is date in both DataFrame columns.

I'm trying to achieve the equivalent of this SQL through a Spark transformation:

select * from a 
where a.col > (select max(b.col) from b)

In my example:

Table a = df2
Table b = df1

2 Answers


The below code might be helpful for you:

val df1 = Seq("2020-01-02").toDF("Maxdate")

df1.show()

/*
+----------+
|   Maxdate|
+----------+
|2020-01-02|
+----------+
*/

val df2 = Seq(("2020-01-01","A","B"),("2020-01-03","C","D")).toDF("colDate","col1","col2")

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-01|   A|   B|
|2020-01-03|   C|   D|
+----------+----+----+
*/
val maxDate = df1.collect.map(row => row.getString(0)).mkString

df2.filter($"colDate" > maxDate).show()

/*
+----------+----+----+
|   colDate|col1|col2|
+----------+----+----+
|2020-01-03|   C|   D|
+----------+----+----+
*/

1 Comment

Your answer helped me, but 484 provided 3 options, hence I marked that as my answer. I don't have enough credit to upvote your answer right now; will do that soon.

Create temp views on the two dataframes, then use a SQL query to filter only the records after the required date.

Example:

Option 1: Using createTempView:

df1.show()
//+----------+
//|   Maxdate|
//+----------+
//|2020-01-01|
//+----------+

df2.show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-01|   A|   B|
//|2020-01-03|   C|   D|
//+----------+----+----+


df1.createOrReplaceTempView("tmp")

df2.createOrReplaceTempView("tmp1")

sql("select * from tmp1 where coldate > (select maxdate from tmp)").show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
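
For the Java part of the question, a minimal sketch of the same temp-view approach could look like the one below (the class name, local master, and inline sample data are illustrative assumptions, not part of the original answer; dates are kept as ISO strings, as in the Scala example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxDateFilter {
  public static void main(String[] args) {
    // Local session only for this illustration.
    SparkSession spark = SparkSession.builder()
        .appName("max-date-filter")
        .master("local[*]")
        .getOrCreate();

    // Sample data mirroring the dataframes shown above.
    Dataset<Row> df1 = spark.sql("select '2020-01-01' as Maxdate");
    Dataset<Row> df2 = spark.sql(
        "select * from values ('2020-01-01','A','B'), ('2020-01-03','C','D') "
            + "as t(colDate, col1, col2)");

    df1.createOrReplaceTempView("tmp");
    df2.createOrReplaceTempView("tmp1");

    // Same scalar-subquery filter as in Option 1.
    spark.sql("select * from tmp1 where colDate > (select Maxdate from tmp)").show();

    spark.stop();
  }
}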

Option 2: Using a dataframe variable:

Another way would be storing the max date into a variable and then using that variable in the dataframe filter.

val max_val=df1.collect()(0)(0).toString

df2.filter(col("colDate") > max_val).show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+

Option 3: Using dataframe crossJoin and expr:

In this case we are not creating a variable; instead we use the dataframe column directly to filter only the required rows.

df2.crossJoin(df1).
filter(expr("colDate > Maxdate")).
drop("Maxdate").
show()
//+----------+----+----+
//|   colDate|col1|col2|
//+----------+----+----+
//|2020-01-03|   C|   D|
//+----------+----+----+
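
A rough Java counterpart of this crossJoin approach could be the following (assuming df1 and df2 are already available as Dataset<Row>, for example as built in the Java sketch under Option 1; this is a sketch, not the original answer's code):

import static org.apache.spark.sql.functions.expr;

Dataset<Row> joined = df2.crossJoin(df1)      // bring Maxdate onto every df2 row
    .filter(expr("colDate > Maxdate"))        // keep only rows after the max date
    .drop("Maxdate");                         // drop the helper column again
joined.show();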

4 Comments

Thanks, I want to try out option 2: val max_val=df1.collect()(0)(0).toString. However, I tried to google df.collect for Java to convert it into Java syntax, but no luck :( Any idea where I can get the Java syntax for this? Any Spark documentation page would be helpful.
Java also has .collect(), but not that .toString usage; .collect() returns an array, so convert it to a string using Arrays.toString(arr). stackoverflow.com/questions/5283444/…
I have tried df2.select("maxdate").collect().toString() in Java, where my df2 has 1 column (maxdate) and 1 row. I'm kind of stuck here now. If I write df.collect()[0][0], it's not working. Also, if I write df.collect(), the return datatype is Row.
I used the collectAsList option and then extracted the value from the list. Thanks.
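
For reference, the collectAsList route mentioned in the last comment could look roughly like this in Java (assuming the same df1/df2 as in the Java sketch above; this is a sketch of the idea, not the commenter's exact code):

import static org.apache.spark.sql.functions.col;

// collectAsList() returns java.util.List<Row>; take the single row and its single column.
String maxDate = df1.select("Maxdate").collectAsList().get(0).get(0).toString();

df2.filter(col("colDate").gt(maxDate)).show();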
