pyspark.pandas.DataFrame.nlargest
DataFrame.nlargest(n: int, columns: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]]]) → pyspark.pandas.frame.DataFrame

Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant in pandas. In pandas-on-Spark, thanks to Spark's lazy execution and query optimizer, the two have the same performance (see the sketch below).
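As a minimal sketch of that equivalence (kdf is just an illustrative name; ps is pyspark.pandas imported as usual), both calls below select the same rows:

>>> import pyspark.pandas as ps
>>> kdf = ps.DataFrame({'a': [4, 1, 3, 2]})  # small illustrative frame
>>> kdf.nlargest(2, 'a')
   a
0  4
2  3
>>> kdf.sort_values('a', ascending=False).head(2)
   a
0  4
2  3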
Parameters
- n : int
  Number of rows to return.
- columns : label or list of labels
  Column label(s) to order by.
Returns
- DataFrame
  The first n rows ordered by the given columns in descending order.
See also
DataFrame.nsmallest : Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values : Sort DataFrame by the values.
DataFrame.head : Return the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.
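As a minimal sketch of that restriction (the traceback is abbreviated and the exact message may vary by version; sdf is just an illustrative name):

>>> sdf = ps.DataFrame({'s': ['a', 'b', 'c']})  # object-dtype column
>>> sdf.nlargest(1, 's')
Traceback (most recent call last):
...
TypeError: ...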
Examples

>>> df = ps.DataFrame({'X': [1, 2, 3, 5, 6, 7, np.nan],
...                    'Y': [6, 7, 8, 9, 10, 11, 12]})
>>> df
     X   Y
0  1.0   6
1  2.0   7
2  3.0   8
3  5.0   9
4  6.0  10
5  7.0  11
6  NaN  12
In the following example, we will use nlargest to select the three rows having the largest values in column "X".

>>> df.nlargest(n=3, columns='X')
     X   Y
5  7.0  11
4  6.0  10
3  5.0   9
>>> df.nlargest(n=3, columns=['Y', 'X'])
     X   Y
6  NaN  12
5  7.0  11
4  6.0  10