pyspark.SparkContext.sequenceFile
- SparkContext.sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
 Read a Hadoop SequenceFile with arbitrary key and value Writable classes from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows:
1. A Java RDD is created from the SequenceFile or other InputFormat, using the key and value Writable classes.
2. Serialization is attempted via pickling.
3. If this fails, the fallback is to call ‘toString’ on each key and value.
4. CPickleSerializer is used to deserialize pickled objects on the Python side.
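The pickle-then-fallback behaviour described above can be illustrated with a small pure-Python analogy. This is only a sketch of the idea, not Spark's code: the real conversion is applied to Writable objects before they reach Python, and the names `Unpicklable` and `to_python_side` are hypothetical, introduced here for illustration.

```python
import pickle


class Unpicklable:
    """Hypothetical stand-in for a value that cannot be pickled."""

    def __reduce__(self):
        # Force pickling to fail for instances of this class.
        raise pickle.PicklingError("not picklable")

    def __str__(self):
        return "Unpicklable()"


def to_python_side(obj):
    """Try a pickle round trip; fall back to the string form on failure."""
    try:
        return pickle.loads(pickle.dumps(obj))
    except Exception:
        # Analogue of calling toString on the key or value.
        return str(obj)


pairs = [(1, {3.0: "bb"}), (2, Unpicklable())]
converted = [(to_python_side(k), to_python_side(v)) for k, v in pairs]
print(converted)
```

In this sketch, the picklable pair comes back unchanged, while the value that refuses to pickle arrives as its string representation.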
New in version 1.3.0.
- Parameters
 - path: str
 path to the SequenceFile
- keyClass: str, optional
 fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
- valueClass: str, optional
 fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
- keyConverter: str, optional
 fully qualified name of a function returning key WritableConverter
- valueConverter: str, optional
 fully qualified name of a function returning value WritableConverter
- minSplits: int, optional
 minimum splits in dataset (default min(2, sc.defaultParallelism))
- batchSize: int, optional, default 0
 The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
- Returns
 - RDD
 RDD of tuples of key and corresponding value
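As a rough illustration of the parameters above, the following sketch passes the Writable class names explicitly rather than relying on the defaults. It assumes `sc` is an existing SparkContext; the helper name `read_text_long_pairs` and its `min_splits` argument are hypothetical and not part of the PySpark API.

```python
def read_text_long_pairs(sc, path, min_splits=None):
    """Read a SequenceFile of (Text, LongWritable) records as an RDD of tuples.

    `sc` is assumed to be an existing SparkContext (an assumption of this
    sketch). The class names are the examples given in the parameter list.
    """
    return sc.sequenceFile(
        path,
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.LongWritable",
        minSplits=min_splits,  # None leaves the documented default in effect
    )
```

A call such as `read_text_long_pairs(sc, path, min_splits=4)` would then return an RDD of key and value tuples, with `path` pointing at an existing SequenceFile on a file system reachable from all nodes.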
Examples
>>> import os
>>> import tempfile
Set the class of output format
>>> output_format_class = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
>>> with tempfile.TemporaryDirectory(prefix="sequenceFile") as d:
...     path = os.path.join(d, "hadoop_file")
...
...     # Write a temporary Hadoop file
...     rdd = sc.parallelize([(1, {3.0: "bb"}), (2, {1.0: "aa"}), (3, {2.0: "dd"})])
...     rdd.saveAsNewAPIHadoopFile(path, output_format_class)
...
...     collected = sorted(sc.sequenceFile(path).collect())
>>> collected
[(1, {3.0: 'bb'}), (2, {1.0: 'aa'}), (3, {2.0: 'dd'})]