Compatibility
All Spark data types are defined in the `pyspark.sql.types` module in PySpark.
The table below shows how Spark data types are mapped to Python types and Arrow data types.
Spark Data Type | PySpark API | Python Type | Arrow Data Type |
---|---|---|---|
NullType | NullType() | - | Null |
BooleanType | BooleanType() | bool | Boolean |
ByteType | ByteType() | int | Int8 |
ShortType | ShortType() | int | Int16 |
IntegerType | IntegerType() | int | Int32 |
LongType | LongType() | int | Int64 |
- | - | - | UInt8 UInt16 UInt32 UInt64 |
- | - | - | Float16 |
FloatType | FloatType() | float | Float32 |
DoubleType | DoubleType() | float | Float64 |
- | - | - | Decimal32 Decimal64 |
DecimalType | DecimalType() | decimal.Decimal | Decimal128 Decimal256 |
StringType | StringType() | str | Utf8 LargeUtf8 |
CharType(n) | CharType(length: int) | str | Utf8 LargeUtf8 |
VarcharType(n) | VarcharType(length: int) | str | Utf8 LargeUtf8 |
- | - | - | Utf8View |
BinaryType | BinaryType() | bytearray | Binary LargeBinary |
- | - | - | FixedSizeBinary BinaryView |
TimestampType | TimestampType() | datetime.datetime | Timestamp(Microsecond, TimeZone(_)) |
TimestampNTZType | TimestampNTZType() | datetime.datetime | Timestamp(Microsecond, NoTimeZone) |
- | - | - | Timestamp(Second, _) Timestamp(Millisecond, _) Timestamp(Nanosecond, _) |
DateType | DateType() | datetime.date | Date32 |
- | - | - | Date64 |
- | - | - | Time32(Second) Time32(Millisecond) Time64(Microsecond) Time64(Nanosecond) |
YearMonthIntervalType | YearMonthIntervalType() | - | Interval(YearMonth) |
DayTimeIntervalType | DayTimeIntervalType() | datetime.timedelta | Duration(Microsecond) |
CalendarIntervalType | CalendarIntervalType() | - | Interval(MonthDayNano) |
- | - | - | Interval(DayTime) |
- | - | - | Duration(Second) Duration(Millisecond) Duration(Nanosecond) |
ArrayType | ArrayType(elementType, containsNull: bool = True) | list tuple | List |
- | - | - | LargeList FixedSizeList ListView LargeListView |
MapType | MapType(keyType, valueType, valueContainsNull: bool = True) | dict | Map |
StructType | StructType(fields) | list tuple | Struct |
- | - | - | Union |
- | - | - | Dictionary |
- | - | - | RunEndEncoded |
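As an illustration of the Python column of the table above, the following sketch pairs common Spark data type names with local Python values of the corresponding type. These values are constructed directly with the standard library for demonstration; they are not produced by a running Spark session.

```python
import datetime
import decimal

# Each entry pairs a Spark data type name with a Python value of the
# type that PySpark uses for that column when rows are collected.
examples = {
    "BooleanType": True,                                     # bool
    "LongType": 1_000_000_000,                               # int
    "DoubleType": 3.14,                                      # float
    "DecimalType": decimal.Decimal("1.50"),                  # decimal.Decimal
    "StringType": "hello",                                   # str
    "BinaryType": bytearray(b"\x00\x01"),                    # bytearray
    "TimestampType": datetime.datetime(2024, 1, 1, 12, 0),   # datetime.datetime
    "DateType": datetime.date(2024, 1, 1),                   # datetime.date
    "DayTimeIntervalType": datetime.timedelta(days=1),       # datetime.timedelta
    "ArrayType": [1, 2, 3],                                  # list (or tuple)
    "MapType": {"k": 1},                                     # dict
}

for spark_type, value in examples.items():
    print(f"{spark_type}: {type(value).__name__}")
```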
Notes
- DayTimeIntervalType in Spark has microsecond precision, so it is mapped to the Duration(Microsecond) Arrow type rather than the Interval(DayTime) Arrow type, which has only millisecond precision.
- YearMonthIntervalType and CalendarIntervalType in Spark are not supported in Python, so calling the `.collect()` method will raise an error for a DataFrame that contains these types.
- StringType, CharType(n), and VarcharType(n) in Spark are mapped to either the Utf8 or LargeUtf8 type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- BinaryType in Spark is mapped to either the Binary or LargeBinary type in Arrow, depending on the same configuration option.
- CalendarIntervalType in Spark has microsecond precision, while the Interval(MonthDayNano) Arrow type has nanosecond precision, so the supported range of calendar interval values differs between JVM Spark and Arrow.