# Compatibility
All Spark data types are defined in the `pyspark.sql.types` module in PySpark.
The table below shows how Spark data types map to Python types and Arrow data types.
| Spark Data Type | PySpark API | Python Type | Arrow Data Type |
|---|---|---|---|
| NullType | NullType() | - | Null |
| BooleanType | BooleanType() | bool | Boolean |
| ByteType | ByteType() | int | Int8 |
| ShortType | ShortType() | int | Int16 |
| IntegerType | IntegerType() | int | Int32 |
| LongType | LongType() | int | Int64 |
| - | - | - | UInt8, UInt16, UInt32, UInt64 |
| - | - | - | Float16 |
| FloatType | FloatType() | float | Float32 |
| DoubleType | DoubleType() | float | Float64 |
| - | - | - | Decimal32, Decimal64 |
| DecimalType | DecimalType() | decimal.Decimal | Decimal128, Decimal256 |
| StringType | StringType() | str | Utf8, LargeUtf8 |
| CharType(n) | CharType(length: int) | str | Utf8, LargeUtf8 |
| VarcharType(n) | VarcharType(length: int) | str | Utf8, LargeUtf8 |
| - | - | - | Utf8View |
| BinaryType | BinaryType() | bytearray | Binary, LargeBinary |
| - | - | - | FixedSizeBinary, BinaryView |
| TimestampType | TimestampType() | datetime.datetime | Timestamp(Microsecond, TimeZone(_)) |
| TimestampNTZType | TimestampNTZType() | datetime.datetime | Timestamp(Microsecond, NoTimeZone) |
| - | - | - | Timestamp(Second, _), Timestamp(Millisecond, _), Timestamp(Nanosecond, _) |
| DateType | DateType() | datetime.date | Date32 |
| - | - | - | Date64 |
| - | - | - | Time32(Second), Time32(Millisecond), Time64(Microsecond), Time64(Nanosecond) |
| YearMonthIntervalType | YearMonthIntervalType() | - | Interval(YearMonth) |
| DayTimeIntervalType | DayTimeIntervalType() | datetime.timedelta | Duration(Microsecond) |
| CalendarIntervalType | CalendarIntervalType() | - | Interval(MonthDayNano) |
| - | - | - | Interval(DayTime) |
| - | - | - | Duration(Second), Duration(Millisecond), Duration(Nanosecond) |
| ArrayType | ArrayType(elementType, containsNull: bool = True) | list, tuple | List |
| - | - | - | LargeList, FixedSizeList, ListView, LargeListView |
| MapType | MapType(keyType, valueType, valueContainsNull: bool = True) | dict | Map |
| StructType | StructType(fields) | list, tuple | Struct |
| - | - | - | Union |
| - | - | - | Dictionary |
| - | - | - | RunEndEncoded |
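As a quick sanity check of the Python column above, the sketch below pairs sample values of the kind `.collect()` would return with the Python types listed in the table. It uses only the standard library, so PySpark itself is not needed to run it; the sample values are illustrative, not taken from any real DataFrame.

```python
import datetime
import decimal

# Illustrative Python-side values for rows returned by .collect(),
# keyed by the Spark data type from the mapping table above.
examples = {
    "BooleanType": True,                                    # bool
    "IntegerType": 7,                                       # int
    "DoubleType": 3.14,                                     # float
    "DecimalType": decimal.Decimal("1.50"),                 # decimal.Decimal
    "StringType": "hello",                                  # str
    "BinaryType": bytearray(b"\x00\x01"),                   # bytearray
    "TimestampType": datetime.datetime(2024, 1, 1, 12, 0),  # datetime.datetime
    "DateType": datetime.date(2024, 1, 1),                  # datetime.date
    "DayTimeIntervalType": datetime.timedelta(days=1),      # datetime.timedelta
    "ArrayType": [1, 2, 3],                                 # list
    "MapType": {"k": "v"},                                  # dict
}

# The Python type each Spark type maps to, per the table.
expected = {
    "BooleanType": bool, "IntegerType": int, "DoubleType": float,
    "DecimalType": decimal.Decimal, "StringType": str,
    "BinaryType": bytearray, "TimestampType": datetime.datetime,
    "DateType": datetime.date, "DayTimeIntervalType": datetime.timedelta,
    "ArrayType": list, "MapType": dict,
}

for name, value in examples.items():
    assert type(value) is expected[name], name
```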
## Notes
- DayTimeIntervalType in Spark has microsecond precision and is mapped to the Duration(Microsecond) Arrow type. It is not mapped to the Interval(DayTime) Arrow type, which only has millisecond precision.
- YearMonthIntervalType and CalendarIntervalType in Spark are not supported in Python, so calling the `.collect()` method raises an error for a DataFrame that contains these types.
- StringType, CharType(n), and VarcharType(n) in Spark are mapped to either the Utf8 or the LargeUtf8 type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- BinaryType in Spark is mapped to either the Binary or the LargeBinary type in Arrow, also depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- CalendarIntervalType in Spark has microsecond precision, while the Interval(MonthDayNano) Arrow type has nanosecond precision, so the supported data range for calendar intervals differs between JVM Spark and Arrow.
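The microsecond precision noted above for DayTimeIntervalType matches the resolution of Python's `datetime.timedelta`, which is why the Duration(Microsecond) mapping is lossless. A small standard-library sketch (no PySpark required):

```python
import datetime

# datetime.timedelta, like Spark's DayTimeIntervalType, has microsecond
# resolution -- consistent with the Duration(Microsecond) Arrow mapping.
assert datetime.timedelta.resolution == datetime.timedelta(microseconds=1)

# An interval of 1 day, 2 hours, 3.000004 seconds round-trips through a
# whole-microsecond count without loss.
iv = datetime.timedelta(days=1, hours=2, seconds=3, microseconds=4)
micros = iv // datetime.timedelta(microseconds=1)  # total whole microseconds
assert datetime.timedelta(microseconds=micros) == iv
```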
