# Compatibility

All Spark data types are defined in the `pyspark.sql.types` package in PySpark.
The table below shows how Spark data types are mapped to Python types and Arrow data types.
| Spark Data Type | PySpark API | Python Type | Arrow Data Type |
|---|---|---|---|
| NullType | NullType() | - | Null |
| BooleanType | BooleanType() | bool | Boolean |
| ByteType | ByteType() | int | Int8 |
| ShortType | ShortType() | int | Int16 |
| IntegerType | IntegerType() | int | Int32 |
| LongType | LongType() | int | Int64 |
| - | - | - | UInt8 UInt16 UInt32 UInt64 |
| - | - | - | Float16 |
| FloatType | FloatType() | float | Float32 |
| DoubleType | DoubleType() | float | Float64 |
| - | - | - | Decimal32 Decimal64 |
| DecimalType | DecimalType() | decimal.Decimal | Decimal128 Decimal256 |
| StringType | StringType() | str | Utf8 LargeUtf8 |
| CharType(n) | CharType(length: int) | str | Utf8 LargeUtf8 |
| VarcharType(n) | VarcharType(length: int) | str | Utf8 LargeUtf8 |
| - | - | - | Utf8View |
| BinaryType | BinaryType() | bytearray | Binary LargeBinary |
| - | - | - | FixedSizeBinary BinaryView |
| TimestampType | TimestampType() | datetime.datetime | Timestamp(Microsecond, TimeZone(_)) |
| TimestampNTZType | TimestampNTZType() | datetime.datetime | Timestamp(Microsecond, NoTimeZone) |
| - | - | - | Timestamp(Second, _) Timestamp(Millisecond, _) Timestamp(Nanosecond, _) |
| DateType | DateType() | datetime.date | Date32 |
| - | - | - | Date64 |
| TimeType | TimeType(precision: int = 6) | datetime.time | Time32(Second) Time32(Millisecond) Time64(Microsecond) |
| YearMonthIntervalType | YearMonthIntervalType() | - | Interval(YearMonth) |
| DayTimeIntervalType | DayTimeIntervalType() | datetime.timedelta | Duration(Microsecond) |
| CalendarIntervalType | CalendarIntervalType() | - | Interval(MonthDayNano) |
| - | - | - | Interval(DayTime) |
| - | - | - | Duration(Second) Duration(Millisecond) Duration(Nanosecond) |
| ArrayType | ArrayType(elementType, containsNull: bool = True) | list, tuple | List |
| - | - | - | LargeList FixedSizeList ListView LargeListView |
| MapType | MapType(keyType, valueType, valueContainsNull: bool = True) | dict | Map |
| StructType | StructType(fields) | list, tuple | Struct |
| - | - | - | Union |
| - | - | - | Dictionary |
| - | - | - | RunEndEncoded |
## Notes

- `DayTimeIntervalType` in Spark has microsecond precision and is mapped to the `Duration(Microsecond)` Arrow type. It is not mapped to the `Interval(DayTime)` Arrow type, which only has millisecond precision.
- `YearMonthIntervalType` and `CalendarIntervalType` in Spark are not supported in Python, so calling the `.collect()` method will raise an error for a DataFrame that contains these types.
- `StringType`, `CharType(n)`, and `VarcharType(n)` in Spark are mapped to either the `Utf8` or `LargeUtf8` type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- `BinaryType` in Spark is mapped to either the `Binary` or `LargeBinary` type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- `CalendarIntervalType` in Spark has microsecond precision, while the `Interval(MonthDayNano)` Arrow type has nanosecond precision, so the supported data range for calendar intervals differs between JVM Spark and Arrow.
- `TimeType` represents time-of-day values without a time zone. The `precision` parameter specifies the number of decimal digits following the decimal point in the seconds field. Spark 4.0 supports precision values `0`, `3`, and `6` (second, millisecond, and microsecond). The default precision is `6` (microsecond). Precision `0` and `3` map to `Time32` in Arrow, while precision `6` maps to `Time64(Microsecond)` in Arrow. Precision `9` (nanosecond) is not supported by Spark 4.0.
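The microsecond limit on day-time intervals is visible directly in the standard library, since `DayTimeIntervalType` values surface in Python as `datetime.timedelta` (a stdlib-only sketch; no Spark session involved):

```python
from datetime import timedelta

# timedelta, like Spark's DayTimeIntervalType and Arrow's
# Duration(Microsecond), has microsecond resolution.
assert timedelta.resolution == timedelta(microseconds=1)

# An interval normalizes losslessly at microsecond granularity,
# so nanosecond-precision Arrow durations cannot round-trip
# through timedelta without loss.
d = timedelta(days=1, seconds=2, microseconds=3)
assert d == timedelta(microseconds=86_402_000_003)
```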
