Can Spark SQL not count correctly or can I not write SQL correctly?

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)

After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I registered that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34

... or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

... or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Databricks table output with column "count(DISTINCT CallType)" and single value "33".

Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

1 answer

  • answered 2018-03-13 20:27 das-g

    To answer the question

    Can Spark SQL not count correctly or can I not write SQL correctly?

    from the title: I can't write SQL correctly.

    Rule <insert number> of writing SQL: Think about NULL and UNDEFINED. The aggregate count(DISTINCT CallType) silently ignores NULLs, while the DataFrame's .distinct() keeps null as a value in its own right; hence 33 versus 34.

    %sql
    SELECT count(*)
    FROM (
      SELECT DISTINCT CallType
      FROM fireServiceCalls 
    )
    

    34
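    That inner SELECT DISTINCT keeps the NULL row, so the outer count(*) sees all 34 values, while count(DISTINCT CallType) skips NULLs and reports only 33. This is standard SQL semantics, not a Spark quirk; a minimal sketch with Python's built-in sqlite3 (toy table and values made up for illustration):

```python
import sqlite3

# Toy table: two real call types (one duplicated) and one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (CallType TEXT)")
conn.executemany("INSERT INTO calls VALUES (?)",
                 [("Alarms",), ("Marine Fire",), ("Alarms",), (None,)])

# count(DISTINCT ...) skips NULLs entirely.
skips_null = conn.execute(
    "SELECT count(DISTINCT CallType) FROM calls").fetchone()[0]

# Counting the rows of a DISTINCT subquery keeps the NULL row.
keeps_null = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM calls)").fetchone()[0]

print(skips_null, keeps_null)  # 2 3
```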

    Also, I apparently can't read:

    pault suggested in a comment

    With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

    Well, I actually thought of that myself. (Minus the sorting.) Except there wasn't any difference: there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:

    +--------------------------------------------+
    |CallType                                    |
    +--------------------------------------------+
    |Elevator / Escalator Rescue                 |
    |Marine Fire                                 |
    |Aircraft Emergency                          |
    |Confined Space / Structure Collapse         |
    |Administrative                              |
    |Alarms                                      |
    |Odor (Strange / Unknown)                    |
    |Lightning Strike (Investigation)            |
    |null                                        |
    |Citizen Assist / Service Call               |
    |HazMat                                      |
    |Watercraft in Distress                      |
    |Explosion                                   |
    |Oil Spill                                   |
    |Vehicle Fire                                |
    |Suspicious Package                          |
    |Train / Rail Fire                           |
    |Extrication / Entrapped (Machinery, Vehicle)|
    |Other                                       |
    |Transfer                                    |
    |Outside Fire                                |
    |Traffic Collision                           |
    |Assist Police                               |
    |Gas Leak (Natural and LP Gases)             |
    |Water Rescue                                |
    |Electrical Hazard                           |
    |High Angle Rescue                           |
    |Structure Fire                              |
    |Industrial Accidents                        |
    |Medical Incident                            |
    |Mutual Aid / Assist Outside Agency          |
    |Fuel Spill                                  |
    |Smoke Investigation (Outside)               |
    |Train / Rail Incident                       |
    +--------------------------------------------+
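    pault's sorting tip is worth taking literally: in ascending order both SQLite and (by default) Spark SQL sort NULL first, so a stray null row lands at the top of the list where it is hard to miss, instead of hiding mid-list as above. A sketch of the idea, again with sqlite3 and made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (CallType TEXT)")
conn.executemany("INSERT INTO calls VALUES (?)",
                 [("Marine Fire",), ("Alarms",), (None,), ("Alarms",)])

# Ascending sort puts the NULL row first, so it can't go unnoticed.
rows = conn.execute(
    "SELECT DISTINCT CallType FROM calls ORDER BY CallType").fetchall()
print(rows)  # [(None,), ('Alarms',), ('Marine Fire',)]
```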