Using proper functional style in a file processing task

I have an input CSV file and need to generate an output file that has one line for each input line. Each input line is of a specific type (say "old" or "new") that can be determined only by processing the line.

In addition to generating the output file, we also want to print a summary of how many lines of each type were in the input file. My actual task involves generating different SQL statements based on the input line type, but to keep the example code focused, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type; the actual logic is more involved.

I have the following code and it does the job. However, to retain a functional style for generating the summary, I chose to return a keyword signifying each line's type and built a lazy sequence of these keywords to summarize at the end. In an imperative style, we would simply increment a counter for each line type; generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would write that just once.

Any suggestions for eliminating the two problems I have identified (and others)?

(ns file-proc.core
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (map #(proc-line % writer))
         frequencies
         println)))

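For reference, a run looks like the following (counts are illustrative, since func picks a type at random):

(generate-report "input.csv" "output.csv")
;; prints something like {:new 12, :old 8}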
2 answers

  • answered 2018-02-13 03:19 Minh Tuan Nguyen

    IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:

    (defn count-lines
      ([lines] (count-lines lines 0 0))
      ([lines count-old count-new]
       (if-let [line (first lines)]
         (if (func line)
           (recur (rest lines) count-old (inc count-new))
           (recur (rest lines) (inc count-old) count-new))
         {:new count-new :old count-old})))
    (defn generate-report [from to]
      (with-open [reader (io/reader from)
                  writer (io/writer to)]
        (let [lines (rest (csv/read-csv reader))
              freqs (count-lines lines)]
          (doseq [line lines]
            (.write writer (str line "\n")))
          freqs)))
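
    As a quick illustration (my example, not part of the answer), count-lines can be exercised on an in-memory sequence; the exact split varies per run because func is random:

    (count-lines [["a" "1"] ["b" "2"] ["c" "3"]])
    ;; => {:new 2, :old 1}   ; for example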

  • answered 2018-02-13 03:19 Taylor Wood

    I'd try to separate data processing from side effects like reading/writing files. That way the IO operations stay at opposite ends of the pipeline, and the "middle" processing logic is agnostic of where the input comes from and where the output is going.

    (defn rand-bool [] (rand-nth [true false]))
    (defn proc-line [line]
      (if (rand-bool)
        [line :new]
        [(reverse line) :old]))
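
    A quick REPL check (my example, not from the answer; the tag is chosen at random on each call):

    (proc-line ["a" "b" "c"])
    ;; => [["a" "b" "c"] :new]
    ;; or, on another run:
    ;; => [("c" "b" "a") :old]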

    proc-line no longer takes a writer; it only cares about the line, and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either; we should let csv/write-csv do that. Now you could do something like this:

    (defn process-lines [reader]
      (->> (csv/read-csv reader)
           (map proc-line)))
    (defn generate-report [from to]
      (with-open [reader (io/reader from)
                  writer (io/writer to)]
        (let [lines (process-lines reader)]
          (csv/write-csv writer (map first lines))
          (frequencies (map second lines)))))

    This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:

    (defn generate-report [from to]
      (with-open [reader (io/reader from)
                  writer (io/writer to)]
        (let [freqs (atom {})]
          (->> (csv/read-csv reader)
               ;; processing starts
               (map (fn [line]
                      (let [[row tag] (proc-line line)]
                        ;; record the tag, then pass the row through
                        (swap! freqs update tag (fnil inc 0))
                        row)))
               ;; processing ends
               (csv/write-csv writer))
          @freqs)))

    I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will hold a map like {:old 23, :new 31}, and @freqs is the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
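
    One possible generalization along those lines (my sketch, not spelled out in the answer) folds the write and the count into a single reduce, which keeps the single pass over the input but removes the atom:

    (defn generate-report [from to]
      (with-open [reader (io/reader from)
                  writer (io/writer to)]
        (reduce (fn [freqs line]
                  (let [[row tag] (proc-line line)]
                    ;; write the processed row, then bump its tag count
                    (csv/write-csv writer [row])
                    (update freqs tag (fnil inc 0))))
                {}
                (csv/read-csv reader))))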