How to extract data from a column with variable number of values in each row in a CSV file?

Basically I'm trying to add together the count column values of item columns with the same name in a csv file. I then need to sort the results in ascending alphabetic order by item column values. For example:

Leading Cause, Deaths
Diabetes Mellitus, 123
Influenza and Pneumonia, 325
Diabetes Mellitus, 100

I need to add the values 123 and 100 to get a new row for Diabetes.

It should look like this:
    Diabetes Mellitus, 223.

This is the code I have so far:

import csv
import sys

with open(sys.argv[1], 'r') as File:
    reader = csv.reader(File)
    itemindex = sys.argv[2]
    countindex = sys.argv[3]
    item column = 0
    count column = 0
    first row = True
    dictionary = {}

    for row in reader:
       if firstrow == True:
          firstrow = False
          itemcolumn = row.index(itemindex)
          countcolumn = row.index(countindex)
           if item column in dictionary:
               # Add the item at the row's count column (converted to an int) to the
               # prexisting entry with that item column.
               #create a new entry in the dictionary

       print(itemindex + "," + countindex)

for key, value in sorted(dictionary)
    print(key + "," + value)

The commented parts are the ones I'm stuck on.

3 answers

  • answered 2018-03-13 20:39 M T Head

    You could try not using the libraries provided and just loop through the data as text. Parsing the text.

  • answered 2018-03-13 20:39 akozi

    If you have access you could use the pandas package to handle the csv.

    Text file titled values.txt

    Leading Cause, Deaths
    Diabetes Mellitus, 123
    Influenza and Pneumonia, 325
    Diabetes Mellitus, 1008

    The desired data frame can be achieved with:

    import pandas as pd
    data = pd.read_csv('values.txt')
    sum_data = data.groupby(['Leading Cause']).sum()
    print(sum_data.loc['Diabetes Mellitus'])

    Which would output

                 Leading Cause   Deaths
    0        Diabetes Mellitus      123
    1  Influenza and Pneumonia      325
    2        Diabetes Mellitus     1008
    Leading Cause                   
    Diabetes Mellitus           1131
    Influenza and Pneumonia      325
     Deaths    1131
    Name: Diabetes Mellitus, dtype: int64

  • answered 2018-03-13 20:39 martineau

    Using a collections.defaultdict, which is a specialized dictionary class, would make it easy:

    import collections
    import csv
    import os
    import sys
        filename = sys.argv[1]
        itemindex = int(sys.argv[2])
        countindex = int(sys.argv[3])
    except IndexError:
        print('Error:\n  Usage: {} <file name> <item index> <count index>'.format(
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file, skipinitialspace=True)
        next(reader)  # Skip first row.
        counter = collections.defaultdict(int)
        for row in reader:
            disease, deaths = row[itemindex], int(row[countindex])
            counter[disease] += deaths
    for key, value in sorted(counter.items()):
        print('{}, {}'.format(key, value))

    Sample usage:

    python3 diseases.csv 0 1    

    Sample output:

    Diabetes Mellitus, 223
    Influenza and Pneumonia, 325