
Is there a way to add a header row to a CSV without loading the whole file into memory in Python? I have an 18 GB CSV I want to add a header to, and all the methods I've seen require loading the CSV into memory, which is obviously infeasible.

  • Did you try the writeheader method of DictWriter? (docs.python.org/3/library/csv.html) I don't know the exact answer, just an idea: if you open the file with 'a' and try to use writeheader, would it work? It would be best to try this with a copy of the original file, in case something goes wrong. Commented Feb 1, 2017 at 14:45
  • See the updated answer for a speed comparison of the 3 methods. Commented Feb 6, 2017 at 12:44

3 Answers


You will need to rewrite the whole file. The simplest approach is not to use Python:

echo 'col1, col2, col3,... ' > out.csv
cat in.csv >> out.csv

Python-based solutions work at a much higher level and will be a lot slower; 18 GB is a lot of data, after all. It is better to use operating system functionality, which will be the fastest.


1 Comment

And if you don't want to leave Python, you can always wrap this in a subprocess.call(). Or even write the first line in Python, as in the answer by @maximilian-peters.
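For example, a minimal sketch of that idea (the file names are placeholders, not from the question):

import subprocess

# Write the header row from Python, then let `cat` stream the original file
# into place without ever loading it into Python's memory.
with open('out.csv', 'w') as out:
    out.write('col1,col2,col3\n')

with open('out.csv', 'a') as out:
    subprocess.call(['cat', 'in.csv'], stdout=out)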

Just use the fact that the csv module iterates over the rows, so it never loads the whole file into memory:

import csv

with open("huge_csv.csv") as fr, open("huge_output.csv","w",newline='') as fw:
    cr = csv.reader(fr)
    cw = csv.writer(fw)
    cw.writerow(["title1","title2","title3"])
    cw.writerows(cr)

Using writerows ensures very good speed, and memory is spared: everything is done line by line. Since the data is properly parsed and re-written, you could even change the separator and/or the quoting in the output file, as sketched below.
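For instance, a minimal variation of the snippet above that switches the output to semicolons with full quoting (the delimiter and quoting choices here are only illustrative):

import csv

with open("huge_csv.csv") as fr, open("huge_output.csv", "w", newline='') as fw:
    cr = csv.reader(fr)                                         # input stays comma-separated
    cw = csv.writer(fw, delimiter=';', quoting=csv.QUOTE_ALL)   # output uses ';' and quotes every field
    cw.writerow(["title1", "title2", "title3"])
    cw.writerows(cr)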



Here is a comparison of the three suggested solutions for a ~200 MB CSV file with 10^6 rows and 10 columns (n=50). The ratio stays approximately the same for larger and smaller files (10 MB to 8 GB).

cp:shutil:csv_reader 1:10:55

i.e. the shell-based approach (cp in the ratio above) is approximately 55 times faster than using Python's csv module.

Computer:

  • regular HDD
  • Python 3.5.2 64-bit
  • Ubuntu 16.04
  • i7-3770

[Plot of the benchmark run times for the three methods]


import csv
import random
import shutil
import time
import subprocess

rows = 10**6  # 10^6 rows of 10 random floats is roughly the ~200 MB file described above
cols = 10
repeats = 50

shell_script = '/tmp/csv.sh'
input_csv = '/tmp/temp.csv'
output_csv = '/tmp/huge_output.csv'
col_titles = ['titles_' + str(i) for i in range(cols)]

with open(shell_script, 'w') as f:
    f.write("#!/bin/bash\necho '{0}' > {1}\ncat {2} >> {1}".format(','.join(col_titles), output_csv, input_csv))
subprocess.call(['chmod', '+x', shell_script])

run_times = dict([
    ('csv_writer', list()),
    ('external', list()),
    ('shutil', list())
])

def random_csv():
    with open(input_csv, 'w') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        for i in range(rows):
            csv_writer.writerow([str(random.random()) for i in range(cols)])
    with open(output_csv, 'w'):
        pass

for r in range(repeats):
    random_csv()
    #http://stackoverflow.com/a/41982368/2776376
    start_time = time.time()
    with open(input_csv) as fr, open(output_csv, "w", newline='') as fw:
        cr = csv.reader(fr)
        cw = csv.writer(fw)
        cw.writerow(col_titles)
        cw.writerows(cr)
    run_times['csv_writer'].append(time.time() - start_time)

    random_csv()
    #http://stackoverflow.com/a/41982383/2776376
    start_time = time.time()
    subprocess.call(['bash', shell_script])
    run_times['external'].append(time.time() - start_time)

    random_csv()
    #http://stackoverflow.com/a/41982383/2776376
    start_time = time.time()
    with open('header.txt', 'w') as header_file:
        header_file.write(','.join(col_titles) + '\n')  # trailing newline so the data starts on its own row

    with open(output_csv, 'w') as new_file:
        with open('header.txt', 'r') as header_file, open(input_csv, 'r') as main_file:
            shutil.copyfileobj(header_file, new_file)
            shutil.copyfileobj(main_file, new_file)
    run_times['shutil'].append(time.time() - start_time)

    print('#'*20)
    for key in run_times:
        print('{0}: {1:.2f} seconds'.format(key, run_times[key][-1]))

print('#'*20)
print('Averages')
for key in run_times:
    print('{0}: {1:.2f} seconds'.format(key, sum(run_times[key])/len(run_times[key])))

If you really want to do it in Python, you could create the header file first and then merge it with your second file via shutil.copyfileobj:

import shutil
with open('header.txt', 'w') as header_file:
    header_file.write('col1;col2;col3\n')  # trailing newline so the data starts on a new row

with open('new_file.csv', 'w') as new_file:
    with open('header.txt', 'r') as header_file, open('main.csv', 'r') as main_file:
        shutil.copyfileobj(header_file, new_file)
        shutil.copyfileobj(main_file, new_file)

2 Comments

This benchmark can't be right. Finishing in 0.13 seconds means reading and writing at 1538 MB/s simultaneously. A "regular HDD" can't do that. Probably the file system needs to be flushed.
@Wups: Thanks for checking! I'll re-run the benchmarks and try to make sure everything is flushed.
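A minimal sketch of what "making sure everything is flushed" could look like for the shutil timing (this is an illustration, not part of the original benchmark; file names are placeholders):

import os
import shutil
import time

start_time = time.time()
with open('huge_output.csv', 'w') as new_file:
    with open('header.txt') as header_file, open('temp.csv') as main_file:
        shutil.copyfileobj(header_file, new_file)
        shutil.copyfileobj(main_file, new_file)
    new_file.flush()             # push Python's userspace buffer to the OS
    os.fsync(new_file.fileno())  # ask the OS to commit the data to disk before stopping the timer
print('shutil: {0:.2f} seconds'.format(time.time() - start_time))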
