Stop Rolling Your Own CSV Parser!
secretGeek .:dot Nuts about dot Net:.
home .: about .: sign up .: sitemap .: secretGeek RSS

Stop Rolling Your Own CSV Parser!

Would you write your own XML Parser? Only if you're f***ing crazy.

Yet developers constantly write their own "little" csv parsers.

How does this madness occur?

Step 1 -- Ignorance

"Oh this will be easy, I'll just read the file one line at a time, calling String.Split(',') to break each line into an array.

"Then I'll be able to refer to each item by number."

(You're already headed for stormy water... anything you do from now on will only drive you into the rocks harder and faster...)

continues...

Step 2 -- First Doubts

"Oops. I need to handle for commas, which are either escaped (by prefixing them with a special symbol) or contained inside quotes."

(So you decide to use regular expressions. After a bit of tinkering you've got a nice little regular expression that seems to work.)

(That ringing in your ears is Jamie Zawinski saying "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems. )

Step 3 -- Uh oh

"The quotes worked good at first... but quotes need to be escaped too. And sometimes there's double quotes, sometimes single quotes. Easy -- I'll just fix my regular expression."

Step 4 -- The Descent into Chaos

You start to adopt a 'test-driven' approach, only it's more of a 'panic driven' approach. You write numerous test cases for your unwieldy csv parser. It behaves nice.

You test it on more real world examples... it breaks your existing code and you need a new test case or two.

You begin to add new test cases, and trying always to do the simplest thing that will get the code to work.

It's now eight weeks since you said "I know! I'll just use String.Split(...". You have grown a long beard, which is particularly annoying as you are a woman. You have lost all boundaries in regard to personal hygiene. Managers circle your desk like vultures circling a wounded leopard.

Step 5 -- Enough!

You lift your head from the keyboard for just moment when a thought strike you. The problems you are facing have been faced before. You are re-inventing the wheel.

You download a code sample from the internet, and use your test cases to try them out. The downloaded code is much worse than what you've written yourself.

You download more samples from the internet. They're all broken. In. Different. Ways.

When you try to contact the developers of each library to see how they work, you find that the developers have generally retired and/or passed away and/or quit working in the IT industry. You consider how fortunate they are.

Step 5 -- Help me!

You go to the blog of someone you know and trust. You email that person. That person writes back and says, in big letters:

Just Use Marcos Meli's File Helpers.

The great thing about File Helpers is not just that it works, but that it is actively developed by Marcos, and if you need an improvement to it, you can contact Marcos (marcosdotnet at yahoo.com.ar). He's a real person who cares about getting his library to work properly. He's not just stopping at a 'good-enough' solution.

The other, and perhaps even greater, advantage is one you never dreamed of. Now you don't have to refer to fields by number. No more "myArray[4]" -- you can now say "myCustomer.Id".

The resulting code is so readable that you'll survive your next code inspection without getting your arms and legs torn off by Terry (Head Code Nazi and leader of the local chapter of The Programming Gestapo).

You can stop re-inventing the wheel and get on with your day job: cranking out more bugs, faster.

(But it's a good thing this experience gave you a chance to try out test driven development!)





'Daniel' on Wed, 13 Sep 2006 01:34:23 GMT, sez:

In the past I've used the ODBC text drivers. Haven't looked at a Dot Net equivalent though.



'lb' on Wed, 13 Sep 2006 01:52:50 GMT, sez:

i was gonna do a side track about the ODBC text drivers. they're excellent and worth while -- but they're still a potential dead end.

two problem i have with them. one, you've got this opaque technique with it's own limitations (e.g. max 256 columns [might be out of date]) that act as real brick walls when/if you hit them.

second, i see treating a csv file as a database table to be a potentially 'leaky abstraction' (in JoelSpeak)

having said that ymmv of course.



'Erv' on Wed, 13 Sep 2006 02:05:30 GMT, sez:

They do look interesting, but unfortunatly, they have an unacceptable license (at least for the company I work for). We'll have to keep rolling our own :(



'lb' on Wed, 13 Sep 2006 02:08:12 GMT, sez:

who doesn't have an acceptable license? the odbc text drivers? or filehelpers?



'Gavin' on Wed, 13 Sep 2006 04:01:26 GMT, sez:

I started reading: "Stop rolling your own CVS parser" and thought "What the hell!?!? nobody does that! Is this guy just making up stories now!?"

and then it clicked.

And then I felt really, really guilty.



'Farmer Jeb' on Wed, 13 Sep 2006 04:34:13 GMT, sez:

Are you just having a go at me because you KNOW I do this?!



'lb' on Wed, 13 Sep 2006 04:36:55 GMT, sez:

hey i do it too. over and over. in the last week i've worked with three different groups of people who have all done it too. and we've all hit the same problems. this is more of a go at myself than anyone jeb -- though the stringbuilder lesson we learnt today certainly *did* bring you to mind.



'Farmer JEB' on Wed, 13 Sep 2006 04:45:35 GMT, sez:

Yes, of course I told you about how when I eventually (1.5 months later) followed your suggestion of using a StringBuilder it sped my program up several orders of magnitude. The latest innovation was to write a wrapper around Split to put its results in a List instead of an array (to facilitate Removing subarrays of data efficiently).



'Dave' on Wed, 13 Sep 2006 05:09:04 GMT, sez:

Wow, what a coincidence. I just wrote a CSV parser today. It's certainly not the first time I've done this and it's not quite as bad as you make it out to be :) I'd never consider regular expressions for this so maybe that's why I don't have as many problems.

Historically, my issues tend to be with the people on the other end who like to randomly decide to move columns around, add new columns, etc. without telling me. No library can help with that unfortunately.



'Chris Wallace' on Wed, 13 Sep 2006 09:29:17 GMT, sez:

I stick to the good ole' pipe | as it "should" never be in legitimate text.



'anon' on Wed, 13 Sep 2006 10:01:06 GMT, sez:

Don't you work for a company that has built a 'little' csv parser?

http://msdn2.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx



'Marcos' on Wed, 13 Sep 2006 11:29:49 GMT, sez:

Leon
Thanks a lot for spread the FileHelpers to the community and for your comments about it =)

Dave:

"Historically, my issues tend to be with the people on the other end who like to randomly decide to move columns around, add new columns, etc. without telling me. No library can help with that unfortunately."

If you the FileHelpers you only need to move a field up or down and you can make some checks of types and length to ensure that your files are not modified.

Cheers





'b0n' on Wed, 13 Sep 2006 12:19:29 GMT, sez:

This sounds fantastic. I was always too humble/indignant to write my own CSV parser, but instead wasted those 1.5 months searching for decent code on the Internet. What a load. It's nice to see somebody focusing on the basics.

Now if you could get Microsoft to package this with every crappy copy of SQL Server Express, you'd be up for a Nobel peace prize. I was in a warlike mood most of the time. It should be easy!!!

Thanks for pointing it out Leon. Thanks for doing it Carlos.



'lb' on Wed, 13 Sep 2006 20:48:22 GMT, sez:

>Don't you work for a company that has
>built a 'little' csv parser?

hey i don't work for microsoft! i work for advantechsoftware.com.au
We're better than microsoft ;-) Smaller though.



'Haacked' on Wed, 13 Sep 2006 23:28:04 GMT, sez:

Why write a CSV parser when you can write a domain specific programming language for parsing CSV files?



'A Nony Mouse' on Mon, 18 Sep 2006 18:07:27 GMT, sez:

Score one for XML - you are guaranteed a parser that can work

There is "Standard" for CSV files so any solution can fail on your specific data, and probably it does.



'Julian Morrison' on Wed, 08 Nov 2006 07:09:40 GMT, sez:

Experienced coders would spot the danger instantly. Any delimited format holding arbitrary data needs quoting, which rules out pure regex and implies a need for escapes. Someone who's been around the block a few times will see all of the above in a single glance and go straight to a parser generator.



'Dave' on Wed, 08 Nov 2006 11:18:58 GMT, sez:

Writing a CSV parser is not hard, takes only a few hours tops with unit tests--but only for programmers who understand the facts you list. I hope all beginning programmers see this webpage and decide whether they understand the rules or not.

How to write it: Write the parser until it starts to run on simple cases. Then stop. Figure out what few simple homemade parsing functions would help make you parser handle all official rules. Then start all over and write the parser so that it is relatively simple due to calling your smart parsing functions.

If you control the data being input, you can restrict it to a subset of CSV rules. Programming that takes only minutes.

Write it yourself to avoid licensing problems. Also so that you can re-use it on other projects.

But, if you are reading in major amounts of data from outside of your control, you should buy (with money) a real full-blown CSV parser library. Or, expect to spend months on all the degenerative cases for poorly formed input data. Otherwise, what happens when your client Foobar Industries sends you a gigabyte file of pseudo CSV and you must load it *today*. It doesn't quite follow the official rules. Your parser needs to be very smart.



'Pierre' on Wed, 08 Nov 2006 14:24:05 GMT, sez:

Thanks for the link to Marcos, looks like a great utility.

(BTW I did write my own CSV/tab-delimited parser in a couple of days).

Also, the TimeSnapper plug is fantastic. Just what I've been looking for.



'RegExp Santero' on Thu, 09 Nov 2006 01:16:07 GMT, sez:

This regex matches quotes containing backslashed quotes signs:

"[^\\"]*(\\.[^\\"]*)*"

It's pure evil, but I learnt it by heart (if you stare at it long enough it starts making sense) and now it serves me well.

Once you can barf that off your fingers at command csv parsing it's finally feasible.



'Sébastien Lorion' on Wed, 22 Nov 2006 16:55:56 GMT, sez:

If you need *real* performance AND flexibility, you can try my CsvParser library on CodeProject:

http://www.codeproject.com/cs/database/CsvReader.asp

I know ... shameless plug ;)



'Sébastien Lorion' on Wed, 22 Nov 2006 17:00:33 GMT, sez:

BTW, nice blog, just discovered it ;)



'Percival' on Fri, 24 Nov 2006 03:58:58 GMT, sez:

I've used Sebastien's CSV parser --- WELL DONE mate! Excellent and integrated with what I was doing... But then I found Odbc Text drivers and it allows csv to be queried using SQL - so i stuck with it. But all in all I give a big thumbs up for Sebastien's piece of excellent work -- and it's free too! Thanks.



'Percival' on Fri, 24 Nov 2006 04:01:02 GMT, sez:

BTW... I found Seb's CSV parser about 5 months ago and it was the best (and fastest) by far after going through weeks of testing the others. However, I moved to Odbc text drivers only because of SQL query capabilities. I am keeping Seb's CSV parser close at hand in case I may require it in future. Thanks again Seb.



'Sébastien Lorion' on Mon, 27 Nov 2006 08:55:40 GMT, sez:

Your welcome! Always glad to be able to pay back when I can ;)



'http://mandar.date.googlepages.com' on Fri, 02 Mar 2007 11:03:51 GMT, sez:

If you want simple function, try this..
http://www.codeproject.com/useritems/Basic_CSV_Parser_Function.asp



'lb' on Sat, 03 Mar 2007 03:15:52 GMT, sez:

Thanks Mandar -- i like the approach of your code.



'TheDigitalHippy' on Fri, 09 Mar 2007 21:30:14 GMT, sez:

Ok. I'll admit it.. I've done the parser thing. Well, the vultures (managers) wouldn't get near my desk. I bite.

But the "Error: unknown error" is really freaking me out man!

Thanks for the link!
TheDigitalHippy



'Jeff Zanooda' on Sat, 10 Mar 2007 00:28:41 GMT, sez:

I'm sorry, I don't get it. What's so hard about writing a CSV parser? It's just a finite state machine (one state variable + one large switch statement). And why would one want to use regular expressions here?



'radix' on Sat, 10 Mar 2007 08:15:04 GMT, sez:

@jeff:

well, then sit down and try it.
this is not talking about data that you can control, this is about handling data which you can`t and which is not (allways) following rules.
But still you have to turn garbage to data.



'zepolen' on Sat, 10 Mar 2007 08:34:18 GMT, sez:

or you could just make sure you escape the commas before they go into the csv file...a lot simpler don't ya think?

<?
$file='bob.csv';
$fieldnames = array('ID', 'Name', 'Description');
$lines = file($file);
$l = 0;
foreach($lines as $line) {
$line = str_replace('\\,', '$$££**COMMA**$$££', $line);
$fields = explode(',', $line);
$i = 0;
foreach($fieldnames as $fieldname) {
$value = str_replace('$$££**COMMA**$$££', ',', $fields[++$i]);
$csv[$l][$fieldname] = $value;
}
$l++;
}
print_r($csv);
?>

even php can take care of it



'lb' on Sat, 10 Mar 2007 09:48:00 GMT, sez:

"escape the commas before they go into the csv file"

sorry Zepolen -- but generally when you have a nasty csv parsing problem, you're dealing with someone else's csv -- so how you would write it is irrelevant.

i like (and am a little amused by) your code -- replacing the commas with a weird string, then replacing the weird string later. i've done it myself... but it's not a great trick in the long run.

lb



'pop' on Sat, 10 Mar 2007 12:28:16 GMT, sez:

@zepolen check out fgetcsv in the PHP manual.



'Bob' on Sun, 11 Mar 2007 21:41:18 GMT, sez:

Use tab-delimited. Anything that can output CSV can output tab-delimited. About the only thing that makes a mess is storing file locations on Windows.



'lb' on Sun, 11 Mar 2007 21:43:53 GMT, sez:

bob you freakin reddit troll nitwit --

"generally when you have a nasty csv parsing problem, you're dealing with someone else's csv"

got that?? it's not about how you GENERATE the csv -- it's about how you parse it.

And TABS offer no benefit over commas, you undergrad, inexperienced twerp.

damn i'm sick of stupid comments. I blame Des Traynor ;-)



'Augusto' on Wed, 14 Mar 2007 00:42:02 GMT, sez:

Sadly, File Helpers is not cross platform enough. For something so simple, that's a big limitation.



Comments closed due to spam. Sorry.

Articles

21 tools used in our MicroISV 21 tools used in our MicroISV
Lost Treasures of the DOS World: tree! Lost Treasures of the DOS World: tree!
The Virtual Machine Machine and the Virtual Virtual Machine The Virtual Machine Machine and the Virtual Virtual Machine
Should Linq To Sql Go Should Linq To Sql Go "Open Source"?
Redux: New Synchronisation Idea Overlooked By Microsoft Redux: New Synchronisation Idea Overlooked By Microsoft
New Synchronisation Idea Overlooked By Microsoft Live team New Synchronisation Idea Overlooked By Microsoft Live team
Visual Studio UX Taskforce, Office UX Taskforce... etc. Visual Studio UX Taskforce, Office UX Taskforce... etc.
How to be Jeff Atwood How to be Jeff Atwood
Reuse good -- Abstractions better! Reuse good -- Abstractions better!
Word of the day: Upsert Word of the day: Upsert
awesome Atli tooltip awesome Atli tooltip
Leaky Abstractions? Try Asp.Net! Leaky Abstractions? Try Asp.Net!
A truly original thought A truly original thought
Throw New VirtualMachine(); Throw New VirtualMachine();
Undo, redo, predo, preundo Undo, redo, predo, preundo
A little A little "bit" of programming therapy
Coding Koan: the power of one Coding Koan: the power of one
Behavior Driven Development: As Human As Possible Behavior Driven Development: As Human As Possible
What To (Really) Do If You Find Out Your Parents Are Using Vista (redux) What To (Really) Do If You Find Out Your Parents Are Using Vista (redux)
What To Do If You Find Out Your Parents Are Using Vista What To Do If You Find Out Your Parents Are Using Vista
Sample Code From Text-Adventure Game Platforms Sample Code From Text-Adventure Game Platforms
TimeSnapper 3.0 -- an interactive, bubbling cauldron of possibilities TimeSnapper 3.0 -- an interactive, bubbling cauldron of possibilities
The laptop compubody sock The laptop compubody sock
Everything that's bad for you is suddenly good for you! Everything that's bad for you is suddenly good for you!
Everything I know about Code Reviews I learnt from Star Wars (and JCooney) Everything I know about Code Reviews I learnt from Star Wars (and JCooney)
Syntax highlighting of strings Syntax highlighting of strings
Google AppEngine: evil virus or viral evil? Google AppEngine: evil virus or viral evil?
Workflow software: I'm calling the bluff. Workflow software: I'm calling the bluff.

Archives .: secretGeek :: Complete Archives :.
25 steps for building a Micro-ISV 25 steps for building a Micro-ISV
3 minute guides -- babysteps in new technologies: powershell, JSON, watir, F# 3 Minute Guide Series
Top 10 SecretGeek articles Top 10 SecretGeek articles

Downloads

TimeSnapper -- Automated Screenshot Journal TimeSnapper.com    
Version 2.5: with password protection

ShinyPower (help with Powershell) ShinyPower
Now at CodePlex

Next Action NextAction
Managing the top of your mind



[powered by Google] 


Thai Erawan, Brisbane Restaurant, delicious thai food in paddington Thai Erawan, Brisbane Restaurant
World's Simplest Code Generator (html edition) World's Simplest Code Generator
Gradient Maker -- a tool for making background images that blend from one colour to another. Forget photoshop, this is the bomb. Gradient Maker
How to be depressed How to be depressed
You are not inadequate.



Recommended Reading

The Best Software Writing I
The Business Of Software (Eric Sink)

Recommended blogs

Jeff Atwood
Reginald Braithwaite
Joseph Cooney
Phil Haack
Scott Hanselman
Julia Lerman
Joel Pobar
Eric Sink
Joel Spolsky
Des Traynor

Aggregated Links

programming.reddit.com
dzone
dot net kicks

Human Link Machines

interesting finds
a continuous learner's weblog
arjan's world
n links today
new and notable
morning coffee
learning .net
weekly link post
(my del.icio.us account)

LinkedIn profile
 
home .: about .: sign up .: sitemap .: secretGeek RSS .: © Leon Bambrick 2006 .: privacy

home .: about .: sign up .: sitemap .: RSS .: © Leon Bambrick 2006 .: privacy