Initial problem: simple. Given a folder of .json files, extract attributes and write them to another file. Instead of relying on my trusty Groovy, I took this opportunity to implement it in NodeJS.
The first attempt was straightforward: read the folder, and for each file, parse the JSON, open a new file, and write the attributes out.
var fs = require('fs');
var path = require('path');
var util = require('util');

var folder = '/temp/json/';
for (var file of fs.readdirSync(folder)) {
  var json = JSON.parse(fs.readFileSync(path.join(folder, file)));
  var out = fs.createWriteStream(path.join(folder, file.slice(0, -5) + '.csv'));
  for (var item of json.item) {
    out.write(util.format('%s,%s\n', item.id, item.title));
  }
  out.end();
}
Note: exception handling, file-type checking, etc. have been removed to keep the snippets concise and focused on the relevant aspects.
Tested this on a folder with 1 file first. Good, the output is correct. Tested on 10 files. Same correct output. Now for the first batch of 1000.
It took some time to run, but only 0-byte output files were created, and the rate of new file creation slowed down over time. More tests with fewer files showed that the output was written only after the program ended. Aha! Buffered writes.
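A quick aside to make the buffering visible (a standalone sketch with a made-up path, not part of the actual script): in a tight synchronous loop the event loop never gets a turn, so nothing reaches disk, and write() starts returning false once the stream's internal buffer passes its high-water mark (16 KB by default for writable streams).

var fs = require('fs');

// Standalone sketch: a tight synchronous loop never yields to the event
// loop, so the rows pile up in memory instead of reaching disk.
var out = fs.createWriteStream('/temp/demo.csv'); // made-up path
for (var i = 0; i < 1000000; i++) {
  if (!out.write(i + ',some row\n')) {
    // false means the internal buffer passed the high-water mark;
    // the stream is asking us to wait for 'drain' before writing more.
    console.log('buffer full after %d writes', i + 1);
    break;
  }
}
out.end();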
That would still be fine, since I'd get the correct results at the end of the batch. But before I reach the end, I get this error, which discards all my buffered writes…
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Not ready to give up (or to just repeat runs with smaller batches), I turned to Google.
This guy has the same problem: no writing before the program ends.
http://grokbase.com/t/gg/nodejs/125e84345w/how-to-flush-a-writestream-before-the-program-is-done-executing
The event-driven model… awkward for this case, but I refactored the script to schedule the work with process.nextTick().
var folder = '/temp/json/';
for (var file of fs.readdirSync(folder)) {
  process.nextTick(function(file) {
    var json = JSON.parse(fs.readFileSync(path.join(folder, file)));
    var out = fs.createWriteStream(path.join(folder, file.slice(0, -5) + '.csv'));
    for (var item of json.item) {
      out.write(util.format('%s,%s\n', item.id, item.title));
    }
    out.end();
  }.bind(null, file));
}
Nope, didn't help. Is it because all the calls were scheduled on the same "next tick"?
Let's push each file onto its own subsequent tick.
var folder = '/temp/json/';
var files = fs.readdirSync(folder);

function json2csv(index) {
  if (index >= files.length) return;
  var file = files[index];
  var json = JSON.parse(fs.readFileSync(path.join(folder, file)));
  var out = fs.createWriteStream(path.join(folder, file.slice(0, -5) + '.csv'));
  for (var item of json.item) {
    out.write(util.format('%s,%s\n', item.id, item.title));
  }
  out.end();
  process.nextTick(json2csv.bind(null, index + 1));
}
process.nextTick(json2csv.bind(null, 0));
Still no. In hindsight, process.nextTick callbacks run before the event loop returns to its I/O phase, so chaining them still never gives the streams a chance to flush.
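For what it's worth, here's a hindsight sketch (not something I tried at the time): swapping the scheduling line in the version above from process.nextTick to setImmediate would let the event loop service pending I/O between files, although it still applies no real backpressure.

// Hindsight sketch: unlike process.nextTick, setImmediate callbacks run
// after the poll (I/O) phase, so buffered writes get a chance to reach
// disk between files.
setImmediate(json2csv.bind(null, index + 1));

Time to try the 2nd suggestion instead. out.write() did indeed return false after some writes.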
var folder = '/temp/json/';

function json2csv(files, start) {
  for (var i = start; i < files.length; i++) {
    var file = files[i];
    var json = JSON.parse(fs.readFileSync(path.join(folder, file)));
    var out = fs.createWriteStream(path.join(folder, file.slice(0, -5) + '.csv'));
    var ok = true;
    for (var item of json.item) {
      ok = out.write(util.format('%s,%s\n', item.id, item.title));
    }
    if (ok) {
      out.end();
      continue;
    }
    // The buffer is over the high-water mark: end this file once it
    // drains, then pick up the remaining files from where we stopped.
    out.once('drain', function() {
      out.end();
      json2csv(files, i + 1);
    });
    return;
  }
}
json2csv(fs.readdirSync(folder), 0);
And... it works! So much for starting with a 10-line script.
It may not be the best tool for the job (subjective), but sometimes it's more efficient to work with a tool you already know; a NodeJS developer without Groovy knowledge would presumably find this easier to write in Node than in Groovy/Bash/Perl/Python.
Disclaimer: I decided to keep pushing writes even when out.write() returns false to simplify the implementation, because I know each input file is only around 1 MB, which is safe to buffer. If the input size is unknown, writes within the same file may also need to be deferred until the stream drains (perhaps by transforming the items into an input stream).
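To sketch that last idea (a hypothetical illustration, not what I shipped; the file name and the Readable-based helper are made up): wrap the parsed items in a stream.Readable and let pipe() handle the backpressure, pausing the source whenever the write buffer fills.

var fs = require('fs');
var path = require('path');
var util = require('util');
var Readable = require('stream').Readable;

var folder = '/temp/json/';

// Hypothetical helper: expose the parsed items as a Readable so that
// pipe() can pause the source whenever the output buffer fills, instead
// of buffering a whole file's rows in memory.
function itemsToCsvStream(items) {
  var i = 0;
  return new Readable({
    read: function() {
      if (i >= items.length) {
        this.push(null); // signal end of data
      } else {
        var item = items[i++];
        this.push(util.format('%s,%s\n', item.id, item.title));
      }
    }
  });
}

// 'example.json' is a made-up file name for illustration.
var json = JSON.parse(fs.readFileSync(path.join(folder, 'example.json')));
itemsToCsvStream(json.item)
  .pipe(fs.createWriteStream(path.join(folder, 'example.csv')));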