Hi,
I'm using VO 2.8 SP4b to write a tool that opens a comma-separated file, analyzes its structure, and then reads chunks of data from it. The data chunks are all 1024 rows long and 3 fields wide (10N0, 10N0, C3). The chunk rows are stored in a bArrayServer.
After the CSV file is imported, the bArrayServer is 3 fields wide and a multiple of 1024 records long.
A typical length would be 16 data chunks, i.e. a bArrayServer with 16*1024 rows.
Below is some core code that imports the CSV data.
What I observe is the following:
The CSV file is opened.
The first few data chunks are imported rather quickly.
The following data chunks are read in ever more slowly.
The import finishes and takes just a few seconds (2-3).
When another, new CSV file with 16 data chunks (or fewer) is opened, the import is almost instantaneous (1 sec.). If the CSV is larger, the first 16 data chunks are imported quickly, but the following ones ever more slowly.
In a more extreme case (160 data chunks) the import takes almost 3 minutes, which is far more than the expected 10x a few seconds.
Why is that? Is it a memory issue? How could I improve the import speed for the first import?
TIA
Jack
SELF:ba_spectra:Zap()
//(10N0, 10N0, C3)
SELF:o_csv:GoTop()
n_spec := 0
DO WHILE !SELF:o_csv:EoF
	s_ln := SELF:o_csv:ReadLn()
	IF 'Spectrum:,' $ Left(s_ln, 10)
		//new data chunk found
		n_spec++
		//all spectra are 1024 rows long
		FOR i := 1 TO 1024
			GetAppObject():Exec(EXECWHILEEVENT)
			SELF:ba_spectra:Append()
			s_ln := SELF:o_csv:ReadLn()
			a_res := ParseCSVRecord(s_ln, '"', ',')
			//returns an array with the fields
			//every record is 5 fields wide; we only use 2
			SELF:ba_spectra:FIELDPUT(1, Val(a_res[3]))
			SELF:ba_spectra:FIELDPUT(2, Val(a_res[5]))
			//also store the spectrum number
			SELF:ba_spectra:FIELDPUT(3, NTrim(n_spec))
		NEXT i
	ENDIF
ENDDO
Speed import csv data
Re: Speed import csv data
Hi Jack,
you could preallocate some data and, instead of filling through the ArrayServer, fill the array directly, and then cut off the elements that are not needed.
Wolfgang
Wolfgang Riedmann
Meran, South Tyrol, Italy
wolfgang@riedmann.it
https://www.riedmann.it - https://docs.xsharp.it
- ArneOrtlinghaus
Re: Speed import csv data
There are surely too many dynamic data objects in memory (strings, arrays with strings or objects), so the garbage collector is kept too busy.
If you need all the data, you will have to assign higher values with SetMaxDynSize() and DynSize() at program start.
If you do not have to show the data in a browser, you should consider whether an ArrayServer is necessary at all. When importing/processing big files it is normally better to work with chunks of data and free all processed data.
Arne
(Fortunately, many tricky workarounds regarding the garbage collector have vanished with the move to .NET. We still have the 32-bit limit and will change our applications to 64-bit, but as long as we stay below 2 or 3 GB of dynamic memory we are normally OK.)
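A minimal sketch of Arne's suggestion, placed at the top of the application's Start() function. The sizes below are illustrative assumptions, not tuned values:

```
FUNCTION Start()
	// Enlarge VO's dynamic memory pool before any import runs, so the
	// garbage collector does not have to grow and compact it constantly.
	// Both sizes are illustrative assumptions, not tuned values.
	SetMaxDynSize(64 * 1024 * 1024)	// allow the pool to grow to 64 MB
	DynSize(32 * 1024 * 1024)	// preallocate 32 MB right away
	// ... the rest of Start() follows unchanged ...
```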
Re: Speed import csv data
It could be the FieldPut operations on the ba_spectra server object. So, use SuspendNotification() to suspend the broadcasting of Notify messages to the server's attached clients.
This should help a lot with speed.
Sample:
SELF:ba_spectra:SuspendNotification()
DO WHILE !SELF:o_csv:EoF
	// your code
ENDDO
SELF:ba_spectra:ResetNotification()
HTH,
Jamal
Re: Speed import csv data
Thanks for all your replies.
Based on your replies I tried to improve a few things.
As Jamal suggested, I tried SuspendNotification(), but that didn't do anything in my situation, which seems logical as the bArrayServer doesn't have any clients.
Next I tried increasing the memory size. I added "DynSize(DynInfoSize() * 2)" (from the VO help) in Start() (is this the right place?). This didn't seem to do anything either, but I left it in the code.
Next I rewrote the code, got rid of the bArrayServer, and used a simple array. This clearly improved things.
I'm essentially dealing with 2 types of files: one contains 16 data chunks, the other 160. The read speed of the 16-chunk version is now tolerable, just a few seconds.
The 160-chunk version, not so much. Below is some benchmarking I did. Every 16 chunks (represented by a comma) is followed by DynInfoFree() and the time it took to read those 16 chunks.
The first 16 chunks are read in about a second, but it takes increasingly more time to read the same amount of data, as if VO needs ever more time to organize its memory management.
On a side note: does that DynInfoFree() size seem correct?
, , , , , , , , , , , , , , , >> 15620688 >> 1.10
, , , , , , , , , , , , , , , >> 14491308 >> 2.75
, , , , , , , , , , , , , , , >> 13377272 >> 4.36
, , , , , , , , , , , , , , , >> 12241768 >> 6.04
, , , , , , , , , , , , , , , >> 11108508 >> 7.77
, , , , , , , , , , , , , , , >> 10005216 >> 9.14
, , , , , , , , , , , , , , , >> 8871676 >> 10.55
, , , , , , , , , , , , , , , >> 7737556 >> 13.06
, , , , , , , , , , , , , , , >> 6622684 >> 14.31
Total read time is about 90 sec.
A further optimization was to create the array beforehand: instead of using AAdd() to append every row of data, I used ArrayNew() to create the whole array up front and assigned the individual rows. This improved the total load time to about 47 seconds (let's call this a success). Moreover, it now takes about the same time to read every chunk.
, , , , , , , , , , , , , , , >> 5505132 >> 4.18
, , , , , , , , , , , , , , , >> 5536940 >> 4.28
, , , , , , , , , , , , , , , >> 5512360 >> 4.22
, , , , , , , , , , , , , , , >> 5521860 >> 4.28
, , , , , , , , , , , , , , , >> 5522652 >> 4.25
, , , , , , , , , , , , , , , >> 5516696 >> 4.26
, , , , , , , , , , , , , , , >> 5532836 >> 4.28
, , , , , , , , , , , , , , , >> 5540328 >> 4.35
, , , , , , , , , , , , , , , >> 5514852 >> 4.26
Are there any other "tricks"/improvements I could try?
It would be nice to maintain the 16-chunk speed of about a second.
Jack
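For reference, the preallocation Jack describes might look roughly like this (n_chunks, a_data and n_row are hypothetical names; it assumes the chunk count is known up front, or over-estimated and trimmed afterwards):

```
// create the whole result array up front instead of growing it with AAdd()
n_rows := n_chunks * 1024
a_data := ArrayNew(n_rows, 3)	// all rows allocated once
n_row  := 0
// ... inside the per-record loop ...
n_row++
a_data[n_row][1] := Val(a_res[3])
a_data[n_row][2] := Val(a_res[5])
a_data[n_row][3] := n_spec
// ... after the loop, trim any unused rows ...
ASize(a_data, n_row)
```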
Re: Speed import csv data
Jack,
why use the $ (substring) operator here?
IF 'Spectrum:,' $ Left(s_ln, 10)
A direct comparison is cheaper:
IF Left(s_ln, 10) == 'Spectrum:,'
Please show your code for ParseCSVRecord().
You should create your array only once with:
a_res := ArrayCreate(5)
and fill it inside your function ParseCSVRecord():
ParseCSVRecord(a_res, s_ln, '"', ',')
There is no need to create a new array every time.
Does your bArrayServer have an index?
Gerhard
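Gerhard's suggestion would change the calling code roughly like this (the ParseCSVRecord() variant that fills a caller-supplied array is hypothetical; the function itself would need to be adapted to match):

```
// create the scratch array once, outside the loop
a_res := ArrayCreate(5)
DO WHILE !SELF:o_csv:EoF
	s_ln := SELF:o_csv:ReadLn()
	// fill the existing array instead of allocating a new one per line
	ParseCSVRecord(a_res, s_ln, '"', ',')
	// ... use a_res[3] and a_res[5] as before ...
ENDDO
```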
Re: Speed import csv data
Hi Jack,
One thing that will definitely help is not calling Exec(EXECWHILEEVENT) in every single iteration; this is overkill. Instead, call it just once per pass of the DO WHILE loop, outside of the FOR. I'm not sure how much it will help, but if the speed is still not good enough, then, as Gerhard said, please post the complete (compilable) code, so we can try it on our machines and do some profiling to see exactly where most of the time is lost.
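Applied to the import loop from the first post, this suggestion would look roughly like this (only the position of the Exec() call changes):

```
DO WHILE !SELF:o_csv:EoF
	// process pending GUI events once per chunk instead of once per row
	GetAppObject():Exec(EXECWHILEEVENT)
	s_ln := SELF:o_csv:ReadLn()
	IF 'Spectrum:,' $ Left(s_ln, 10)
		n_spec++
		FOR i := 1 TO 1024
			// ... read and store one row, with no Exec() call here ...
		NEXT i
	ENDIF
ENDDO
```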
Chris Pyrgas
XSharp Development Team
chris(at)xsharp.eu
Re: Speed import csv data
I have done some CSV reading myself and found a few things when using the Append methods:
Suspend/Reset notification works.
Load the lot first, then process.
A local data server (e.g. DBFCDX) works quicker than an array server.
If the file is not huge, read the lot into a string and process the string.