Does Data flow pick up records inserted in the source after DF is started?
To be more precise: say that at the point in time the DF is started the source has 10 records, and during the DF execution 5 more records are inserted into the source. Will those 5 be picked up in the same DF execution?
We tried to simulate this and found that it does not pick them up, but we are not sure we simulated it correctly, so we wanted to know how the DF works internally and how it will behave in this scenario.
The reason I am concerned about this is that in our project we are using a DF on a source where records can be inserted concurrently through another channel, so if the DF picks up the new records it will take considerably longer to finish.
As I mentioned in my post, I tried simulating the same thing you described: when I started the DF there were 4 committed records and 4 uncommitted records; in the middle of the DF execution I committed the remaining 4 and found that the DF did not pick them up.
However, I intend to retry the experiment after reducing the batch size to 2 (by default it is set to 1000). It is possible that when the DF started it found the record count to be 4, and with a batch size of 1000 it fetched all 4 in the first retrieve itself, so it never needed to do a second fetch.
If I decrease the batch size to 2, it will have to do more than one fetch; I would like to see how it behaves then.
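The two outcomes the experiment is probing can be sketched in a toy model (this is an illustration, not the engine's actual internals): a batched reader that captures the record count once at the start will miss rows committed mid-run regardless of batch size, while a reader that keeps fetching until an empty batch can pick them up.

```python
# Toy model of two batched-read strategies. The table is a plain list;
# `late_rows` simulates another channel committing rows mid-execution.

def read_snapshot(table, batch_size, late_rows):
    """Stop at the record count captured when the run started."""
    total_at_start = len(table)          # count is fixed up front
    processed, offset = [], 0
    while offset < total_at_start:
        processed.extend(table[offset:offset + batch_size])
        offset = min(offset + batch_size, total_at_start)
        if late_rows:                    # mid-run commit by another channel
            table.extend(late_rows)
            late_rows = None
    return processed

def read_until_empty(table, batch_size, late_rows):
    """Keep fetching batches until one comes back empty."""
    processed, offset = [], 0
    while True:
        batch = table[offset:offset + batch_size]
        if not batch:
            break
        processed.extend(batch)
        offset += len(batch)
        if late_rows:                    # mid-run commit by another channel
            table.extend(late_rows)
            late_rows = None
    return processed

rows, late = [1, 2, 3, 4], [5, 6, 7, 8]
print(read_snapshot(list(rows), 2, list(late)))     # [1, 2, 3, 4]
print(read_until_empty(list(rows), 2, list(late)))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

If the engine behaves like `read_snapshot`, the batch size will not change the outcome; if it behaves like `read_until_empty`, a smaller batch size gives the mid-run commit more chances to land before the final fetch.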
Posted: 13 Jun 2018 4:58 EDT
Henrique Pereira (pereh)
Manager, Software Engineering
I'd be careful about basing your solution on whether the DF will pick up those records or not; there are too many factors that can influence whether they get processed. From the comments here I'm guessing you have a DB source, and the batch size is indeed one of the conditions that influences whether a new record is processed.
If all records have been read, the DB source will signal that it is done, and records inserted after that point will not be processed.
There are more factors that could influence that:
If your source class has keys, the records are processed in key order, meaning that if you insert a new record whose key comes before the record currently being processed, the new record will not be processed; if its key comes after, it will.
If you have a partitioned source and you insert a record into a partition that did not exist before, it will never be processed.
And that is only for DB sources; other source types can behave differently depending on their nature.
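The keyed-ordering factor above can be made concrete with a small sketch (again a toy model, not the actual engine): the reader walks the keys in sorted order, and a row committed mid-run is only seen if its key sorts after the reader's current position.

```python
# Toy model of an ordered keyed scan. `inserts_after_key` maps a key to
# rows that another channel commits right after that key is processed.

def read_in_key_order(rows, inserts_after_key):
    keys = sorted(rows)
    processed = []
    i = 0
    while i < len(keys):
        key = keys[i]
        processed.append(key)
        # Mid-run commits: the re-queried key list now contains them.
        for new_key in inserts_after_key.pop(key, []):
            keys = sorted(keys + [new_key])
        i = keys.index(key) + 1   # resume just after the current key
    return processed

# Cursor is at key 20 when keys 5 and 30 are committed: key 5 sorts
# behind the cursor and is missed; key 30 sorts ahead and is processed.
print(read_in_key_order([10, 20, 40], {20: [5, 30]}))  # [10, 20, 30, 40]
```

The same intuition covers the partition case: if the reader enumerated the partitions once at startup, a record landing in a partition created afterwards is simply outside everything it will ever scan.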