DS Job ending before the commit has really returned

Started by eknight, 12 Mar 2013 03:37:50 AM

eknight

Hi there,

Does anyone (MFGF) know if it is possible that a DS job ends before the last commit to the db has ended?

We're using DS 7.1.778.0 (ya, I know :-\ )

My situation is that I have a problem that seems to occur only when the system is under an unusually large workload. In these cases it looks as if rows that should have been inserted by job A are not found by job B. Job A is defined in TWS as a predecessor of job B, and there is definitely a ~30 sec gap between when job A ends and when job B begins. However, the simplest explanation for the problem is that job B did not have access to a newly inserted row at the time its select started.

The strange thing is that job B is using the "WITH UR" option in DB2 and so should have had access to the rows even if they are only in the bufferpool.

The best theory we have at the moment is that the DS job (which has a commit count of 100,000) ends before the last batch of rows has actually been written to the bufferpool. Considering the performance problems we had at that time, it's possible that DB2 took >30 seconds to write to the bufferpool.

My question is whether there was ever a bug in DS that caused jobs to end before a child process responsible for the DB2 communication had truly ended.

Thanks,
Erik

MFGF

I can't see how a jobstream could end while any of its children are still active. You can probably monitor the processes on the server to see if the rundsjob.exe process disappears before the databuild.exe process. I'd be very surprised if you find this, though.
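A rough way to do that check without watching the process list by hand is to sample it while job A runs and look for a snapshot where databuild.exe is still present after rundsjob.exe has gone. This is only a sketch: it assumes a Windows server where `tasklist /FO CSV /NH` is available, and the helper names (`running_images`, `parent_exited_first`, `sample`) are made up for illustration. It matches on image name only, so it cannot tell concurrent jobs apart.

```python
import csv
import io
import subprocess
import time


def running_images(tasklist_csv):
    """Parse `tasklist /FO CSV /NH` output into a set of lower-cased image names."""
    names = set()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            names.add(row[0].lower())
    return names


def parent_exited_first(snapshots, parent="rundsjob.exe", child="databuild.exe"):
    """Return True if, after the parent was first seen, some later snapshot
    shows the child still running while the parent is gone.

    `snapshots` is a time-ordered list of sets of image names.
    """
    seen_parent = False
    for names in snapshots:
        if parent in names:
            seen_parent = True
        elif seen_parent and child in names:
            return True
    return False


def sample(duration_s=60.0, interval_s=0.5):
    """Collect process-name snapshots for `duration_s` seconds (Windows only)."""
    snapshots = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        out = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        ).stdout
        snapshots.append(running_images(out))
        time.sleep(interval_s)
    return snapshots
```

Calling `parent_exited_first(sample())` around the end of job A would then flag the suspicious ordering, with the caveat that a sampling interval of half a second can miss very short gaps.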

It's quite possible that the database commit is taking longer than the gap between the two jobstreams, though. I would have imagined that "Read Uncommitted" isolation level would have allowed the second jobstream to see the rows regardless, though. It's a puzzle to be sure! :)

You might try adding a procedure node to the end of your first jobstream with a Delay(60) expression to add a 60 second delay - it might give your commit time to happen before the next jobstream launches?
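If a fixed delay feels too fragile under heavy load, another option is to hold job B back until the expected rows are actually visible, with a timeout so it cannot wait forever. The following is a minimal sketch of that idea and not a DS or TWS feature: `wait_for_rows` is a hypothetical helper, and `count_rows` is a callable you would have to supply yourself (for example, something that counts the rows job A was supposed to insert).

```python
import time


def wait_for_rows(count_rows, expected, timeout_s=300.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `count_rows()` until it reports at least `expected` rows.

    Returns True as soon as the rows are visible, or False if `timeout_s`
    elapses first. `sleep` and `clock` are injectable so the loop can be
    exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if count_rows() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A wrapper script could call this between the two jobs and exit non-zero on `False`, so that the scheduler sees a failure instead of letting job B run against incomplete data.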

Cheers!

MF.
Meep!

eknight

Quote from: MFGF on 14 Mar 2013 12:33:30 PM
I can't see how a jobstream could end while any of its children are still active.

Yeah I can't really picture this either but it's currently my best theory.

Quote from: MFGF on 14 Mar 2013 12:33:30 PM
You can probably monitor the processes on the server to see if the rundsjob.exe process disappears before the databuild.exe process.

This is a good idea, I've been trying to figure out something like that. The problem is the effect seems to only occur when the system is under an extreme workload. So I haven't been able to really reproduce the bug in a testing environment. But it's definitely there in our Production data.

Quote from: MFGF on 14 Mar 2013 12:33:30 PM
It's quite possible that the database commit is taking longer than the gap between the two jobstreams, though. I would have imagined that "Read Uncommitted" isolation level would have allowed the second jobstream to see the rows regardless,

This is the weirdest part. The commit shouldn't even matter. The only possibility is that the data isn't even in the bufferpool when the second job starts (30 seconds after the first job has officially ended).

We're also only on Fixpack 2 for DB2 9.7. Any chance that you're aware of a DB2 bug from back then that could explain this?

Another theory of mine is that, in the ~30 seconds between the jobs, DB2 makes the bufferpool page inaccessible before it has finished writing it to the tablespace. If job B starts exactly in that window, it may be missing the updates from that one bufferpool page. This seems like it would be a pretty major bug in DB2 though.

MFGF

I'm afraid the most I know about DB2 is how to spell it :) I'd be very interested to hear what you manage to discover, though.

Good luck!!

MF.
Meep!

eknight

Yeah for sure. If I find anything I'll post it here.

So far I've pretty much confirmed that it can't be a bug in DB2. Luckily I've got access to one of my country's top DB2 experts. He walked me through what happens in DB2 internally, and it doesn't seem possible that anything is going wrong there.

It's possible that the problem is in Tivoli though...