
The performance of Data Manager

Started by bluetear, 15 Dec 2011 08:46:16 PM


bluetear

Hello everyone,

I have installed Data Manager on a 32-bit Windows server, and the database (Oracle) is on a Linux server. When I run a fact build, Data Manager uses less than 100 MB of memory and the Linux load average stays low, around 1; if I run two builds the load goes above 1. The build processes more slowly than I expect, so my question is how to improve the processing speed of the build.

These are the areas I think could be involved:

1. Tuning the SQL (probably the main factor).
2. The memory settings in the fact build properties. All the columns come straight from the data source and the memory settings are at their defaults, but I cannot tell how much memory the build actually needs or what its peak is, so I don't know how to set the memory properties. Can someone help me with this one?
3. Network transmission. We are on a local area network, so the bandwidth is limited and there is probably not much I can improve here.
4. I/O on the Linux server. It is a PC server and I cannot change the hardware, but I can try to make the best use of the I/O that is available.
5. The database itself.

For points 1, 3, 4 and 5 I can work through them in my own way, but I would be grateful for any suggestions on the memory question in point 2. If you have any other ideas or advice, please post them.

Thanks in advance to anyone who replies.
bluetear

MFGF

You need to figure out where the performance bottleneck is occurring. First of all, turn on detail logging for the build next time you run it, and see how long the SQL query is taking before it starts returning rows. You could also try pasting the SQL into SQLTerm and running it from there to see how long it takes to respond.
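If you have SQL*Plus access on the Oracle server, another rough check is to time the source query there. The snippet below is only a sketch: the table name is a placeholder for your build's actual source SQL, and wrapping it in COUNT(*) simply avoids scrolling a million-plus rows to the screen (the optimizer may do a little less work for a COUNT, so treat the timing as a lower bound):

SET TIMING ON
-- placeholder: paste the build's real source query inside the brackets
SELECT COUNT(*)
FROM  (SELECT * FROM TEST.T_TEST_SOURCE);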

To see how much memory the build is using, turn on internal logging before you run. Unless you have merging or aggregation enabled, the memory footprint should be very low, since the build will be extracting, processing and delivering only a single row of data at a time, so unless each row is massive the memory required to hold and process a row should be negligible. You can post extracts from the log if you don't understand how to figure out the memory stats.

You should bear in mind that your build runs as a single process, so the OS will probably allocate it to a single core/single CPU. You can monitor how your CPU is handling things from within Linux to see whether CPU is a bottleneck.

What refresh type is set in your build? Are you updating the target table? Are there any indexes in place if you are inserting rows?

Regards,

MF.
Meep!

bluetear

Firstly, thank you MF. for answering my question and sharing your knowledge.
Tuning the SQL is certainly an important part of improving the performance, and I am already working on that.

If I just extract from the data source and deliver, with no dimension references and no transformations, the memory settings at their defaults and duplicate key handling set to allow records with duplicate keys, the result is as below:

[INTERNAL     - 09:18:59] Mem(M): 0.0 [Peak=0.0]  (Ref=0.0 Domain=0.0 Pool=0.0)
[PROGRESS   - 09:18:59] Delivery 'T_TEST_FT': 1470478 insert(s), 0 update(s) to 'TEST.T_TEST_FT'
[DETAIL         - 09:18:59] Data Stream row limits: sample rate 1, max rows 0
[DETAIL         - 09:18:59] Data Source 'T_TEST_FT': 1470478 physical ( 1470478 logical) rows read, 1470478 used
[PROGRESS   - 09:18:59] Acquisition: 1470478 accepted, 0 rejected
[PROGRESS   - 09:18:59] Delivery   : 1470478 direct, 0 summary/merge, 1470478 total
[PROGRESS   - 09:18:59] Done - 0 00:02:39 elapsed databuild -- completed (21-Dec-2011 09:18:59)

In the log I find that Mem(M) is always 0. If the build has a dimension reference, then Mem(M) changes:

[INTERNAL     - 09:03:00] Mem(M): 0.1 [Peak=0.1] (Ref=0.1 Domain=0.0 Pool=0.0)
[PROGRESS   - 09:03:00] Delivery 'T_TEST_FT': 1470478 insert(s), 0 update(s) to 'TEST.T_TEST_FT'
[INTERNAL   - 09:03:00] Dimension 'SIZE' - actual domain size 1 (peak 1)

Maybe my design is poor and not standardized. In the above log the memory never reaches the limit, the OS load is low and the CPUs are idle. The refresh type is set to truncate, there are no indexes, and all other settings are at their defaults. I cannot find any other message about memory usage in the log.

As for the refresh type, I have tested that update and update/insert are slow, because they update one row at a time, while append and truncate are fast.

As for the dimension reference: the design of the data source is bad and cannot be changed, so I have to do the dimension reference in the query SQL, and some transformations are done there too. Even then I could not find any memory change in the log; it always shows [INTERNAL - 09:18:58] Mem(M): 0.0 [Peak=0.0] (Ref=0.0 Domain=0.0 Pool=0.0). My guess is that because Data Manager can execute several jobs at the same time, it allocates each job only as much resource as any other job, whereas in the database every query (select or insert) consumes as much as it needs. So the query-process-insert path through Data Manager is worse than a query-insert done directly in the database, something like the sketch below.
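Just to illustrate what I mean by the direct database-side path (the table and column names here are only placeholders, not my real schema):

-- rough sketch of doing the lookup and load entirely inside Oracle
INSERT INTO TEST.T_TEST_FT (size_key, amount)
SELECT d.size_key,         -- dimension lookup handled in the join
       f.amount
FROM   TEST.T_SOURCE_FACT f
JOIN   TEST.T_DIM_SIZE    d ON d.size_code = f.size_code;
COMMIT;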
In the end my conclusion is that the resources of the databuild process are fixed, so we cannot tune the databuild settings and can only tune the SQL. The memory settings (in my case, with the dimension reference done in the query SQL) are still a mystery to me.
Is my guess right? Thanks for sharing your ideas.
bluetear

MFGF

It looks like with your initial "data transfer" build, the size of a row of data is not even big enough to register in the memory logging at all - less than 1k. This is quite normal.

When you bring in a dimension, the build is reading and caching the dimension members at the start of the build - you can see this in the "Ref" memory stats. It looks like a very small dimension - about 1k in total - so again this is negligible. Memory is not an issue at all in this build, and we can rule out memory paging and memory breaking as a cause of the poor performance. The time when memory becomes important is when you enable either merging or rejection of duplicate rows, and/or aggregation in your build. The memory model then changes from read a row - process a row - deliver a row to read all rows into memory - process all rows - deliver all rows at the end. You will start to see information on the hash table and the hit rate, and any resizing that occurs, plus memory breaking if you enabled it.

I would expect update or update/insert refresh types to be slower than truncate and append - purely because the build is attempting to update existing rows, which need to be located in order to be updated. If performance is markedly bad, you could try implementing a composite index based on the dimensional keys, which might speed up the update process.
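For example, something along these lines - bearing in mind the key column names are only placeholders for whatever your dimensional keys actually are:

-- hypothetical composite index on the dimensional key columns of the target
CREATE INDEX TEST.IX_T_TEST_FT_KEYS
    ON TEST.T_TEST_FT (size_key, date_key);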

Tuning the SQL is obviously important, but depending how the build is configured and what you are trying to do, it may not be the biggest influence on performance.

Cheers!

MF.
Meep!

bluetear

So the memory settings on the fact build only come into play when there is a dimension reference?
Without any dimension references, the log just shows:
Mem(M): 0.0 [Peak=0.0]  (Ref=0.0 Domain=0.0 Pool=0.0)

thanks
bluetear

MFGF

For a build which does not merge or reject duplicate rows on input, which also does not have aggregation enabled, and which does not use any dimension references, the build will read a row of data from the source system into memory, process the row, deliver it to the target system, then read the next row. If each row is (say) 30 bytes, the build will need only 30 bytes of memory to process the rows, and this is way below the 0.1 Megabytes required to make the Pool and therefore the Mem(M) and Peak values register anything other than 0.0. If your rows are 1,000,000 bytes in size then you would see Mem(M): 1.0 [Peak=1.0]  (Ref=0.0 Domain=0.0 Pool=1.0).

If you enable (say) merging of duplicate rows on input, the build will construct a hash table in memory and start assembling and hashing the rows of data in the pool. Even with a row size of 30 bytes, you would probably start seeing increases in the Pool, Peak and Mem figures reported as the rows are assembled in memory during the build.
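For instance (and this is only an illustration with an assumed row size), if each assembled row took roughly 100 bytes, holding all 1,470,478 of your rows in the pool would need around 1,470,478 x 100 bytes, or about 140 MB, and you would see the Pool, Peak and Mem(M) figures climb towards that during the build.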

All of the above is with no dimension references.

If you also bring in dimension references, these are read and cached in memory at the start of the build. The Ref figure will tell you how much memory was needed for this initial caching. The Domain figure will show how much memory was used to keep track of which cached members were used in validation by the build.

If you want to understand more about how Data Manager uses memory, take a look at the Data Manager User Guide:

http://public.dhe.ibm.com/software/data/cognos/documentation/docs/en/10.1.1/ug_ds.pdf

Start at Chapter 29 - Managing memory.

Cheers!

MF.
Meep!

bluetear

Thank you, MF.
I think I have found the explanation for the memory figures: it is down to the join in the query SQL. Although the SQL is a join statement (a big table joined to a small table), Data Manager still fetches one row at a time, then processes and delivers it. Each row is well below the 0.1 megabytes the log can register, so Mem(M) shows 0.0, as you posted.
Is that right?




MFGF

Hi,

It isn't a problem - it is the way Data Manager is designed to work. Your SQL query, no matter how simple or complex, will return rows of data for use in the build. As each row is read, it is processed in memory and delivered to the target before the next row is read, so the memory footprint of the build is small - only the size of a single row of data (more or less). If your row size is less than 100k bytes (the smallest measurement of memory the Internal logging feature will register), the memory utilization will show in the log as 0.0.  This is normal and to be expected.

As I mentioned before, this behaviour changes if you enable merging or rejection of duplicate rows on input, and/or if you enable aggregation in one or more of your dimensions. If you do this, the build will read and hash all rows of data in memory, then deliver them all at the end of the build. In this situation you would see much higher memory figures reported in the log file.

Best regards,

MF.
Meep!