Time-series and correlation strategies


Time-series and correlation strategies



I have various time-series I'd like to correlate and present as either a csv-file or in-memory datatable (.NET). These time-series are arrays of time-value-pairs (actually these are objects containing more than just time and value). The time-series may span across different overlapping periods, and some might even have holes (missing values for given timestamps).

For those interested, I'm using the OPC HDA .NET library to extract historic time-series from an OPC HDA server.

The resulting datatable should have one column for each time-series all in chronological order based on a timestamp column. See example below:

|-------|-------|-------|-------|-------|    TIME    TS1     TS2     TS3     TS4 |-------|-------|-------|-------|-------|     1       X               X       X |-------|-------|-------|-------|-------|     2       X       X       X       X |-------|-------|-------|-------|-------|     3       X       X               X |-------|-------|-------|-------|-------|     4       X       X       X  |-------|-------|-------|-------|-------|     5       X       X       X  |-------|-------|-------|-------|-------| 

What would be the most effective way of achieving this? With "effective" I mean with the least amount of code. But considering that the timeseries could become quite large, memory usage might also be an issue.




How would you remove elements of a std::vector based on some property of the elements?

1:



How can I implement an algorithm for multicore in Java?
You might go with a data structure like a nested dictionary and iterate over the contents:.
Securing AJAX Requests via GUID
Dictionary <TimeSeries, Dictionary<DateTime, Value>> dict = new Dictionary<TimeSeries, Dictionary<DateTime, Value>>();  foreach (TimeSeries series in dict.Keys) {      //table row output code goes here     Dictionary<DateTime, Value> innerDict = dict[series];     foreach (DateTime date in innerDict.Keys) {         Value seriesValueAtTimeT = innerDict[date];         //table column output code goes here     } } 
Where your output code is writing out to something else, depending on your needs, and you replace the datatypes TimeSeries, Value, etc., with your actual data types..
Factorial in C without conditionals, loops and arithmetic operators


Is Multiplying the Inverse Better or Worse?


Longest contiguous subsequence algorithm

2:



Normal Distribution function
You can first scan all present series for the distinct values (for example, aggregating them in a HashSet), then simply dump them into an array of dates (storing a match between date and index position in a dictionary).


I am looking for a content addressing data structure
.
var distinctDates = allSeries   .SelectMany(s => s.Values.Select(v => v.Date))   .Distinct()   .OrderBy(d => d)   .ToArray();  var datePositions = distinctDates   .Select((d,index) => new      {       Date = d,       Index = index     }). 

.ToDictionary(x => x.Date, x => x.Index);
Then, create a jagged array that has width of "NumberOfSeries" and length of "NumberOfDates".

After that, do a second scan of all the data and dump them to their positions..
var values = new float[allSeries.Length][]; for (var i=0;i<allSeries.Length;i++) {   values[i] = new float[distinctDates.Length];   var currentSerie = allSeries[i];   foreach(var value in currentSerie.Values)   {     var index = datePositions[value.Date];     values[i][index] = value.Value;   }       } 
I wrote this code without touching VisualStudio, so I may have a few typos.

Or there may be used a few LINQ methods that are not present in the .NET (just look in Lokad.Shared.dll).

But you should be able to get the idea.. Some more notes, while I'm at the topic:.
  1. Go for the jagged array, if you have to keep everything in the memory at once.

    It is way more efficient than a dictionary and has a lot less memory problems than a rectangular array..
  2. Keep Value objects as small as possible (i.e.: float instead of double)..
  3. If number of time serie values is expected to go large in the future, then never store values in database in a "one row per value".

    It is recommended to either go for something like HDF (which has a .NET interface) or use persist time serie fragments in binary form in DB (as in time serie databases).
Sticking to these should allow you to scale up to hundreds of millions of time values without a lot of problems (done that)..


3:


Hey Chris.

I realize you've already accepted an answer, but thought I'd post a solution I use.

Hopefully it's of some use to someone.

If not at least it provides a place for me to find it sometime in the future.

:-). This is VBA code fired off directly from an Excel 2007 code module.

It could be easily converted to .Net.

. The key for the data manipulation is the pivot table object.

I've found it to be massively efficient at getting the data into the layout you've specified..
Sub GetIndexData () Dim cn as ADODB.Connection, cmd As ADODB.Command, rs As ADODB.Recordset Dim rPivotTopLeft As Range, rPivotBottomRight As Range  Application.ScreenUpdating = False Application.Calculation = xlCalculationManual  'Get the data.' Set cn = New ADODB.Connection With cn   .Provider = "SQLOLEDB"   .ConnectionString = "Database=" & mDBName & ";" & _                       "Server=" & mDBServerName & ";" & _                       "UID=" & mDBUserID & ";" & _                       "Password=" & mDBPassword & ";" & _                       "Persist Security Info=True;"   .CursorLocation = adUseClient   .Open End With  Set cmd = New ADODB.Command Set rs = New ADODB.Recordset With cmd   .ActiveConnection = adoTools.DBConnection   .CommandType = adCmdText   .CommandText = "SELECT YourData From YourSource WHERE YourCritera"   Set rs = .Execute End With    If Not (rs.BOF And rs.EOF) Then 'Check that we have some data.'  'Put the data into a worksheet.' With wsRawData   .Cells.CurrentRegion.Clear    Set rPivotTopLeft = .Range("A1")   With ThisWorkbook.PivotCaches.Add(SourceType:=xlExternal)     Set .Recordset = rs     .CreatePivotTable _         TableDestination:=rPivotTopLeft, _         TableName:="MyPivotTable"   End With    'Massage the data into the desired layout.'   With .PivotTables("MyPivotTable")     .ManualUpdate = True      .PivotFields("Date").Orientation = xlRowField     .PivotFields("Index").Orientation = xlColumnField     .AddDataField .PivotFields("Return"), "Returns", xlSum      .DisplayFieldCaptions = False     .ColumnGrand = False     .RowGrand = False      .ManualUpdate = False   End With    mMonthCount = Range(.Range("A3"), .Cells(Rows.Count, "A").End(xlUp)).Count   mIndexCount = Range(.Range("B2"), .Cells(2, Columns.Count).End(xlToLeft)).Count    'Convert pivot table to values.'   Set rPivotBottomRight = .Cells(mMonthCount + 2, mIndexCount + 1)   With .Range(rPivotTopLeft, rPivotBottomRight)     .Copy     .PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False   End With    'Format the worksheet.'   .Range("A3").Resize(mMonthCount, 1).NumberFormat = "mmm-yy"   .Range("B3").Resize(mMonthCount, mIndexCount).NumberFormat = "0.00%"   Union(.Rows(2), .Columns(1)).Font.Bold = True   .Cells.ColumnWidth = 7.14   .Rows(1).Delete  End With   rs.close Set rs = Nothing cmd.ActiveConnection = Nothing Set cmd = Nothing cn.close Set cn = Nothing  End Sub 
From there its relatively easy to leverage the built in excel regression statistics to output a correlation matrix.

With this technique I'm producing a worksheet with a 600x600 matrix of correlations in about 45 seconds.. Note that the .PivotFields parameters should be changed to fit the column names of your data from your data source..


4:


One thing you can do if memory usage becomes a problem is to move from tracking single events to tabulated counts of an event within a given time period.

You lose some precision about exactly when things happened, but often summarizing your data like this can reduce the complexity of your picture and make trends more obvious.. In case it's not obvious, I mean taking data that looks like this:.
12:00 event1 12:01 event2 12:10 event1 12:11 event1 
to this:.
12:00-12:15 event1 3 12:00-12:15 event2 1 



85 out of 100 based on 75 user ratings 1225 reviews