SwedAPL 2016 Copenhagen Roundup
It was a short hop down from the Turning Torso in Malmö and across the Öresund Bridge to Copenhagen for SwedAPL 2016, held on 1st April and hosted by SimCorp A/S.
SwedAPL as an event is burgeoning, and April 2016 was the largest meeting yet. Around thirty people attended in person and another ten joined the online event streamed via GoToMeeting, some from as far afield as North America.
The theme for this meeting was ‘Concurrency in APL’.
Presentation 1: Need for Speed
by Morten Kromberg, Dyalog
More and more analyses are being run on bigger and bigger data, against a background in which the increase in computer performance is slowing down. The rise in clock speeds made possible by ever smaller transistors is reaching a physical limit: transistor gates are starting to leak current even when not switching.
Morten contended that the route to getting faster lies in parallelism: instead of more transistors in the CPU, have more CPUs.
Two alternative hardware configurations were described: the (GP)GPU, with many cores executing a single instruction on multiple data, which is well suited to data parallelism; and Intel’s “Knights Landing”, with multiple miniaturised Pentium cores that can run legacy Intel code. The latter is more expensive at present.
Other alternatives are to use lots of boxes or to rent cloud-based machines (e.g. from Amazon) with lots of memory and cores.
With data parallelism, the data flows are computed and can be mapped to the GPU to maximise the use of cache. Task parallelism uses, for example, an array of namespaces which are passed arrays of data and execute the same or different functions, finishing at the same or different times. Task parallelism requires distribution and collection of data, and management of the processes for efficiency.
Moving data is the challenge: the cores may be parallel but memory is serial, so wait states become an issue, and transfers to remote machines are slower still. Finding a general solution is difficult, so Dyalog’s approach has been to help the user define where the parallelism takes place. Dyalog has taken a number of steps to aid performance and concurrency:
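The distinction can be sketched in a few lines of Python. This is an illustration of the two shapes only, not of anything shown in the talk: the function names are hypothetical, and CPython threads will not speed up CPU-bound work, so only the structure (one function over many elements versus distinct tasks that are distributed and then collected) is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def data_parallel(values, workers=4):
    """Data parallelism: the same function applied to every element."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))      # results keep input order

def task_parallel(tasks, workers=4):
    """Task parallelism: different functions on different data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, data) for fn, data in tasks]   # distribute
        return [f.result() for f in futures]                      # collect

print(data_parallel([1, 2, 3, 4]))
print(task_parallel([(sum, [1, 2, 3]), (max, [4, 9, 2]), (len, "abcde")]))
```

The second function makes the cost mentioned above visible: every task needs its data handed over and its result collected again.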
- Tuning the interpreter: for Dyalog 15.0 the C compiler has been upgraded on Windows, Linux and AIX. This will be the first release where the QA performance comparison shows improvement across the board.
- Co-dfns compiler: for dfns only; this is data parallelism. Still experimental, but demonstration benchmarks showed improvements of five or six times for large arrays. This work has been underway for about three years and is beginning to bear fruit.
- Optimised byte code execution: code is pre-parsed to determine the most efficient optimisations, which gives around a twofold speed-up on small arrays. There is potential for more sophisticated optimisations. Dyalog 15 has 83.3% coverage of optimised byte code.
- Futures and isolates: aimed at allowing the user to specify task parallelism. An isolate is essentially a namespace whose expressions run on another processor, in parallel with the main thread.
- APLProcess and RPCServer
- CONGA: provides asynchronous communication; transfer of APL arrays between Dyalog sessions; secure, encrypted communication; and integrated Windows authentication.
- Support for Amazon Elastic Compute Cloud (EC2) to run simulations with vecdb.
- Dyalog has written its own asynchronous applications: TryAPL; DFS, a partitioned file system with multiple users; and vecdb, a vector database written in Dyalog as an open source project (https://github.com/Dyalog/vecdb), which maps APL arrays one column to one file.
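The "one column to one file" idea in the last item can be illustrated with a minimal Python sketch. It is not vecdb's actual file format or API: the `.col` extension, the function names and the choice of 8-byte floats are all assumptions made for the example.

```python
import os
import tempfile
from array import array

def write_columns(folder, table):
    """Store each column of `table` (name -> list of floats) in its own file."""
    for name, values in table.items():
        with open(os.path.join(folder, name + ".col"), "wb") as f:
            array("d", values).tofile(f)

def read_column(folder, name):
    """Read back a single column without touching the other files."""
    path = os.path.join(folder, name + ".col")
    count = os.path.getsize(path) // array("d").itemsize
    column = array("d")
    with open(path, "rb") as f:
        column.fromfile(f, count)
    return list(column)

with tempfile.TemporaryDirectory() as folder:
    write_columns(folder, {"price": [1.5, 2.0], "qty": [3.0, 4.0]})
    print(read_column(folder, "qty"))
```

A query that needs only one column reads only that column's file, which is the usual attraction of a column-per-file layout.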
Presentation 2: Concurrency in SimCorp
By Stig Neilsen, SimCorp
A short presentation of SimCorp’s experience of experimenting with parallel language features. Load-balancing mechanisms are already in place to distribute heavy calculations across multiple machines, and few single-line expressions do heavy number crunching, so the focus was on tasks that were too small for the load-balancing mechanisms but still took too much time.
With a large legacy system, the approach was to implement ‘worker bees’ by attaching external workspaces to isolates. In this way large isolates are created with shared memory and a reduced footprint. However, the isolate is then a sub-namespace within the workspace; as a consequence, calls are complicated because a function in the external workspace lives in the parent of the isolate namespace.
A short demonstration showed that calls to initialise the isolates need to capture the result, both to avoid creating a future and to ensure that execution is complete, avoiding errors late in the process.
Once the isolates were initialised, a comparison of four instances of an expression executed under each with the same four executed within isolates showed the fourfold improvement one would hope for. Another example showed how performance gains could be had by partitioning a text search through files across isolates.
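The partitioned text search can be sketched roughly as follows. This is a Python analogue with threads standing in for isolates and in-memory strings standing in for files; the function names and the round-robin partitioning are assumptions, not SimCorp's code.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(files, needle):
    """Return (name, line_number) for every line containing `needle`."""
    hits = []
    for name, text in files:
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((name, number))
    return hits

def parallel_search(documents, needle, workers=4):
    """Split `documents` (name -> text) into partitions and search each one."""
    items = sorted(documents.items())
    partitions = [items[i::workers] for i in range(workers)]   # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_partition, partitions, [needle] * workers)
    return sorted(hit for part in results for hit in part)

docs = {"a.txt": "foo\nbar", "b.txt": "bar\nbaz bar", "c.txt": "none"}
print(parallel_search(docs, "bar", workers=2))
```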
Both isolates and external workspaces are experimental technologies. The biggest problem has been finding good business cases to work with and the time to do it, but Stig was in little doubt that there were benefits to be gained from parallel concepts.
Presentation 3: Managing slave tasks in APL+Win
By Stephen Taylor
Stephen described SUPERVAL, the application he was working with, and the small team supporting it. The application started as the private work of a domain expert, with the purpose of evaluating the liabilities of defined pension funds. Stephen’s role is that of technical expert, with a brief to introduce new technologies and keep things moving without ‘frightening’ his colleagues, in a way they can work with.
SUPERVAL has very long, serial and repetitive calculations: ideal candidates for parallelisation. The work carried out and demonstrated was a proof of concept, to see what could be done with the minimum of work in the simplest possible code.
The process targeted was one where a valuation calculation writes values to a CSV file that may be 2 GB or more in size; a subsequent keyed summation generates a smaller results file of around 1 MB; and these results are then written to a database. The CSV files consist of two columns: an encoded multiple key and a value.
The approach adopted was to take the summarisation out of the main process and carry it out as a parallel operation, controlled through a master-slave configuration. Because of the data volumes, and for robustness, the data would be passed as CSV. The APL+Win OCX controls for running a slave task were deemed unsuitable for the data volumes involved, but would be fine as a control channel.
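A keyed summation over a two-column CSV can be sketched in Python as below. The key format in the sample (`A01|X`) is invented for illustration, since the actual encoding was not described, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def keyed_sum(lines):
    """Sum the value column for each distinct key, reading one row at a time."""
    totals = defaultdict(float)
    for key, value in csv.reader(lines):
        totals[key] += float(value)
    return dict(totals)

# A StringIO stands in for the large CSV file; an open file object works the same way.
sample = io.StringIO("A01|X,10.5\nB02|Y,4.0\nA01|X,2.5\n")
print(keyed_sum(sample))   # {'A01|X': 13.0, 'B02|Y': 4.0}
```

Because rows are consumed one at a time, memory use depends on the number of distinct keys rather than on the size of the input file.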
TCP/IP was considered but rejected because, despite the promised speed and connectivity features, this would have required a lot of work to make the process robust, particularly with the data volumes involved.
The master launches the slave, which has no latent expression, so it can be multi-functional: the master controls what is executed. The OCX control gives the master task hands-on control of the slave, which in turn makes debugging easier.
In operation, the master keeps writing while the slave keeps reading and performing the summation. The master sets a flag when writing is complete and signals the slave to finish. When the slave task has concluded the summation and written the results, the master can simply kill it off. Should the master fail, the OS will kill the slave, so no orphan processes are left lying around.
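The write-while-reading handshake can be sketched as below. This is a Python stand-in under loud assumptions: a thread plays the slave task, a `threading.Event` plays the control channel that the OCX provided, and all names are hypothetical. Only the coordination pattern (data through a CSV file, completion through a separate flag) follows the description.

```python
import os
import tempfile
import threading
import time

def slave(path, done, totals):
    """Tail the CSV file, summing values by key until the master signals completion."""
    pending = ""
    with open(path, "r", newline="") as f:
        while True:
            finished = done.is_set()        # sample the flag BEFORE reading
            chunk = f.readline()
            if chunk:
                pending += chunk
                if pending.endswith("\n"):  # only process complete lines
                    key, value = pending.rstrip("\n").split(",")
                    totals[key] = totals.get(key, 0.0) + float(value)
                    pending = ""
                continue
            if finished:                    # flag was set and nothing is left to read
                break
            time.sleep(0.01)                # wait for the master to write more

def run_master_slave(rows):
    """Master: create the file, start the slave, write rows, then raise the flag."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "values.csv")
        done = threading.Event()
        totals = {}
        with open(path, "w", newline="") as out:
            worker = threading.Thread(target=slave, args=(path, done, totals))
            worker.start()
            for key, value in rows:
                out.write(f"{key},{value}\n")
                out.flush()
        done.set()                          # writing is complete
        worker.join()                       # slave drains the file and stops
    return totals

print(run_master_slave([("A", 1.0), ("B", 2.0), ("A", 3.5)]))
```

Sampling the flag before each read avoids the race in which the master writes its last line and sets the flag between the slave's empty read and its check.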
A short practical demonstration followed of the process in action. Stephen concluded that if the trials proved successful then a much larger implementation was planned.
Presentation 4: Synchronising Sofia
By Klaus Klug Christiansen, APL Italiana
Klaus described having a WOW experience when finding how easy it was to handle tokens in Dyalog APL. But before going into that he outlined other strategies he had tried for multi-threaded coding.
Writing one-liners kept the operations atomic and avoided thread switching, but this approach is difficult to debug and an undesirable way to write large applications. Using :Hold token_id was a possibility, but you quickly realise that this serialises execution so that only one thread is running. Using tokens to manage threads was difficult: handling the token pool is tricky, and it is not easy to track what is in use and what can be recycled.
The WOW moment came on finding token-handling software in Sofia that controls token use and recycling, and it was this strategy that Klaus outlined.
The ‘Synchronisation tool box’ handles the token pool and provides the tools for the programmer to handle synchronisation. There are different types: latches, gates, synchronisation objects, FIFO queues and read/write locks. These types are allocated to ranges of token numbers.
With the ‘latch’, a worker thread launches another thread, which enters a loop and waits on the latch object. The parent thread ‘opens’ the latch and the child performs an operation, then waits on the open latch. The ‘gate’ is a latch where the worker thread continues to run until the parent thread closes the gate. Tokens can contain data, so it is easy to make a ‘synchronisation object’: like the latch it is a single operation, but once it is complete the data is returned, so synchronisation in the parent is easy. The token pool is ‘FIFO’, so a queue is easy to manage as a set of multiple positive tokens which are handled sequentially. ‘Read/write’ locks are handled by pairs of negative and positive tokens that switch between wait and release states for read and write operations.
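The latch and gate behaviours can be approximated with standard Python threading primitives. This is a loose analogue, not the Sofia tool box and not Dyalog's token functions: `threading.Event` stands in for the token, and the function names are made up for the sketch.

```python
import queue
import threading

def run_latch_demo():
    """Latch: the child waits until the parent opens the latch, then runs once."""
    latch = threading.Event()
    log = []

    def child():
        latch.wait()                      # block until the latch is opened
        log.append("child ran")

    thread = threading.Thread(target=child)
    thread.start()
    log.append("parent opens latch")
    latch.set()
    thread.join()
    return log

def run_gate_demo(items):
    """Gate: the worker keeps processing until the parent closes the gate."""
    gate_open = threading.Event()
    gate_open.set()
    work = queue.Queue()
    done = []

    def worker():
        # Keep going while the gate is open, then drain whatever is still queued.
        while gate_open.is_set() or not work.empty():
            try:
                done.append(work.get(timeout=0.01) * 2)
            except queue.Empty:
                pass

    thread = threading.Thread(target=worker)
    thread.start()
    for item in items:
        work.put(item)
    gate_open.clear()                     # parent closes the gate
    thread.join()
    return done

print(run_latch_demo())
print(run_gate_demo([1, 2, 3]))
```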
Presentation 5: Promises in APL
By Gilgamesh Athoraya, Data Analytics Sweden
The promise object also provides a mechanism to chain calls using a dyadic instance operator. The operator queues functions to be called once the promise has resolved. It also queues error handlers that are invoked when a function call signals an error.
A discussion followed in which Morten from Dyalog suggested a variation on the ampersand operator that could return a future instead of the thread number, as it does now. He also talked about the ability to create immediate futures, without necessarily linking them to function calls. The future could then be resolved independently at a later stage.
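The chaining behaviour can be illustrated with a deliberately tiny, synchronous Python sketch. It is not the APL implementation presented: the class, the `then` method (standing in for the dyadic operator) and the recovery rule are assumptions chosen to keep the example short.

```python
class Promise:
    """Queue functions and error handlers, then run them when the promise resolves."""

    def __init__(self):
        self._steps = []                  # queued (on_value, on_error) pairs

    def then(self, on_value, on_error=None):
        self._steps.append((on_value, on_error))
        return self                       # returning self allows chaining

    def resolve(self, value):
        error = None
        for on_value, on_error in self._steps:
            try:
                if error is None:
                    value = on_value(value)
                elif on_error is not None:
                    value, error = on_error(error), None   # handler recovers
            except Exception as exc:      # a step signalled an error
                error = exc
        if error is not None:
            raise error                   # no handler was queued after the failure
        return value

chain = Promise()
chain.then(lambda x: x + 1).then(lambda x: 10 // (x - 3)).then(
    lambda x: x * 2, on_error=lambda e: -1)
print(chain.resolve(4))   # 10
print(chain.resolve(2))   # -1: the division step fails and the handler recovers
```

In this sketch a step without a handler is skipped while an error is pending, so one handler near the end of a chain can catch failures from any earlier step.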
Polyominoes in APL
By John Niss Hansen
John Niss Hansen demonstrated his tool for building and solving polyominoes, which he has been working on since 1996. He has written the entire app in Dyalog APL, including the rendering of the puzzles in 3D space.
The application is a perfect candidate for parallel execution, as it explores a large number of combinations to solve a given puzzle. John showed us how his application lets the user navigate and interact while queries are run in separate threads.
Although difficult to visualize, the application handles puzzles in higher dimensions.
Workshop: Isolates & Futures
With Morten Kromberg, Dyalog
Morten described deterministic parallelism, where inserting or removing parallel operators has no effect on the notation. But watch out for side effects and errors: if a future is not consumed, errors in the execution of its associated isolate will not be revealed.
A practical demonstration followed, showing a variety of ways of creating isolates with isolate.New: from an empty namespace, then assigning values or fixing functions into the isolate; from a namespace already populated with functions and variables; by passing a list of names; or with a simple string, which is assumed to be a workspace name.
Morten illustrated creating an array of isolates: he split a matrix, used a distributive assignment to pass the partitioned values into the isolates, then executed an expression to return a result from each isolate.
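Python's standard futures behave in a comparable way, which makes the warning easy to demonstrate. The sketch below is an analogy using `concurrent.futures`, not Dyalog isolates; `risky` is a made-up function.

```python
from concurrent.futures import ThreadPoolExecutor

def risky(x):
    return 1 / x                          # raises ZeroDivisionError when x == 0

with ThreadPoolExecutor(max_workers=1) as pool:
    ignored = pool.submit(risky, 0)       # fails in the worker; nothing surfaces here
    checked = pool.submit(risky, 0)

# Leaving the block waited for both tasks, yet no exception has been raised so far.
try:
    checked.result()                      # consuming the future re-raises the error
    outcome = "no error"
except ZeroDivisionError:
    outcome = "error surfaced on .result()"
print(outcome)
```

The `ignored` future failed in exactly the same way, but because its result is never requested the failure passes silently.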
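The split, distribute and evaluate sequence can be mimicked in Python as follows. The `Worker` class is a hypothetical stand-in for an isolate and the row-sum expression is chosen arbitrarily; none of this is the Dyalog isolate API.

```python
from concurrent.futures import ThreadPoolExecutor

class Worker:
    """Hypothetical stand-in for an isolate: holds its own data, evaluates on request."""

    def __init__(self):
        self.data = None

    def row_sums(self):
        return [sum(row) for row in self.data]

def distribute(matrix, workers):
    """Give each worker a contiguous block of rows."""
    size = -(-len(matrix) // len(workers))          # ceiling division
    for i, worker in enumerate(workers):
        worker.data = matrix[i * size:(i + 1) * size]

def gather_row_sums(workers):
    """Run the same expression in every worker and join the results in order."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(w.row_sums) for w in workers]
        return [total for f in futures for total in f.result()]

matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
pool_of_workers = [Worker() for _ in range(2)]
distribute(matrix, pool_of_workers)
print(gather_row_sums(pool_of_workers))   # [3, 7, 11, 15]
```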
He showed that expressions returning futures are only executed when the future is referenced. The ‘futures’ of distributed calls to the set of isolates (e.g. z←iss.⎕DL 3 6 9) could be assigned to a variable and its structure examined. When the values were retrieved, however, it became clear that the overall wait depended on the longest delay, although the individual futures are independent.
Using a more substantial calculation, Morten contrasted parallel each with regular each, showing that with the former it was possible to reach 100% CPU usage, as all processors were used by the isolates.
isolate.Config revealed settable options such as the number of ports and processors, the on-error action and so on, including a start-up workspace, the default being the isolate workspace distributed with Dyalog.
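The effect of the longest delay can be reproduced with a Python analogue of the ⎕DL example, using shorter delays. Note the assumption: in Python's `concurrent.futures` a task starts as soon as it is submitted and the caller only blocks on `.result()`, which may differ in detail from the isolate behaviour shown in the session.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def delay(seconds):
    time.sleep(seconds)
    return seconds

def timed_delays(delays):
    """Submit one delay per worker; report the results and two timings."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(delay, d) for d in delays]
        submitted = time.perf_counter() - start      # time to hand out the work
        results = [f.result() for f in futures]      # blocks until each is ready
    return results, submitted, time.perf_counter() - start

results, submitted, elapsed = timed_delays([0.03, 0.06, 0.09])
print(results, f"submitted after {submitted:.3f}s, all retrieved after {elapsed:.3f}s")
```

Submitting returns almost at once, while retrieving every value takes about as long as the longest delay rather than the sum of all three.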