[Interest] Cross platform accelerated instructions framework

Thu May 14 17:13:36 CEST 2015

Thanks for all the insights so far. 

I confess I'm a newbie regarding performance optimisation. I'm writing a synthesiser. It computes audio in real time, and it is already getting heavy with 8 voices of polyphony. It pushes an iPad 2 to its limit, which is not good, especially because the audio starts glitching when I interact with the user interface.

I think there is a lot of room for optimisation, especially in how the code is structured, but I'm not sure.

Maybe I should first try to get maximum performance out of the compiler alone. I have a lot of encapsulation and I'm not sure whether that is good for optimisation. For example, the following function calculates the output of one of the synthesiser voices. Sorry for the long code listing, but maybe someone can point out basic mistakes I'm making that completely defeat the compiler's optimisations.

Of course, for vectorisation I will need to identify opportunities and refactor the data structures to make it possible. But who knows, maybe I'm doing terrible things that are costing me a good number of important CPU cycles?

(This is by far the longest function in the whole program.)

// typedef float IAudioSample

IAudioSample IBasicSynthVoice::step()
{
    IAudioSample output=0;
    IAudioSample filterModulation=0;
    IAudioSample pitchModulationSum=0;
    IAudioSample tmp1=0,tmp2=0;

    float eg1 = _eg[0].step();
    float eg2 = _eg[1].step();

    eg1 += eg1*_modWheelMultiplier[MODWHEEL_EG_1];
    eg2 += eg2*_modWheelMultiplier[MODWHEEL_EG_2];

    // applying pitch modulation
    switch (_pitchModulationSource[0])
    {
        case 1:
            tmp1 += _lfo1;
            tmp1 += _lfo1;
            break;
        case 2:
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            break;
        case 3:
            tmp1 += _lfo1;
            tmp1 += _lfo1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            tmp1 += eg1;
            break;
        default:
            break;
    }

    tmp1 *= _pitchModulationAmount[0];

    switch (_pitchModulationSource[1])
    {
        case 1:
            tmp2 += _lfo2;
            tmp2 += _lfo2;
            break;
        case 2:
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            break;
        case 3:
            tmp2 += _lfo2;
            tmp2 += _lfo2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            tmp2 += eg2;
            break;
        default:
            break;
    }

    tmp2 *= _pitchModulationAmount[1];

    pitchModulationSum = (tmp1+tmp2)/12.f;
    pitchModulationSum *= _noteFrequency;

    float _osc1PitchModulation = 0;
    float _osc2PitchModulation = 0;
    float _subPitchModulation = 0;

    switch (_pitchModulationDestination)
    {
        case 1:
            _osc1PitchModulation += pitchModulationSum;
            break;
        case 2:
            _osc2PitchModulation += pitchModulationSum;
            break;
        case 3:
            _osc1PitchModulation += pitchModulationSum;
            _osc2PitchModulation += pitchModulationSum;
            break;
        case 4:
            _subPitchModulation += pitchModulationSum;
            break;
        case 5:
            _osc1PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
        case 6:
            _osc2PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
        case 7:
            _osc1PitchModulation += pitchModulationSum;
            _osc2PitchModulation += pitchModulationSum;
            _subPitchModulation += pitchModulationSum;
            break;
    }

    if (_pitchBendDestination[PITCHBEND_OSC_1])
    {
        _osc1PitchModulation += _pitchBendMultiplier*_noteFrequency;
        _subPitchModulation += _pitchBendMultiplier*_noteFrequency;
    }

    if (_pitchBendDestination[PITCHBEND_OSC_2])
    {
        _osc2PitchModulation += _osc2.frequency()*_pitchBendMultiplier;
    }

    _osc1.setModulation(_osc1PitchModulation);
    _osc2.setModulation(_osc2PitchModulation);
    _sub.setModulation(_subPitchModulation);

    float sub = _sub.step();
    float osc1 = _osc1.step();
    float osc2 = _osc2.step();

    if (_osc2Sync && _osc1.sync())
        _osc2.setPhase(0);

    float ring = osc1*osc2;

    // FM
    //_osc1.setModulation(osc2*_crossModulationAmount*2500);

    // mixer
    output = (ring*_ringAmount);
    output += (osc1*_osc1Volume);
    output += (osc2*_osc2Volume);
    output += (sub*_subVolume);
    output += (_noise);

    _saturator.process(&output, &output);

    //calculateFilterModulation(eg2, osc2);
    // applying filter modulation

    // modulation amount - eg2
    filterModulation += eg2*_filterModulationAmount[0]*(1+_filterModulationAmount[5]*_velocity);

    // filter modulation amount - 1 - lfo1
    filterModulation += _lfo1*_filterModulationAmount[1]*_filterModulationAmount[1];

    // filter modulation amount - 2 - lfo2
    filterModulation += _lfo2*_filterModulationAmount[2]*_filterModulationAmount[2];

    // filter modulation amount - 3 - vco2
    filterModulation += osc2*_filterModulationAmount[3];

    // filter modulation amount - 4 - kbd

    if (_pitchBendDestination[PITCHBEND_FILTER])
        filterModulation += (powf(2, _pitchBendRange*_pitchBend)-1);

    _filter.setKeyboardMultiplier(_kbdFilter);
    _filter.setModulation(filterModulation);

    _filter.process(&output, &output);

    // filter modulation amount - 5 - vel

    // vca modulation - eg1, eg2, kbd
    output *= (eg1*_ampModulationAmount[2]+eg2*_ampModulationAmount[3])*_kbdFilter;
    output *= (1+_ampModulationAmount[5]*_velocity);

    float ampModulationSum = 0;

    // vca modulation - lfo1
    ampModulationSum += _lfo1*_ampModulationAmount[0];

    // vca modulation - lfo2
    ampModulationSum += _lfo2*_ampModulationAmount[1];

    if (ampModulationSum>1.5)
        ampModulationSum=1.5;

    if (ampModulationSum<-1.5)
        ampModulationSum=-1.5;

    output -= output*ampModulationSum;

    return output;
}
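
One thing that stands out in the listing: the case values in the three switch statements look like bit masks (for the sources, 1 = LFO, 2 = EG, 3 = both; for the destination, 1 = osc1, 2 = osc2, 4 = sub). If that reading is right, the switches could perhaps be replaced by conditional assignments. The following is only a sketch under that assumption. The helper names and the PitchModulationTargets struct are hypothetical and not part of the class above, and the results can differ from the repeated additions by floating-point rounding.

```cpp
// Hypothetical helper, not part of the original class.
// Assumption: source is always in 0..3 and acts as a bit mask
// (bit 0 = LFO contributes 2x, bit 1 = EG contributes 4x).
inline float pitchModulationFromSource(int source, float lfo, float eg)
{
    const float lfoPart = (source & 1) ? 2.0f * lfo : 0.0f;
    const float egPart  = (source & 2) ? 4.0f * eg  : 0.0f;
    return lfoPart + egPart;
}

// Hypothetical container for the three per-oscillator modulation values.
struct PitchModulationTargets
{
    float osc1;
    float osc2;
    float sub;
};

// Assumption: destination is always in 0..7 and acts as a bit mask
// (bit 0 = osc1, bit 1 = osc2, bit 2 = sub).
inline PitchModulationTargets routePitchModulation(int destination, float sum)
{
    PitchModulationTargets t;
    t.osc1 = (destination & 1) ? sum : 0.0f;
    t.osc2 = (destination & 2) ? sum : 0.0f;
    t.sub  = (destination & 4) ? sum : 0.0f;
    return t;
}
```

With helpers like these, tmp1 would become pitchModulationFromSource(_pitchModulationSource[0], _lfo1, eg1) * _pitchModulationAmount[0]. Note that the original switches ignore out-of-range values through their default case, while the masked version does not, so it only matches if the values really stay in range.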

Nuno Santos

> On 14 May 2015, at 13:52, Allan Sandfeld Jensen <kde at carewolf.com> wrote:
> 
> To write in a way that the compiler can auto-vectorize, write the CPU 
> intensive work in simple inner loops without function calls (or only inlined 
> ones), use no array access by anything other than the index counter, and also 
> avoid branches as much as possible. If you do need branches, write them as 
> using conditional assign with c ? a : b.
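
To make the quoted advice concrete, here is a rough sketch of what the final VCA stage of step() might look like if it ran over a block of samples instead of one sample per call. It is an illustration only: the function name applyAmpModulationBlock, the per-sample LFO buffers and the parameter names are assumptions, not existing code. The loop indexes every array by the loop counter only, calls no functions, and writes the clamp as conditional assignments.

```cpp
#include <cstddef>

// Hypothetical block-based version of the LFO amplitude modulation at the
// end of step(). Assumes the LFO values are available as per-sample buffers.
void applyAmpModulationBlock(float* samples,
                             const float* lfo1,
                             const float* lfo2,
                             std::size_t count,
                             float lfo1Amount,
                             float lfo2Amount)
{
    const float limit = 1.5f; // same clamp range as in step()

    for (std::size_t i = 0; i < count; ++i)
    {
        float amp = lfo1[i] * lfo1Amount + lfo2[i] * lfo2Amount;

        // Clamp written as conditional assignments instead of if statements.
        amp = (amp > limit) ? limit : amp;
        amp = (amp < -limit) ? -limit : amp;

        samples[i] -= samples[i] * amp;
    }
}
```

Whether a loop like this actually gets vectorised depends on the compiler, the optimisation flags and the target, so it is worth checking the compiler's vectorisation report or the generated assembly rather than assuming it.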
